CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

umi extract #618

Closed robinycfang closed 9 months ago

robinycfang commented 11 months ago

Hi

I have paired-end (PE) reads, with UMIs on both the 5' and 3' ends of both reads.

  1. I assume I only need to focus on the 5' UMI of each read in the pair, since the 3' end will be low quality and get trimmed off when I preprocess the FASTQs?
  2. My UMIs are 3-mers or 4-mers, followed by a T. I understand that umi_tools can only handle UMIs of the same length, so I was thinking of retaining the T for 3-mers and discarding it for 4-mers, so that every UMI ends up as a 4-mer. Do you think this can work?
  3. I tried the above using a regex:

     umi_tools extract --extract-method=regex \
                       --bc-pattern="((?P<umi_1>^[ACGT]{3}[ACG])(?P<discard_1>T))|(?P<umi_2>^[ACGT]{3})" \
                       --bc-pattern2="((?P<umi_1>^[ACGT]{3}[ACG])(?P<discard_1>T))|(?P<umi_2>^[ACGT]{3})" \
                       -I test_R1.fastq.gz \
                       --read2-in=test_R2.fastq.gz \
                       --stdout=processed.1.fastq.gz \
                       --read2-out=processed.2.fastq.gz \
                       --log=processed.log

     but it gave me:

     TypeError: can only concatenate str (not "NoneType") to str

Any comments would be appreciated!

IanSudbery commented 10 months ago

I'm guessing you are getting this error because your regexes each contain two alternatives, and when one alternative matches you end up with a <umi_2> group without any <umi_1> group having been captured.
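As a rough illustration of that guess, the sketch below runs the pattern from the question through Python's standard `re` module and prints nothing; it only collects the named groups. Whichever alternative matches, the named groups belonging to the other alternative come back as `None`. The read sequences and the `join_umis` helper are hypothetical, and how umi_tools combines the groups internally is an assumption here, not something confirmed in this thread.

```python
import re

# Pattern copied from the question: the first alternative fills
# umi_1/discard_1, the second alternative fills umi_2.
PATTERN = re.compile(
    r"((?P<umi_1>^[ACGT]{3}[ACG])(?P<discard_1>T))|(?P<umi_2>^[ACGT]{3})"
)


def named_groups(read):
    """Return the named groups captured at the start of a read, or None."""
    match = PATTERN.match(read)
    return match.groupdict() if match else None


# Made-up read starting with a 4-mer UMI (ACGA) followed by T:
# the first alternative matches, so umi_2 is None.
four_mer = named_groups("ACGATGGGCC")

# Made-up read starting with a 3-mer UMI (ACG) followed by T:
# the second alternative matches, so umi_1 and discard_1 are None.
three_mer = named_groups("ACGTGGGCC")


def join_umis(groups):
    """Hypothetical helper: naively concatenate the two UMI groups."""
    return groups["umi_1"] + groups["umi_2"]
```

Calling `join_umis(four_mer)` raises a `TypeError` about concatenating `str` and `NoneType`, which is the same kind of failure reported in the question.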

I think for your use case, you can use something much simpler:

$ umi_tools extract --extract-method=regex \
                    --bc-pattern="(?P<umi_1>^[ACGT]{4})" \
                    --bc-pattern2="(?P<umi_1>^[ACGT]{4})" \
                    -I test_R1.fastq.gz \
                    --read2-in=test_R2.fastq.gz \
                    --stdout=processed.1.fastq.gz \
                    --read2-out=processed.2.fastq.gz \
                    --log=processed.log

or even:

$ umi_tools extract --extract-method=string \
                    --bc-pattern=NNNN \
                    --bc-pattern2=NNNN \
                    -I test_R1.fastq.gz \
                    --read2-in=test_R2.fastq.gz \
                    --stdout=processed.1.fastq.gz \
                    --read2-out=processed.2.fastq.gz \
                    --log=processed.log

should work.
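To make the effect of a fixed 4-base pattern concrete, here is a minimal sketch in plain Python of taking the first four bases of a read as the UMI. It is not the umi_tools implementation; `extract_fixed_umi` and the example reads are hypothetical. It suggests that a 3-mer UMI keeps its trailing T as the fourth base, while for a 4-mer UMI the T is not consumed and stays at the start of the remaining read.

```python
def extract_fixed_umi(read, umi_length=4):
    """Split a read into (umi, remainder) using its first umi_length bases."""
    if len(read) < umi_length:
        raise ValueError("read is shorter than the UMI length")
    return read[:umi_length], read[umi_length:]


# Made-up read: 3-mer UMI (ACG) + T + insert (GGGCC).
umi_3mer, rest_3mer = extract_fixed_umi("ACGTGGGCC")

# Made-up read: 4-mer UMI (ACGA) + T + insert (GGGCC).
umi_4mer, rest_4mer = extract_fixed_umi("ACGATGGGCC")
```

Here `umi_3mer` is `"ACGT"` with `"GGGCC"` left over, and `umi_4mer` is `"ACGA"` with `"TGGGCC"` left over, so under these assumptions the leading T on reads with 4-mer UMIs would need to be handled in a later trimming step if it matters.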

TomSmithCGAT commented 9 months ago

@robinycfang - Closing now due to inactivity