CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

umi-tools extract #527

Closed wangjiawen2013 closed 2 years ago

wangjiawen2013 commented 2 years ago

Hi, this is my output of umi_tools extract: [image: 1649989047]

The input reads equal "regex matches read1", while "Reads output" is lower than that and "Filtered cell barcode" is lower than "Reads output". Could you explain the difference between "regex matches read1", "Reads output" and "Filtered cell barcode"?

TomSmithCGAT commented 2 years ago

regex matches read1 = Number of reads where the regex supplied to identify the cell barcode and umi from the read matches the read sequence.

In your case, the regex matches every single input read, which suggests to me you might not need to use a regex at all and a quicker string pattern may suffice. What was the regex you used?

Reads output = The number of reads output from extract. For a read to be output, its cell barcode must be in the whitelist, hence reads output is lower than input reads.

Filtered cell barcode = The number of reads which were filtered (i.e. not output) because their cell barcode did not match the whitelist. Reads output + Filtered cell barcode = regex matches read1
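To make the relationship between the three counters concrete, here is a minimal Python sketch of the bookkeeping described above. It is not the UMI-tools implementation: the pattern, the whitelist, the reads and the `count_reads` helper are all made up for illustration, and only the counter names are borrowed from the log.

```python
import re

# Hypothetical pattern and whitelist, for illustration only.
PATTERN = re.compile(r"^(?P<cell_1>.{4})(?P<umi_1>.{2})")
WHITELIST = {"AAAA", "CCCC"}


def count_reads(reads, pattern=PATTERN, whitelist=WHITELIST):
    """Tally reads the way the three log counters are described above."""
    counts = {
        "input reads": 0,
        "regex matches read1": 0,
        "Reads output": 0,
        "Filtered cell barcode": 0,
    }
    for seq in reads:
        counts["input reads"] += 1
        match = pattern.match(seq)
        if match is None:
            # The regex does not match: the read never reaches the whitelist check.
            continue
        counts["regex matches read1"] += 1
        if match.group("cell_1") in whitelist:
            counts["Reads output"] += 1
        else:
            counts["Filtered cell barcode"] += 1
    return counts


# Four toy reads; the third carries a barcode that is not in the whitelist.
reads = ["AAAAGGTTTT", "CCCCGGTTTT", "GGGGGGTTTT", "AAAATTTTTT"]
counts = count_reads(reads)
print(counts)
```

Every read that matches the regex ends up in exactly one of the two remaining counters, which is why they sum to "regex matches read1".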

wangjiawen2013 commented 2 years ago

umi_tools extract --stdin in.fq --stdout out.fq --extract-method=regex \
  --bc-pattern='^(?P<cell_1>.{8})(?P<cell_2>.{8})(?P<umi_1>.{4}).{40}(?P<umi_2>.{4})(?P<discard_1>.+)' \
  --log2stderr --whitelist=whitelist.txt 2> log.txt

My FASTQ read structure: 8 bp (barcode1) + 8 bp (barcode2) + 4 bp (umi1) + 40 bp (target sequence) + 4 bp (umi2) + others
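As a rough illustration of how a pattern with this layout carves up a read, here is a small Python sketch using the standard `re` module. The group names (`cell_1`, `cell_2`, `umi_1`, `umi_2`, `discard_1`) are an assumption based on the UMI-tools naming convention for regex groups, and the `split_read` helper and the toy read are hypothetical. Joining the two cell groups and the two UMI groups in order is likewise an assumption, and the sketch does not model how UMI-tools writes the output read.

```python
import re

# Assumed group names, following the cell_N / umi_N / discard_N convention.
BC_PATTERN = re.compile(
    r"^(?P<cell_1>.{8})(?P<cell_2>.{8})(?P<umi_1>.{4})"
    r".{40}(?P<umi_2>.{4})(?P<discard_1>.+)"
)


def split_read(seq):
    """Return the pieces each named group captures, or None if there is no match."""
    m = BC_PATTERN.match(seq)
    if m is None:
        return None
    return {
        # Assumption: cell and UMI groups are concatenated in order.
        "cell": m.group("cell_1") + m.group("cell_2"),
        "umi": m.group("umi_1") + m.group("umi_2"),
        # The 40 unnamed bases sit between the end of umi_1 and the start of umi_2.
        "target": seq[m.end("umi_1"):m.start("umi_2")],
        "discarded": m.group("discard_1"),
    }


# Toy read: 8 + 8 + 4 + 40 + 4 bases followed by 6 trailing bases.
read = "A" * 8 + "C" * 8 + "GGTT" + "T" * 40 + "ACGT" + "NNNNNN"
parts = split_read(read)
print(parts)

# A read with nothing after umi2 (64 bases) does not match, because `.+` needs at least one base.
too_short = split_read(read[:64])
```

Since the pattern only constrains lengths, any read of at least 65 bases matches it, which is consistent with every input read being counted under "regex matches read1".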

TomSmithCGAT commented 2 years ago

Ah, if you need to discard bases after umi2, you will need to use a regex after all. Currently the string extraction method doesn't support discarding bases.

Did the above explanations all make sense?

wangjiawen2013 commented 2 years ago

Yes, thank you!