CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Custom whitelist #525

Closed MEFarhadieh closed 2 years ago

MEFarhadieh commented 2 years ago

Thanks for this great tool!

I have a list of final barcodes for my cell types of interest. How can I use this list instead of UMI-whitelist.txt as input to the extraction step? And how should this txt file be formatted?

wangjiawen2013 commented 2 years ago

I have the same question!

IanSudbery commented 2 years ago

You can just provide your own file to the --whitelist parameter of extract. The format of the file is described here: https://umi-tools.readthedocs.io/en/latest/reference/extract.html#whitelist
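For anyone building the file by hand, here is a minimal sketch of writing a custom whitelist as plain text. It assumes the format from the linked docs as I read them: one accepted cell barcode per line in the first tab-separated column, with the second column optional. The function names `format_whitelist` and `write_whitelist` are made up for illustration and are not part of UMI-tools.

```python
def format_whitelist(barcodes):
    """Return whitelist text with one cell barcode per line (first column only).

    Assumption: a single-column file is accepted by `extract --whitelist`,
    as described in the linked documentation.
    """
    seen = set()
    lines = []
    for raw in barcodes:
        # Assumption: barcodes should be upper case, like the read sequences.
        barcode = raw.strip().upper()
        if not barcode or barcode in seen:
            continue  # skip blank entries and duplicates
        unexpected = set(barcode) - set("ACGTN")
        if unexpected:
            raise ValueError(f"unexpected characters in barcode {barcode!r}")
        seen.add(barcode)
        lines.append(barcode)
    return "".join(line + "\n" for line in lines)


def write_whitelist(barcodes, path):
    """Write the whitelist as plain text with Unix line endings."""
    with open(path, "w", newline="\n") as handle:
        handle.write(format_whitelist(barcodes))


# Example usage (hypothetical file name):
# write_whitelist(["AAAAAA", "AAAATC", "AAACAT"], "whitelist.txt")
```

Writing the text file directly avoids depending on how a spreadsheet program exports its columns.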

MEFarhadieh commented 2 years ago

Thank you so much!

I thought it had to be in a 4-column tab-separated format.

wangjiawen2013 commented 2 years ago

I used a one-column Excel file: the barcodes were in the first column and there were no other columns. Then umi-tools didn't run correctly: the output was redirected to the whitelist and the resulting fastq file was empty. I then filled the fourth column with all "1" and left the second and third columns empty, and it worked.

wangjiawen2013 commented 2 years ago

Another related question: I can now use my customized whitelist, but my barcodes are specifically designed so that any two of them differ by at least 2 bp. I therefore want to allow a 1 bp mismatch to the barcodes in the whitelist. How can I do that?

TomSmithCGAT commented 2 years ago

@wangjiawen2013 - I've addressed the question above in a separate issue, #529
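For readers who land here first, one possible approach (a sketch only, and not necessarily what #529 recommends) is to list every 1-mismatch variant of each barcode in the optional second whitelist column. This assumes, as I read the docs, that column 2 is a comma-separated list of barcodes to be corrected to column 1 and that it is used when `--error-correct-cell` is passed to `extract`. The helper names below are hypothetical. Because barcodes that differ by exactly 2 bp share some 1-mismatch variants, the sketch drops any variant that could belong to more than one barcode.

```python
from collections import defaultdict


def one_mismatch_neighbours(barcode, alphabet="ACGTN"):
    """Yield every sequence that differs from `barcode` at exactly one position."""
    for i, base in enumerate(barcode):
        for alt in alphabet:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def whitelist_with_corrections(barcodes, alphabet="ACGTN"):
    """Return two-column whitelist text: barcode, TAB, comma-separated variants.

    Assumption: column 2 lists barcodes to be corrected to column 1.
    Variants within one mismatch of two or more barcodes are ambiguous
    and are left out.
    """
    barcodes = list(dict.fromkeys(barcodes))  # de-duplicate, keep input order
    known = set(barcodes)

    # Map each variant to the set of whitelist barcodes it could have come from.
    owners = defaultdict(set)
    for barcode in barcodes:
        for variant in one_mismatch_neighbours(barcode, alphabet):
            if variant not in known:
                owners[variant].add(barcode)

    # Keep only variants with a single possible parent barcode.
    corrections = defaultdict(list)
    for variant, parents in owners.items():
        if len(parents) == 1:
            corrections[next(iter(parents))].append(variant)

    lines = []
    for barcode in barcodes:
        variants = sorted(corrections.get(barcode, []))
        if variants:
            lines.append(barcode + "\t" + ",".join(variants))
        else:
            lines.append(barcode)
    return "\n".join(lines) + "\n"
```

The default alphabet includes N so that reads with a single uncalled base in the barcode can also be corrected; pass `alphabet="ACGT"` to leave those out.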

IanSudbery commented 2 years ago

@wangjiawen2013

I am a bit confused about what is happening here:

Then umi-tools didn't run correctly: the output was redirected to the whitelist and the resulting fastq file was empty.

When you say "the output was redirected to the whitelist", do you mean that your whitelist file was overwritten?

Can you show me the complete command you used?

wangjiawen2013 commented 2 years ago

umi_tools extract --stdin in.fq --stdout out.fq.gz --extract-method=regex \
    --bc-pattern='^(?P.{8})(?P.{8})(?P.{4}).{40}(?P.{4})(?P.+)' \
    --log2stderr --whitelist=whitelist.txt 2> {log}

The resulting out.fq.gz was empty, and all the reads were in whitelist.txt! I tried many times and this was indeed the case. Then I changed the whitelist format as described above, and everything went well.

wangjiawen2013 commented 2 years ago

Besides, umi_tools does not work if there is a "/" (or "\", I forget which one) in the fastq read names, such as:

@abcdejfkalfjl:sjfkd \c 123456
ATGCATGCATGCATGC.......
!
IC?CCCIIIIIIIII................

In this case, the barcode and UMI are extracted successfully, but the reads are discarded when counting, so all the counts are zero.

IanSudbery commented 2 years ago

Your problem with the whitelist is very peculiar, and I can't work out how that would be possible. I'm guessing the problem with "/" in the read name arises because this is often used to denote read 1 or read 2 in a read name:

@abcdejfkalfjl:sjfkd /1
ATGCATGCATGCATGC.......
!
IC?CCCIIIIIIIII................
@abcdejfkalfjl:sjfkd /2
ATGCATGCATGCATGC.......
!
IC?CCCIIIIIIIII................

Although I can't quite see how that would lead to the reads being dropped.
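To narrow this down, it may help to check which read names actually contain a slash or backslash. The following is a rough sketch, assuming unwrapped FASTQ with exactly four lines per record; `suspect_read_names` is a made-up helper, not a UMI-tools function.

```python
import re

# Matches a forward slash or a backslash anywhere in the read name.
SUSPECT = re.compile(r"[/\\]")


def suspect_read_names(fastq_lines):
    """Return the header lines that contain "/" or "\\".

    Assumption: four lines per record, so every fourth line starting at
    line 0 is a header. Sequence and quality lines are not inspected.
    """
    names = []
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:
            continue
        name = line.rstrip("\n")
        if not name.startswith("@"):
            raise ValueError(f"line {index + 1} is not a FASTQ header: {name!r}")
        if SUSPECT.search(name):
            names.append(name)
    return names


# Example usage (hypothetical file name):
# with open("in.fq") as handle:
#     for name in suspect_read_names(handle):
#         print(name)
```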

TomSmithCGAT commented 2 years ago

Yeah, it doesn't make sense to me either. Extract drops the read pairs that don't match, which can be a problem, as per https://github.com/CGATOxford/UMI-tools/issues/325, but is easily solved. But @wangjiawen2013 is saying the extract step is OK, so I'm not sure why the issue is cropping up.