CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Read pairs do not match??? #431

Closed windtalker6 closed 4 years ago

windtalker6 commented 4 years ago

My data are paired-end (PE), with fq1 and fq2 files for each sample. Since the first 7 bp of each read are the UMI, I ran the extract command as follows:

umi_tools extract --extract-method=string --bc-pattern=NNNNNNN --bc-pattern2=NNNNNNN -I test.pe_1.fq.gz -S test.extract.pe_1.fq.gz --read2-in=test.pe_2.fq.gz --read2-out=test.extract.pe_2.fq.gz

the error is:

Traceback (most recent call last):
  File "/home/wubin/miniconda3/envs/umi_tools/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/wubin/miniconda3/envs/umi_tools/lib/python3.7/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/wubin/miniconda3/envs/umi_tools/lib/python3.7/site-packages/umi_tools/extract.py", line 397, in main
    read1s, read2s, strict):
  File "/home/wubin/miniconda3/envs/umi_tools/lib/python3.7/site-packages/umi_tools/umi_methods.py", line 123, in joinedFastqIterate
    (pair_id, read2.identifier.split()[0]))
ValueError: Read pairs do not match F300002275L1C001R0030000002/1 != F300002275L1C001R0030000002/2

As is usual for paired-end data, a read in fq1 is named "F300002275L1C001R0030000002/1" and its mate in fq2 is named "F300002275L1C001R0030000002/2". Is there anything wrong with that?

How can I solve this problem?

windtalker6 commented 4 years ago

The error occurred right at the beginning of the fq1 and fq2 files. The first read in fq1 is:

@F300002275L1C001R0030000002/1
CTTGCAAGTCACTAGAGTGGTGCAGCCTATTTTTTAAAAGTCGTGTGTGTCCTCTTACCCAGTACTTCCTCTTCATATGCACCTTCCGCGCTGCTACAGC
+
BEFEDF=FFDFAAFEC7D?FDDDF>BCFFFFFFFEDFFFDDEC<CEBFEDEFF?:FFE;EFEA=EFFDDFCEEEFFFFEDB7DFF:>EEEEFEDFAAD;E

The first read in fq2 is:

@F300002275L1C001R0030000002/2
CTTGCATTTACTGCAGGGGAAATAGTTGACATAAAGATGTACTTGCGTATTAGGCACTCCGATTTCAAAGATTTACTCGTATATTGGTCAAAGATATACT
+
EEEFEF'EEDEEFFBD>>A;DECFEDFE@EFEEFEE@EFAFDEEFCEDEFEFEADEEEDED9@DDFECEE<FCCFFFF<>EEFFEF<EE@FFACFFFDFF

All my PE data look like this. Why does umi_tools report "Read pairs do not match"?

windtalker6 commented 4 years ago

In conclusion, my questions are as follows:

  1. I added the arguments --bc-pattern=NNNNNNN --bc-pattern2=NNNNNNN because the first 7 bp are UMIs in both fq1 and fq2. Is there anything wrong with these arguments?

  2. What does "Read pairs do not match" mean? Should I modify fq1 and fq2 before running umi_tools extract?

TomSmithCGAT commented 4 years ago

Hi @windtalker6. The error occurs because the read names carry /1 and /2 suffixes. If you install UMI-tools from the master branch, there's an option --ignore-read-pair-suffixes to avoid this error (see https://github.com/CGATOxford/UMI-tools/pull/421). The next release of UMI-tools will include this option.
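
To make the cause concrete, here is a minimal Python sketch of the kind of check that raises this error. It is not the UMI-tools source: the traceback only shows that the comparison uses `identifier.split()[0]`, and the function names `pair_id` and `check_pair` and the `ignore_suffix` parameter are hypothetical stand-ins for what an option such as --ignore-read-pair-suffixes might do.

```python
def pair_id(identifier, ignore_suffix=False):
    """Return the part of a read name used for pairing.

    Only the text before the first whitespace is kept, mirroring the
    `identifier.split()[0]` call visible in the traceback.
    """
    name = identifier.split()[0]
    # Assumption: ignoring suffixes means dropping a trailing /1 or /2.
    if ignore_suffix and name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def check_pair(id1, id2, ignore_suffix=False):
    """Raise ValueError if the two read names do not identify the same pair."""
    a = pair_id(id1, ignore_suffix)
    b = pair_id(id2, ignore_suffix)
    if a != b:
        raise ValueError("Read pairs do not match %s != %s" % (a, b))
    return a


# With the suffixes stripped, the two mates compare equal.
print(check_pair("F300002275L1C001R0030000002/1",
                 "F300002275L1C001R0030000002/2",
                 ignore_suffix=True))
```

Without `ignore_suffix=True`, the same two names differ in their last character, so the comparison fails exactly as in the reported error.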

windtalker6 commented 4 years ago

Many thanks! So do I have to modify fq1 and fq2 before running umi_tools extract?

I tried replacing the "/" with a space " ", and then it ran without reporting an error.

for example:

Reads in fq1 are modified to look like this:

@F300002275L1C001R0030000002 1
CTTGCAAGTCACTAGAGTGGTGCAGCCTATTTTTTAAAAGTCGTGTGTGTCCTCTTACCCAGTACTTCCTCTTCATATGCACCTTCCGCGCTGCTACAGC
+
BEFEDF=FFDFAAFEC7D?FDDDF>BCFFFFFFFEDFFFDDEC<CEBFEDEFF?:FFE;EFEA=EFFDDFCEEEFFFFEDB7DFF:>EEEEFEDFAAD;E

and reads in fq2 like this:

@F300002275L1C001R0030000002 2
CTTGCATTTACTGCAGGGGAAATAGTTGACATAAAGATGTACTTGCGTATTAGGCACTCCGATTTCAAAGATTTACTCGTATATTGGTCAAAGATATACT
+
EEEFEF'EEDEEFFBD>>A;DECFEDFE@EFEEFEE@EFAFDEEFCEDEFEFEADEEEDED9@DDFECEE<FCCFFFF<>EEFFEF<EE@FFACFFFDFF

After extract, I get:

@F300002275L1C001R0030000002_CTTGCAACTTGCAT 1
GTCACTAGAGTGGTGCAGCCTATTTTTTAAAAGTCGTGTGTGTCCTCTTACCCAGTACTTCCTCTTCATATGCACCTTCCGCGCTGCTACAGC
+
FFDFAAFEC7D?FDDDF>BCFFFFFFFEDFFFDDEC<CEBFEDEFF?:FFE;EFEA=EFFDDFCEEEFFFFEDB7DFF:>EEEEFEDFAAD;E

and:

@F300002275L1C001R0030000002_CTTGCAACTTGCAT 2
TTACTGCAGGGGAAATAGTTGACATAAAGATGTACTTGCGTATTAGGCACTCCGATTTCAAAGATTTACTCGTATATTGGTCAAAGATATACT
+
EEDEEFFBD>>A;DECFEDFE@EFEEFEE@EFAFDEEFCEDEFEFEADEEEDED9@DDFECEE<FCCFFFF<>EEFFEF<EE@FFACFFFDFF

=======================================================================

You can see that "CTTGCAACTTGCAT" is "CTTGCAA" and "CTTGCAT" joined together. Is there an argument that adds a dash to separate the two?
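
As an illustration of what happened to the read name, the following Python sketch reproduces the joined UMI from the first 20 bases of the two reads shown above, then splits it back into its two halves. It only imitates the observed output and is not how UMI-tools is implemented; `extract_umi` and `split_joined_umi` are hypothetical helper names, and the fixed length of 7 comes from the NNNNNNN patterns in the command.

```python
def extract_umi(sequence, quality, umi_len=7):
    """Split the first umi_len bases off a read.

    Returns (umi, trimmed_sequence, trimmed_quality).
    """
    return sequence[:umi_len], sequence[umi_len:], quality[umi_len:]


def split_joined_umi(read_name, umi1_len=7, sep="_"):
    """Split 'name_UMI1UMI2' into (name, UMI1, UMI2).

    Assumes the joined UMI follows the last `sep` in the name and that
    the read 1 UMI is umi1_len bases long.
    """
    name, joined = read_name.rsplit(sep, 1)
    return name, joined[:umi1_len], joined[umi1_len:]


# First 20 bases and qualities of the two example reads from this thread.
umi1, rest1, qual1 = extract_umi("CTTGCAAGTCACTAGAGTGG", "BEFEDF=FFDFAAFEC7D?F")
umi2, rest2, qual2 = extract_umi("CTTGCATTTACTGCAGGGGA", "EEEFEF'EEDEEFFBD>>A;")

read_name = "F300002275L1C001R0030000002" + "_" + umi1 + umi2
name, first, second = split_joined_umi(read_name)
print(read_name)
print(first + "-" + second)
```

Because both UMIs have a known fixed length here, the joined string can always be split again downstream, even without a separator in the read name.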

TomSmithCGAT commented 4 years ago

You don't need to modify the fastqs, no. If you follow my suggestion, you can use the fastqs unmodified. As you've also found though, replacing the / with a space will also work, since any text after the first space in the read name is ignored.
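
For anyone who does want the space-based workaround, here is a rough Python sketch of the header rewrite. It assumes plain four-line FASTQ records and touches only the header line of each record, since "/" is also a valid quality character and a quality string could happen to end in "/1". The names `rewrite_header` and `rewrite_fastq` are made up for this example.

```python
import re

# A trailing /1 or /2 at the very end of a read name.
_SUFFIX = re.compile(r"/([12])$")


def rewrite_header(line):
    """Turn '@name/1' into '@name 1'; other text is returned unchanged."""
    return _SUFFIX.sub(r" \1", line)


def rewrite_fastq(lines):
    """Yield FASTQ lines with only the header of each 4-line record rewritten."""
    for i, line in enumerate(lines):
        text = line.rstrip("\n")
        # Line 0 of every block of four is the header.
        yield rewrite_header(text) if i % 4 == 0 else text


record = [
    "@F300002275L1C001R0030000002/1\n",
    "CTTGCAAGTCACTAGAGTGG\n",
    "+\n",
    "BEFEDF=FFDFAAFEC7D?F\n",
]
for out in rewrite_fastq(record):
    print(out)
```

For gzipped input, the same generator can be fed from `gzip.open(path, "rt")` and its output written with `gzip.open(path, "wt")`.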