Closed windtalker6 closed 4 years ago
All my PE data are like this, so why does umi_tools report "Read pairs do not match"?
In conclusion, my questions are as follows:
I added the arguments --bc-pattern=NNNNNNN --bc-pattern2=NNNNNNN because the first 7 bp are UMIs in both fq1 and fq2. Is there anything wrong with these arguments?
What does "Read pairs do not match" mean? Should I modify fq1 and fq2 before running umi_tools extract?
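As a rough illustration of what the two all-N patterns ask for, here is a minimal Python sketch of fixed-length UMI extraction from a read pair. It is not UMI-tools code: `extract_fixed_umi` and `extract_pair` are hypothetical helpers, the example reads in the usage note are made up, and the sketch assumes the UMI sits at the start of each read and that both UMIs are appended to the read name after an underscore, which is what the extracted output quoted in this thread looks like.

```python
def extract_fixed_umi(seq, qual, pattern):
    """Split a read into (umi, trimmed_seq, trimmed_qual) for an all-N pattern."""
    if set(pattern) != {"N"}:
        raise ValueError("this sketch only handles patterns made of N")
    n = len(pattern)
    return seq[:n], seq[n:], qual[n:]


def extract_pair(name, read1, read2, pattern1="NNNNNNN", pattern2="NNNNNNN"):
    """Extract a UMI from each mate and append both to the shared read name.

    read1 and read2 are (sequence, quality) tuples; name has no /1 or /2 suffix.
    """
    umi1, seq1, qual1 = extract_fixed_umi(*read1, pattern1)
    umi2, seq2, qual2 = extract_fixed_umi(*read2, pattern2)
    new_name = f"{name}_{umi1}{umi2}"  # read 1 UMI first, then read 2 UMI
    return (new_name, seq1, qual1), (new_name, seq2, qual2)
```

Calling `extract_pair("F300002275L1C001R0030000002", ("CTTGCAAGTCACT", "F" * 13), ("CTTGCATTTACTG", "E" * 13))` gives both mates the name `F300002275L1C001R0030000002_CTTGCAACTTGCAT` and removes the first 7 bases and quality values from each read.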
Hi @windtalker6. The error occurs because the reads have /1 and /2 suffixes on them. If you install UMI-tools from the master branch, there's an option, --ignore-read-pair-suffixes, to avoid this error (see https://github.com/CGATOxford/UMI-tools/pull/421). The next release of UMI-tools will include this option.
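To make the mismatch concrete, here is a minimal Python sketch of the name comparison. It is not the UMI-tools implementation: `read_id`, `strip_pair_suffix` and `names_match` are hypothetical helpers, and the only detail taken from the traceback is that the identifier is cut at the first whitespace (`identifier.split()[0]`) before the two names are compared.

```python
def read_id(identifier):
    """Return the read name up to the first whitespace."""
    return identifier.split()[0]


def strip_pair_suffix(name):
    """Drop a trailing /1 or /2 (hypothetical helper, for illustration only)."""
    if name.endswith(("/1", "/2")):
        return name[:-2]
    return name


def names_match(id1, id2, ignore_suffix=False):
    """Compare two read identifiers, optionally ignoring /1 and /2 suffixes."""
    a, b = read_id(id1), read_id(id2)
    if ignore_suffix:
        a, b = strip_pair_suffix(a), strip_pair_suffix(b)
    return a == b


r1 = "F300002275L1C001R0030000002/1"
r2 = "F300002275L1C001R0030000002/2"
print(names_match(r1, r2))                      # False: the suffixes differ
print(names_match(r1, r2, ignore_suffix=True))  # True: suffixes ignored
```

With the suffixes left in place the two names differ in their last character, which is the situation the error message reports.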
Many thanks! So do I have to modify fq1 and fq2 before running umi_tools extract?
I tried replacing the "/" with a space " ", and then it ran without reporting an error.
For example:
After extract, I get:
@F300002275L1C001R0030000002_CTTGCAACTTGCAT 1
GTCACTAGAGTGGTGCAGCCTATTTTTTAAAAGTCGTGTGTGTCCTCTTACCCAGTACTTCCTCTTCATATGCACCTTCCGCGCTGCTACAGC
+
FFDFAAFEC7D?FDDDF>BCFFFFFFFEDFFFDDEC<CEBFEDEFF?:FFE;EFEA=EFFDDFCEEEFFFFEDB7DFF:>EEEEFEDFAAD;E
and:
@F300002275L1C001R0030000002_CTTGCAACTTGCAT 2
TTACTGCAGGGGAAATAGTTGACATAAAGATGTACTTGCGTATTAGGCACTCCGATTTCAAAGATTTACTCGTATATTGGTCAAAGATATACT
+
EEDEEFFBD>>A;DECFEDFE@EFEEFEE@EFAFDEEFCEDEFEFEADEEEDED9@DDFECEE<FCCFFFF<>EEFFEF<EE@FFACFFFDFF
=======================================================================
You can see that "CTTGCAACTTGCAT" is the concatenation of "CTTGCAA" and "CTTGCAT". Is there an argument that adds a dash to separate the two?
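If a separator is wanted, one possibility is to insert it after extraction. The Python sketch below is a post-processing idea only, not an umi_tools option: `split_joined_umi` is a hypothetical helper, it assumes the UMI is the text after the last underscore in the read name and that the first UMI is 7 bases long, and whether downstream tools accept a dash inside the UMI is not something this thread establishes.

```python
def split_joined_umi(header, len1=7, sep="-"):
    """Insert sep between the two UMIs appended to a read name.

    header is a FASTQ header such as '@NAME_UMI1UMI2 1'; any text after
    the first space is kept unchanged.
    """
    name, space, comment = header.partition(" ")
    prefix, umi = name.rsplit("_", 1)
    return f"{prefix}_{umi[:len1]}{sep}{umi[len1:]}{space}{comment}"


print(split_joined_umi("@F300002275L1C001R0030000002_CTTGCAACTTGCAT 1"))
# @F300002275L1C001R0030000002_CTTGCAA-CTTGCAT 1
```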
You don't need to modify the fastqs, no. If you follow my suggestion, you can use the fastqs unmodified. As you've also found though, replacing the / with a space will also work, since any text after the first space in the read name is ignored.
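For anyone who does go the route of editing the files, the replacement can be sketched in Python as follows. This is an illustrative sketch rather than part of UMI-tools: the function names are hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequences) in gzip-compressed files. Only header lines are touched, because quality strings can legitimately contain "/".

```python
import gzip


def space_pair_suffix(header):
    """Turn a trailing '/1' or '/2' in a FASTQ header into ' 1' or ' 2'."""
    stripped = header.rstrip("\n")
    if stripped.endswith(("/1", "/2")):
        stripped = stripped[:-2] + " " + stripped[-1]
    return stripped + "\n"


def rewrite_headers(lines):
    """Yield FASTQ lines with only the header (every 4th line) rewritten."""
    for i, line in enumerate(lines):
        yield space_pair_suffix(line) if i % 4 == 0 else line


def rewrite_fastq(in_path, out_path):
    """Stream a gzipped FASTQ file to a new file with rewritten headers."""
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        fout.writelines(rewrite_headers(fin))
```

After this rewrite both mates share the name before the space, so the comparison on the first whitespace-delimited token succeeds.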
My data are PE data with fq1 and fq2 for each sample. Since the first 7 bp of each read are the UMI, I ran the extract command as follows:
umi_tools extract --extract-method=string --bc-pattern=NNNNNNN --bc-pattern2=NNNNNNN -I test.pe_1.fq.gz -S test.extract.pe_1.fq.gz --read2-in=test.pe_2.fq.gz --read2-out=test.extract.pe_2.fq.gz
the error is:
Traceback (most recent call last):
  File "/home/wubin/miniconda3/envs/umi_tools/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/home/wubin/miniconda3/envs/umi_tools/lib/python3.7/site-packages/umi_tools/umi_tools.py", line 61, in main
    module.main(sys.argv)
  File "/home/wubin/miniconda3/envs/umi_tools/lib/python3.7/site-packages/umi_tools/extract.py", line 397, in main
    read1s, read2s, strict):
  File "/home/wubin/miniconda3/envs/umi_tools/lib/python3.7/site-packages/umi_tools/umi_methods.py", line 123, in joinedFastqIterate
    (pair_id, read2.identifier.split()[0]))
ValueError:
Read pairs do not match
F300002275L1C001R0030000002/1 != F300002275L1C001R0030000002/2
In my paired-end data, a read named "F300002275L1C001R0030000002/1" in fq1 is paired with the read named "F300002275L1C001R0030000002/2" in fq2. Is there anything wrong with that?
How can I solve this problem?