epi2me-labs / pychopper

cDNA read preprocessing
Other
55 stars 9 forks source link

Question about the two detection methods #65

Open CHENAO-QIAN opened 1 month ago

CHENAO-QIAN commented 1 month ago

Hi,

I am trying the two methods: phmm and edlib. I found phmm has a high reproducibility, giving the same result between different runs. While edlib can give very different results. I am wondering if there is a recommendation regarding in which situation to use which method?

Thanks!

NanoCoreUSA commented 1 month ago

I've have also noticed this recently, specifically in regard to how it handles UMIs with the '-U' flag. I've posted some examples below to illustrate (ran on a FASTQ file with one read, with the last two lines i.e., quality scores removed for clarity).

Here's starting input file (i.e., input to pychopper) -

@89afb5e5-88ea-4097-b1f1-8d84d48f1d0e runid=6578f753d630b7dedc05b3d42e9ab21a8b375a15 sampleid=secondBatch_1 read=51236 ch=329 start_time=2023-01-24T04:42:25Z model_version_id=2021-05-17_dna_r9.4.1_minion_768_2f1c8637 barcode=barcode11
TTGTACTTCGGACATTCTGTTGGTGTTGCTCCGGATCGCCACGCCTACCGTGACGTCCTGCATCTATCGGAGGGAATGGATTTCTGTTGGTGCTGATATTGCTCCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCGCATCCCAACCCGCCGCCGCAGCCGCCTACAAACTGGTGCTGATCCGGCACGGCGAAGGCGCATGGAACCTGGAGAACCGCTTCAGCGGCTGGTACGACGCCCGGCCTGAACCAGCCGGACGAAATGAAGCGCGGCGAGGCGGCTACGAGATGCTGTTGCTGGTTGGCAACATCATTACTCGTTGAAGGCGATCCGGGCCCTCTGGACAGTGCTAGATGCCATTGATCAGATGTGGCTGCCAGTGGTGAGGACTTGGCGCCTCAATGAGCGGCACTATGGGGGTCTAACCGGTCTCAATAAAGCAGAAACTGCTGCAAAGCATGGTGAGGCCCAGGTGAAGATCTGGAGGCGCTCTATGATGTCCCACCACCTCCGATGGAGCCCGACCATCCTTTCTACAGCAACATCAGTAAGGATCGCGGGTATGCAGACCTCACAGAAGATCAGCTACCCTCCTGTGAGAGTCTGAAGGATACTATTGCCAGAGCTCTGCCCTTCTGGAATGAAGAAATGATTCCCCAGATCAAGGAGGGGAAACGTGTACTGATTGCAGCCCATGGCAAGCCTCCGGGGCATTGTCAAGCACTCGGAGGTCTCTCTGAAGGCTATCATGGAGCTGAACCTGCCGACTGGTATTCCCATTGTCTATGAATTGGGCAGGAACTTGAAGCCTGTCAAGCCCATGCAGTTCTGGGGGATGAAGAGACGGTGCCGCAAGCCATGGAAGCTGTGGCTGCCCAGGCAGAGCCAAGAAGTGAAGGCCGGCGGGGAGGATGCCTCCCCCAGGAGCACCCTCCCGTCTTGTCCTCATGCCCTCCCACCTGCACATGTCACACTGACCACATCTGTAGACATCTTGAGTTGTAGCTGCAGGCTGGGGGACCAGTGGCTCCCATTTTCATTTTAGCCATTTTGTCATGCCACCCACTCCCTTCATACAATCTAGTAAGAATAGCAGTTCTAGAGCACAGGTTCTCAGTCTAAGCTATGGAAAACACATATCCAACAGAGTTTAAAGTAGTGACTTGGGTTTTTGCGAGTGCTTTGTTTACTAAGGACTTTGGGGAGGAACCATGCTAAGCCATGACCAGTGAGGAGAAGCAACAGAGCCTGTCTGTCCCCATGAGCGGAGTCTGTCCTCTGCTCTTCTGCAGTCAGGTCACTGCCTACTGCCTGGGGGCTCTAGTCATTCCAGTGGAAGACGAATGTAACCTGCGTGGTGATGTGACAACTGTTTCCTCCTGACCCCAGAGGATCTGGCTCTAGGTTGGGATCAATCCTGAATTTCGTTATGTGTTAATTTACTTTTATTAAAAAAGTATAGTATATATAATACAAAACAATAACCCTTCTGGGGTTTCTTGTGGCAGATTTGAAATAGTCCCACATGTGGTCATCAGAAAATAAGCCATTCCTCATACCAATATAGGATCAGCTCCTTGACCTCTGAGGGGCAGGAGTGCTTCCTGGTGTGTGTATTAGAATCCCTTCCTGCCTTGTTTCATGGCAGTGAAATGCCTCTTGGTCCTGTCCAAGTGTATCTTTCACTGATTTCTGAATCATGTTTCTAGTTGCTTGACCCTGCCACATGGGTCCAGTGTTCATCTGAGCATAACTGTACTAAATCCTTTCCATATCAGTATAATAAAGGAGTGATGTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGCTTGCGGGCGGCGGACTCTCCTCTGAAGATAGAGCGGCAGGCAAGTTCCATTCCCTCTATAGATGAAACGTCA

Running the following pychopper command (i.e., using 'phmm' model) returns this -

pychopper -U -k PCS111 -m phmm barcode11_combinedReads.head.fastq pychopper.phmm.fastq

@132:1882|89afb5e5-88ea-4097-b1f1-8d84d48f1d0e runid=6578f753d630b7dedc05b3d42e9ab21a8b375a15 sampleid=secondBatch_1 read=51236 ch=329 start_time=2023-01-24T04:42:25Z model_version_id=2021-05-17_dna_r9.4.1_minion_768_2f1c8637 barcode=barcode11 strand=+ umi=TTGCTCCCATTGGCATTGACCTTAAGCTTT
GTCGGTGCCGCATCCCAACCCGCCGCCGCAGCCGCCTACAAACTGGTGCTGATCCGGCACGGCGAAGGCGCATGGAACCTGGAGAACCGCTTCAGCGGCTGGTACGACGCCCGGCCTGAACCAGCCGGACGAAATGAAGCGCGGCGAGGCGGCTACGAGATGCTGTTGCTGGTTGGCAACATCATTACTCGTTGAAGGCGATCCGGGCCCTCTGGACAGTGCTAGATGCCATTGATCAGATGTGGCTGCCAGTGGTGAGGACTTGGCGCCTCAATGAGCGGCACTATGGGGGTCTAACCGGTCTCAATAAAGCAGAAACTGCTGCAAAGCATGGTGAGGCCCAGGTGAAGATCTGGAGGCGCTCTATGATGTCCCACCACCTCCGATGGAGCCCGACCATCCTTTCTACAGCAACATCAGTAAGGATCGCGGGTATGCAGACCTCACAGAAGATCAGCTACCCTCCTGTGAGAGTCTGAAGGATACTATTGCCAGAGCTCTGCCCTTCTGGAATGAAGAAATGATTCCCCAGATCAAGGAGGGGAAACGTGTACTGATTGCAGCCCATGGCAAGCCTCCGGGGCATTGTCAAGCACTCGGAGGTCTCTCTGAAGGCTATCATGGAGCTGAACCTGCCGACTGGTATTCCCATTGTCTATGAATTGGGCAGGAACTTGAAGCCTGTCAAGCCCATGCAGTTCTGGGGGATGAAGAGACGGTGCCGCAAGCCATGGAAGCTGTGGCTGCCCAGGCAGAGCCAAGAAGTGAAGGCCGGCGGGGAGGATGCCTCCCCCAGGAGCACCCTCCCGTCTTGTCCTCATGCCCTCCCACCTGCACATGTCACACTGACCACATCTGTAGACATCTTGAGTTGTAGCTGCAGGCTGGGGGACCAGTGGCTCCCATTTTCATTTTAGCCATTTTGTCATGCCACCCACTCCCTTCATACAATCTAGTAAGAATAGCAGTTCTAGAGCACAGGTTCTCAGTCTAAGCTATGGAAAACACATATCCAACAGAGTTTAAAGTAGTGACTTGGGTTTTTGCGAGTGCTTTGTTTACTAAGGACTTTGGGGAGGAACCATGCTAAGCCATGACCAGTGAGGAGAAGCAACAGAGCCTGTCTGTCCCCATGAGCGGAGTCTGTCCTCTGCTCTTCTGCAGTCAGGTCACTGCCTACTGCCTGGGGGCTCTAGTCATTCCAGTGGAAGACGAATGTAACCTGCGTGGTGATGTGACAACTGTTTCCTCCTGACCCCAGAGGATCTGGCTCTAGGTTGGGATCAATCCTGAATTTCGTTATGTGTTAATTTACTTTTATTAAAAAAGTATAGTATATATAATACAAAACAATAACCCTTCTGGGGTTTCTTGTGGCAGATTTGAAATAGTCCCACATGTGGTCATCAGAAAATAAGCCATTCCTCATACCAATATAGGATCAGCTCCTTGACCTCTGAGGGGCAGGAGTGCTTCCTGGTGTGTGTATTAGAATCCCTTCCTGCCTTGTTTCATGGCAGTGAAATGCCTCTTGGTCCTGTCCAAGTGTATCTTTCACTGATTTCTGAATCATGTTTCTAGTTGCTTGACCCTGCCACATGGGTCCAGTGTTCATCTGAGCATAACTGTACTAAATCCTTTCCATATCAGTATAATAAAGGAGTGATGTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This identifies the UMI as umi=TTGCTCCCATTGGCATTGACCTTAAGCTTT and the new sequence begins with GTCGGTGCCG. Refering back to the original input file, it seems the 'phmm' model is actually removing the UMI from the read (as well as few extra nucleotides, i.e., 'GGGG'). See here, where the first highlight is UMI and the second is where the new sequence begins -

TTGCTCCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCG

In contrast, running this command (i.e., the 'edlib' model) returns this -

pychopper -U -k PCS111 -m edlib barcode11_combinedReads.head.fastq pychopper.edlib.fastq

@103:1885|89afb5e5-88ea-4097-b1f1-8d84d48f1d0e runid=6578f753d630b7dedc05b3d42e9ab21a8b375a15 sampleid=secondBatch_1 read=51236 ch=329 start_time=2023-01-24T04:42:25Z model_version_id=2021-05-17_dna_r9.4.1_minion_768_2f1c8637 barcode=barcode11 strand=+ umi=TTGCTCCCATTGGCATTGACCTTAAGCTTT
CCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCGCATCCCAACCCGCCGCCGCAGCCGCCTACAAACTGGTGCTGATCCGGCACGGCGAAGGCGCATGGAACCTGGAGAACCGCTTCAGCGGCTGGTACGACGCCCGGCCTGAACCAGCCGGACGAAATGAAGCGCGGCGAGGCGGCTACGAGATGCTGTTGCTGGTTGGCAACATCATTACTCGTTGAAGGCGATCCGGGCCCTCTGGACAGTGCTAGATGCCATTGATCAGATGTGGCTGCCAGTGGTGAGGACTTGGCGCCTCAATGAGCGGCACTATGGGGGTCTAACCGGTCTCAATAAAGCAGAAACTGCTGCAAAGCATGGTGAGGCCCAGGTGAAGATCTGGAGGCGCTCTATGATGTCCCACCACCTCCGATGGAGCCCGACCATCCTTTCTACAGCAACATCAGTAAGGATCGCGGGTATGCAGACCTCACAGAAGATCAGCTACCCTCCTGTGAGAGTCTGAAGGATACTATTGCCAGAGCTCTGCCCTTCTGGAATGAAGAAATGATTCCCCAGATCAAGGAGGGGAAACGTGTACTGATTGCAGCCCATGGCAAGCCTCCGGGGCATTGTCAAGCACTCGGAGGTCTCTCTGAAGGCTATCATGGAGCTGAACCTGCCGACTGGTATTCCCATTGTCTATGAATTGGGCAGGAACTTGAAGCCTGTCAAGCCCATGCAGTTCTGGGGGATGAAGAGACGGTGCCGCAAGCCATGGAAGCTGTGGCTGCCCAGGCAGAGCCAAGAAGTGAAGGCCGGCGGGGAGGATGCCTCCCCCAGGAGCACCCTCCCGTCTTGTCCTCATGCCCTCCCACCTGCACATGTCACACTGACCACATCTGTAGACATCTTGAGTTGTAGCTGCAGGCTGGGGGACCAGTGGCTCCCATTTTCATTTTAGCCATTTTGTCATGCCACCCACTCCCTTCATACAATCTAGTAAGAATAGCAGTTCTAGAGCACAGGTTCTCAGTCTAAGCTATGGAAAACACATATCCAACAGAGTTTAAAGTAGTGACTTGGGTTTTTGCGAGTGCTTTGTTTACTAAGGACTTTGGGGAGGAACCATGCTAAGCCATGACCAGTGAGGAGAAGCAACAGAGCCTGTCTGTCCCCATGAGCGGAGTCTGTCCTCTGCTCTTCTGCAGTCAGGTCACTGCCTACTGCCTGGGGGCTCTAGTCATTCCAGTGGAAGACGAATGTAACCTGCGTGGTGATGTGACAACTGTTTCCTCCTGACCCCAGAGGATCTGGCTCTAGGTTGGGATCAATCCTGAATTTCGTTATGTGTTAATTTACTTTTATTAAAAAAGTATAGTATATATAATACAAAACAATAACCCTTCTGGGGTTTCTTGTGGCAGATTTGAAATAGTCCCACATGTGGTCATCAGAAAATAAGCCATTCCTCATACCAATATAGGATCAGCTCCTTGACCTCTGAGGGGCAGGAGTGCTTCCTGGTGTGTGTATTAGAATCCCTTCCTGCCTTGTTTCATGGCAGTGAAATGCCTCTTGGTCCTGTCCAAGTGTATCTTTCACTGATTTCTGAATCATGTTTCTAGTTGCTTGACCCTGCCACATGGGTCCAGTGTTCATCTGAGCATAACTGTACTAAATCCTTTCCATATCAGTATAATAAAGGAGTGATGTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

The identified UMI is the same, however now the new sequence still contains the UMI (albeit a truncated form missing the first five nucleotides) and also includes the 'GGGG' spacer the was trimmed from the 'phmm' model. See here from the first line of the pychopper output file -

CCCATTGGCATTGACCTTAAGCTTTGGGGGTCGGTGCCG

So, the two models are definitely behaving differently, with the 'phmm' removing the UMI as well extra nucleotides, and the 'edlib' keeping the UMI in the sequence but with some nucleotides truncated.

To the original post point, it would be helpful to know what's going on here and what the recommendations are. At first glance it would seem the default 'phmm' model is ideal since the UMI is removed (though this does seem to refute #49 indicating the UMI is not trimmed...), but it's also removing extra sequence and unclear if that's supposed to be a part of the read or not.

Any feedback would be appreciated as we need this info for our pipeline development!

nrhorner commented 3 weeks ago

@CHENAO-QIAN Could you give an example of the command you used with the edlib option and what you are finding that is different between the runs please?

CHENAO-QIAN commented 2 weeks ago

Hi @nrhorner,

The cmd I used: pychopper -k PCS111 \ -m edlib \ -t 50 \ -r ${id}_pychopper_report.pdf \ -u ${id}_pychopper.unclassified.fq \ -w ${id}_pychopper.rescued.fq \ -S ${id}_pychopper.stats \ -A ${id}_pychopper.scores \ $line ${id}_pychopped.fastq

I ran the above code twice and the outputs I am getting are different. I did a quick check of the number of lines of the two outputs:

wc -l sample_pychopped3.fastq
352090580 sample_pychopped3.fastq

wc -l sample_pychopper.rescued3.fq
15962440 sample_pychopper.rescued3.fq

wc -l sample_pychopper.unclassified3.fq
34285372 sample_pychopper.unclassified3.fq
wc -l sample_pychopped6.fastq
350023924 sample_pychopped6.fastq

wc -l sample_pychopper.rescued6.fq
15446508 sample_pychopper.rescued6.fq

wc -l sample_pychopper.unclassified6.fq
37104128 sample_pychopper.unclassified6.fq