DMU-lilab / pTrimmer

Used to trim off the primer sequence from multiplex amplicon sequencing
GNU General Public License v3.0

Empty sequence caused by too short opposite primer at 3'-end #4

Closed XLZH closed 4 years ago

XLZH commented 4 years ago

Dear developers,

Is it possible that the software discards paired reads in which both primers are not found? If the amplicon is large, some reads will not contain the opposite primer at the 3'-end, or will contain only a small part of it. One of my amplicons is being missed completely, and NNNNN is reported in the final files. Here are 3 reads for that amplicon (paired-end).

r1.fastq

@M03970:332:000000000-J2CK5:1:1101:16151:2887 1:N:0:15
GTCCAGCTTTGTGCCAGGAGCCTCGCAGGGGTTGATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACA
+
AABFCFDDFFGFGGGGGAGFHFEGG2A2AEEG3FFG2GFHFHGE1AEEHHHHHGGGHH3FDFHHHHHHHEGHGGFGGGH3FFFBFFC?BGFHHHHEGHHFFDH?FHGHH2B?G/?GGFFG//FC@2<FFGGHHBDGHHGD//--<
@M03970:332:000000000-J2CK5:1:1101:21721:3033 1:N:0:15
GTCCAGCTTTGTGCCAGGAGCCTCGCAGGGGTTGATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACA
+
BBCCCFFFFFFFGGGGGGGGGGHHGGCFGEEEEGHHHHHHHHHHHGGGGGHHHHHGGGHHHHHHHHHHHHHHHHHGGGGGFHHHHHHHGGHHHHHHHHHFHHHHHHHHHHHHHGHGGGHHGGGGHGHFGHHHHHHHCEHHHGGGGGGGG
@M03970:332:000000000-J2CK5:1:1101:14369:3115 1:N:0:15
GTCCAGCTTTGTGCCAGGAGCCTCGCAGGGGTTGATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGCCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACA
+
CCCCCFFFFFFFGGGGGGGGGGHHGGGGGGGGGGHHHHHHHHHHHGGGGGHHHHHGGGHHHHHHHHHHHHHHHHHGGGGGHHHHHHHHGGHHHHHHHHHHHHHHHHHHHHHHHGHGGGHHGGGGHHH@FGHHHHHHHGHHHGGGGGGGG

r2.fastq

@M03970:332:000000000-J2CK5:1:1101:16151:2887 2:N:0:15
AGCCCGAACGCAAAGTGTCCCCGGAGCCCAGCAGCTACCTGCTCCCTGGACGGTGGCTCTAGCCTTTTGAGAAGCTCAAAACTTTTAGCGCCAGTCTTGAGCACATGGGAGGGGAACACCCCAATCCCATCAACCCCTGCGAGGCTACTGG
+
?AA@ADDDDDADG?BGG3FGGGGE0/AFE/EFB0GF1FFDDHBGHGFCG0BE/>EEHCGF1B1B@11B1//>B0F1G1@11BFFGD2>GEGGCEBFHEFFBFHFH1F0/0?CCC///<@EGG?/FFHHFHHF1FGGGE..<<<-CC0<:CG
@M03970:332:000000000-J2CK5:1:1101:21721:3033 2:N:0:15
AGCCCGAACGCAAAGTGTCCCCGGAGCCCAGCAGCTACCTGCTCCCTGGACGGTGGCTCTAGACTTTTGAGAAGCTCAAAACTTTTAGCGCCAGTCTTGAGCACATGGGAGGGGAAAACCCCAATCCCATCAACCCCTGCGAGGCTCCTGG
+
ABBBBB?ADDBB?CGGGGGGGGGGGGGGGFFHHHHHGHHHGHHHHGHHHBHGGEHGHHHHBG4FFHHHHAF?4BFHHB@@3FGHHH4EGGGGGGGHHHHDHHHHFGFHGGFGGGGG/FHGGGA/GHHHFHH11GGGGGGHGCDCG.CGHGH
@M03970:332:000000000-J2CK5:1:1101:14369:3115 2:N:0:15
AGCCCGAACGCAAAGTGTCCCCGGAGCCCAGCAGCTACCTGCTCCCTGGACGGTGGCTCTAGGCTTTTGAGAAGCTCAAAACTTTTAGCGCCAGTCTTGAGCACATGGGAGGGGAAAACCCCAATCCCATCAACCCCTGCGAGGCTCCTGG
+
AB@AABBBBBBBGGFGFGGGGGGGCGGGGHHHHGHHHHHHHHHHHGHHHHHGGEGDHHHHFHHHHHHFHFGHHHHHHHFFFHHHHHHHHGGGGGHHHHHHHHHHHHHHGHGGGGGGHHHGGGGGHHHHHHHHHHHGGGGHGGGGGFHGHHH

Primers used to amplify the region:

TP53_Ex1_a_5 GTCCAGCTTTGTGCCAGGAG
TP53_Ex1_a_3 AGCCCGAACGCAAAGTGT

amplicon_primers.txt

AGCCCGAACGCAAAGTGT GTCCAGCTTTGTGCCAGGAG 125

Thanks very much in advance.

Regards,

Sheila

Originally posted by @smzt in https://github.com/DMU-lilab/pTrimmer/issues/3#issuecomment-637606797

XLZH commented 4 years ago

Hello @smzt,

pTrimmer supports both "long amplicon size (normal condition)" and "short amplicon size (read-through condition)". The amplicon you provided is, in fact, the "read-through condition". However, the part of the opposite primer present at the 3'-end is too short (even shorter than one k-mer) to be located. To obtain accurate results, we prefer to discard such reads.

The match condition of your reads is as follows:
fastq1: opposite primer at 3'-end only has 3 bases (ACA)
fastq2: opposite primer at 3'-end only has 7 bases (CTCCTGG)

----- fastq 1 -----
@M03970:332:000000000-J2CK5:1:1101:16151:2887 1:N:0:15
GTCCAGCTTTGTGCCAGGAG CCTCGCAGGGGTTGATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGG ACA
GTCCAGCTTTGTGCCAGGAG                                                                                                                                ACACTTTGCGTTCGGGCT

----- fastq 2 -----
@M03970:332:000000000-J2CK5:1:1101:21721:3033 2:N:0:15
AGCCCGAACGCAAAGTGT CCCCGGAGCCCAGCAGCTACCTGCTCCCTGGACGGTGGCTCTAGACTTTTGAGAAGCTCAAAACTTTTAGCGCCAGTCTTGAGCACATGGGAGGGGAAAACCCCAATCCCATCAACCCCTGCGAGG CTCCTGG
AGCCCGAACGCAAAGTGT                                                                                                                                CTCCTGGCACAAAGCTGGAC
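
The alignment above can be reproduced with a minimal Python sketch. It reverse-complements the opposite primer and reports the longest exact overlap between the 3'-end of a read and the start of that reverse complement. The function names are hypothetical, only the last bases of each read are used for brevity, and exact matching is a simplifying assumption; this is an illustration, not pTrimmer's actual k-mer based search.

```python
# Illustrative sketch only: helper names are hypothetical, not part of pTrimmer.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_through_length(read: str, opposite_primer: str) -> int:
    """Length of the longest read suffix that equals a prefix of the
    reverse-complemented opposite primer (exact matches only)."""
    target = reverse_complement(opposite_primer)
    for n in range(min(len(read), len(target)), 0, -1):
        if read.endswith(target[:n]):
            return n
    return 0


FWD_PRIMER = "GTCCAGCTTTGTGCCAGGAG"  # TP53_Ex1_a_5
REV_PRIMER = "AGCCCGAACGCAAAGTGT"    # TP53_Ex1_a_3

# Last bases of the two reads shown above (tails only, for brevity).
read1_tail = "GCTGGGCTCCGGGGACA"
read2_tail = "CCCCTGCGAGGCTCCTGG"

print(read_through_length(read1_tail, REV_PRIMER))  # 3 (ACA)
print(read_through_length(read2_tail, FWD_PRIMER))  # 7 (CTCCTGG)
```

Both overlaps are shorter than a k-mer of 8 bases, which is why neither fragment can be located.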

smzt commented 4 years ago

Hi Xiaolong,

That's the reason why I contacted you. This issue is very common when analyzing amplicons with NGS: for some of them, the full primer sequence is not present at the 3'-end. It is a pity that your tool does not include an option either to remove these small parts of the primer at the 3'-end or at least to retain these reads.

Thank you very much for your quick reply.

Regards,

Sheila


XLZH commented 4 years ago

Hi Sheila,

pTrimmer is able to process most partial read-through primer sequences at the 3'-end. However, the part of the primer present at the 3'-end must be at least as long as one k-mer (default: 8). This is a relatively conservative strategy to prevent wrong matches.

For the amplicon you provided:

(1) You can set the parameter '-k|--kmer' to 7 to process most of the reads, such as the one shown below (7 bases of read-through).

----- fastq 2 -----
@M03970:332:000000000-J2CK5:1:1101:21721:3033 2:N:0:15
AGCCCGAACGCAAAGTGT CCCCGGAGCCCAGCAGCTACCTGCTCCCTGGACGGTGGCTCTAGACTTTTGAGAAGCTCAAAACTTTTAGCGCCAGTCTTGAGCACATGGGAGGGGAAAACCCCAATCCCATCAACCCCTGCGAGG CTCCTGG
AGCCCGAACGCAAAGTGT                                                                                                                                CTCCTGGCACAAAGCTGGAC

(2) There is also a parameter '-l|--keep' to retain reads in which the primer sequence could not be located.
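
To make the two options concrete, here is a rough Python sketch of the idea, not pTrimmer's actual implementation. It assumes the read-through fragment is anchored by the first k-mer of the reverse-complemented primer, so a fragment shorter than k is never found, and it uses a `keep` switch that mirrors what '-l|--keep' is described to do. All function names are hypothetical, and matching is exact for simplicity.

```python
from typing import Optional

# Illustrative sketch only: names are hypothetical, not part of pTrimmer.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_read_through(read: str, opposite_primer: str, k: int) -> Optional[int]:
    """Return the index where the read-through primer starts, or None.

    Assumption: the first k bases of the reverse-complemented primer act
    as the seed, so a fragment shorter than k bases cannot be located.
    """
    target = reverse_complement(opposite_primer)
    pos = read.rfind(target[:k])
    # Accept the hit only if everything from the seed to the end of the
    # read agrees with the primer (exact match, a simplification).
    if pos != -1 and target.startswith(read[pos:]):
        return pos
    return None


def trim_3prime(read: str, opposite_primer: str,
                k: int = 8, keep: bool = False) -> Optional[str]:
    """Trim the read-through primer; on failure keep or discard the read."""
    pos = find_read_through(read, opposite_primer, k)
    if pos is not None:
        return read[:pos]
    return read if keep else None


FWD_PRIMER = "GTCCAGCTTTGTGCCAGGAG"  # TP53_Ex1_a_5
REV_PRIMER = "AGCCCGAACGCAAAGTGT"    # TP53_Ex1_a_3
read1_tail = "GCTGGGCTCCGGGGACA"     # last bases of the fastq 1 read
read2_tail = "CCCCTGCGAGGCTCCTGG"    # last bases of the fastq 2 read

print(trim_3prime(read2_tail, FWD_PRIMER, k=8))             # None (discarded)
print(trim_3prime(read2_tail, FWD_PRIMER, k=7))             # CCCCTGCGAGG
print(trim_3prime(read1_tail, REV_PRIMER, k=7))             # None (only 3 bases)
print(trim_3prime(read1_tail, REV_PRIMER, k=7, keep=True))  # read kept untrimmed
```

In this sketch, lowering k to 7 rescues the read with 7 bases of read-through, while the read with only 3 bases still fails and survives only when `keep` is set.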

Best, Xiaolong Zhang