DMU-lilab / pTrimmer

Used to trim off the primer sequence from multiplex amplicon sequencing
GNU General Public License v3.0

How the amplicon is counted and How insertion length is used in this program? #22

Closed yangjw1996 closed 10 months ago

yangjw1996 commented 1 year ago

Hi Xiaolong, thank you for developing pTrimmer; it's very fast and handy!

Could you please explain how the amplicon is counted in paired-end reads? (I'm not a C user.) Is it the sum of the forward-primer matches in read 1 and the reverse-primer matches in read 2? Also, if the data is single-end, how is the amplicon counted?

My other question is about what role the insertion length plays in this program. The insertion length is required in '--ampfile', but I suppose it may not be used for paired-end files. Or is it actually used?

What's more, I ran some tests in which I changed the insertion length to see whether the primers could still be located correctly. The results show that if the specified insertion length differs too much from the actual insertion length in the reads, pTrimmer cannot locate the primers and reports the reads as 'NNNNNNNN', or reports the reads as they are (if '--keep' is set). In detail, the test data is single-end and the actual insertion length is 8 bp; when I increased the specified insertion length to about 20 bp, pTrimmer could no longer locate the primers.

This may cause trouble if I want to trim primers from amplicons targeting CNVs (copy number variations). The insertion length of a CNV amplicon is very likely to fall outside the 'tolerated range' by which the specified insertion length may differ from the actual insertion length. I do not understand why the insertion length is required; I thought the k-mer sequences would be generated from the length of the reads. If the insertion length is there to help detect the target amplicon, since the primers may also amplify off-target sequences, I don't think it is necessary, because subsequent alignment will address off-target amplicons.

Could you please consider the above situation, or even adjust pTrimmer to support amplicons targeting CNVs?

Thanks a lot! Best wishes!

jiawen

XLZH commented 1 year ago

Hello, thanks for using pTrimmer!

  1. There is no significant difference between single-end and paired-end reads, because each read is processed independently when removing primers. Therefore, the amplicon count represents how many reads belong to that amplicon.
  2. The role of the insert length is to help pTrimmer roughly determine whether the reads belonging to the target amplicon are in the read-through condition or the normal condition. In the read-through condition (refer to: README->Input->Note->(1) read-through condition), pTrimmer needs to remove both the forward primer and the reverse complement of the reverse primer. In the normal condition (refer to: README->Input->Note->(2) normal condition), pTrimmer only needs to remove the forward primer.
  3. The insert length does not need to be very accurate; it is only used to distinguish the two primer-trimming conditions. However, if the set value is too different from the real value, matching the reverse complement of the reverse primer will fail. Please refer to the 'Note' section to determine a better solution for primer trimming of the target amplicon.
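For readers who don't use C, the three points above can be sketched in Python. This is an illustration only, not pTrimmer's actual code: the function names, the exact-match primer search, the `tolerance` window of 5 bp, and the rule "read-through when the read is longer than forward primer plus insert" are all assumptions made for the example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_read(read, fwd, rev, insert_len, tolerance=5):
    """Return the read without primers, or None if they are not located.

    Hypothetical sketch: exact matching and a fixed search window are
    assumptions, not pTrimmer's real matching rules.
    """
    if not read.startswith(fwd):
        return None
    body = read[len(fwd):]
    if len(body) <= insert_len:
        # Normal condition: the read ends inside the insert,
        # so only the forward primer is removed.
        return body
    # Read-through condition: look for the reverse complement of the
    # reverse primer, but only near the expected position insert_len.
    target = revcomp(rev)
    lo = max(0, insert_len - tolerance)
    hi = insert_len + tolerance
    pos = body.find(target, lo, hi + len(target))
    if pos == -1:
        return None
    return body[:pos]


def count_amplicons(reads, amplicons):
    """Count reads per amplicon; each read is handled independently."""
    counts = Counter()
    for read in reads:
        for name, (fwd, rev, insert_len) in amplicons.items():
            if trim_read(read, fwd, rev, insert_len) is not None:
                counts[name] += 1
                break
    return counts


# Toy read: forward primer + 8 bp insert + revcomp(reverse primer) + extra bases.
fwd, rev = "ACGTAC", "TTGGCC"
read = fwd + "GATTACAG" + revcomp(rev) + "AGATCGGAAGAGC"
print(trim_read(read, fwd, rev, insert_len=8))   # GATTACAG
print(trim_read(read, fwd, rev, insert_len=20))  # None
```

With the insert length set to the true 8 bp, both primers are found; with 20 bp, the search window in this sketch no longer covers the real position of the reverse primer, so the read is reported as untrimmed. That mirrors the kind of failure described in the question, under the stated assumptions.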

best wishes, xiaolong zhang

yangjw1996 commented 1 year ago

Xiaolong, thank you for your reply and advice. It helps me a lot!

As for points 2 and 3, I'm not sure I understand them correctly. Sometimes the amplicons are partly read-through and partly not, so the insertion length helps determine which situation applies and then choose the appropriate trimming strategy.

The CNV-targeting amplicons I mentioned are not read-through in most cases, although problems can arise when the true insertion length differs too much from the specified insertion length. I now understand that length polymorphism is mainly to 'blame'.

Thank you again!

jiawen

XLZH commented 1 year ago

Yes, correct! For one target region, we usually design many amplicons (primer pairs) to cover it; some of the amplicons can be read through, others cannot. pTrimmer uses the insert length (the tools used to design the amplicons can usually output this information) to find a better way to trim the primers. If most of your reads do not belong to the read-through condition, then for the remaining (read-through) reads pTrimmer will only take the forward primer into consideration and ignore the reverse complement of the reverse primer. By the way, if you manually set the insert length to a very small value, pTrimmer will try to find the reverse complement of the reverse primer, which results in primer trimming failing for most of your reads (those that are not read-through).

best wishes, xiaolong zhang
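To see which amplicons in a panel would fall into each condition for a given read length, a quick check along these lines may help. It is a rough sketch, not pTrimmer's code: the panel names, the primer and insert lengths, and the simple rule "read-through when the read is longer than forward primer plus insert" are all made up for illustration.

```python
def classify_amplicons(read_len, amplicons):
    """Map each amplicon name to 'read-through' or 'normal'.

    amplicons: dict of name -> (forward_primer_len, insert_len).
    The comparison below is an assumed rule, used only for illustration.
    """
    result = {}
    for name, (fwd_len, insert_len) in amplicons.items():
        if read_len > fwd_len + insert_len:
            # The read is long enough to run into the reverse primer.
            result[name] = "read-through"
        else:
            # The read ends inside the insert.
            result[name] = "normal"
    return result


# Hypothetical panel sequenced with 150 bp reads.
panel = {"amp_short": (20, 80), "amp_long": (20, 300)}
print(classify_amplicons(150, panel))
# {'amp_short': 'read-through', 'amp_long': 'normal'}

# Setting the insert length of amp_long far too small flips its condition,
# so a reverse primer would be searched for in reads that do not contain it.
print(classify_amplicons(150, {"amp_long": (20, 10)}))
# {'amp_long': 'read-through'}
```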
