RIVM-bioinformatics / AmpliGone

A tool in order to accurately remove primer sequences from NGS reads in an amplicon experiment
https://rivm-bioinformatics.github.io/AmpliGone/
GNU Affero General Public License v3.0
9 stars 0 forks source link

Some questions regarding 16s rRNA primers #79

Open Thomieh73 opened 1 month ago

Thomieh73 commented 1 month ago

Hi, I am trying out Ampligone. It works fast but I got a few questions. I am working with Nanopore reads of 16s rRNA amplicons.

1) If you provide the tool with a single reference, than it determines that the region to keep is based on that reference, right? Does it take into account that a sequence to analyze could be the reverse complement sequence of the gene? In my nanopore data, the sequences can have both orientations.

2) Are the reads that are put in the output file, all the reads where both primers have been detected? or is it all fragments that match the reference. I am just wondering how to look at this data. What happens with reads where only one or none of the primers were detected?

3) How does Ampligone deal with high diversity of the sequence between the primers? The 16S rRNA amplicons can be quite diverse when an environmental samples such as fecal material is taken. Would a reference dataset with >20.000 full length references help with that?

florianzwagemaker commented 3 weeks ago

Hi @Thomieh73!

Thank you for your questions, i hope i can answer them sufficiently for you.

  1. The region to keep is indeed based on the reference (or references), but orientation of reads does not matter. During the process of aligning the reads to the reference, both the original as well as the reverse-complement orientation are attempted. The best fitting option is then used for further processing.

  1. The reads in the output file are all the reads that match the reference(s) given. I'm writing out a quick example that hopefully explains better what happens with the output when only one or neither of the read-sides start or end within a primer region.

    Let's say that we have have a reference of 3000 nucleotides long and we have two primers that were used to amplify the fragment that corresponds to this; primer_forward and primer_reverse. primer_forward corresponds to positions 200 to 235 when matched to the reference. primer_right corresponds to positions 2500 to 2530 when matched to the reference. When AmpliGone aligns a read to the reference it sees that the alignment of this read starts at position 200 of the reference, it also sees that the alignment of this read ends at position 2100. The start of this read corresponds perfectly with the primer region of primer_forward, therefore AmpliGone will cut the read so that the new starting position of the read will become position 236. Thus having completely removed the primer as this is no longer in the primer region. On the other side, the read ends at position 2100, this is before the right-side primer region, AmpliGone will therefore determine that there is no need to cut this side of the read as there should be no primer sequence on this side of the sequenced read.

    The next read that is being aligned by AmpliGone to the reference starts at position 400 of the reference and ends at position 1900. Neither of these start- or stop-positions of the aligned read correspond to the primer regions established earlier. Therefore AmpliGone will not perform any cutting on this read, if will however include it in the output data because it could align to the reference without any issues. And just because there are no primers to be removed doesn't mean it's incorrect or invalid data that could be useful in another process or analysis step.

    I hope this somewhat clears up the steps and or logic that happens behind the scenes when dealing with "non-perfect" sequenced amplicons 😄


  1. The heavy lifting of read-alignment within AmpliGone is done by the wonderful minimap2 read aligner (or the python bindings to be exact). So sequence diversity is handled by AmpliGone just as minimap2 would handle this, which can be summarized as "it's pretty lenient most of the times". I however have not tested the limits of sequence-diversity within AmpliGone (or minimap2 for that matter) so i cannot help you with any supporting numbers when it comes to this. (p.s. we have a version 2.0.0 upcoming that will allow you to change the alignment parameters in order to finetune this to your usecase) Adding many references certainly helps when it comes to this as there's simply more options to choose from during the alignment steps.

    That being said, I don't think providing AmpliGone with 20.000 reference sequences (or a thousand even) is a good plan. Mainly because searching for primer-sequences in the references is by far the slowest part of AmpliGone and going through such a large amount of references will probably take several hours, if not multiple days (depending on the complexity of the primers). Additionally, you probably have to do this a few times to finetune the primer search error-rate to make sure the primers get found in every reference. I'm working on making this way more efficient in our 2.0.0 version that is on the horizon but this will take a while as I don't have that much time right now that I can dedicate to fixing this issue.

    However, I think using 20.000 references (or more) would be doable if you switch some steps up in the analysis. Because our read-alignment works through minimap2, it's possible to use minimap2 as a preprocessor for AmpliGone. What i mean with this is that you could use minimap2 as a tool to select the appropriate references from this pool of +20.000 references which actually match the reads that you're trying to analyze. This way you can bring the number of references back from +20.000 to something like 5 or maybe 15 references. And that number would make it a lot more feasible to let AmpliGone search through these reference for possible primer-binding sites.

    In practice this would come down to using minimap2 like follows:

    minimap2 -ax map-ont --secondary=no -t 8 references.fasta reads.fastq | samtools view -@ 4 -F 256 -F 512 -F 4 -F 2048 -uS | samtools sort -o aligned_reads.bam
    samtools index aligned_reads.bam

    Then you can use a script like this: https://gist.github.com/florianzwagemaker/6459c204eb2ff1166bcae6ffe2ca27ec Which will go over the bamfile and references fasta and will write a filtered references fasta file with only the references that actually had more than 20 mapped reads. It's then possible to give this fasta with 'filtered' reference to AmpliGone together with the original reads in order to speed the process up significantly. It may seem like a slightly more convoluted way to go about this but it might be worth a shot if you're willing.

My apologies for the delay in this answer (a lot of work came up past week all of a sudden). I hope however that the answers here are able to bring you a step closer to your goals, and I'd love to hear how AmpliGone works for you. If you have any more questions about AmpliGone then please let me know.