bokulich-lab / RESCRIPt

REference Sequence annotation and CuRatIon Pipeline
BSD 3-Clause "New" or "Revised" License
84 stars 26 forks source link

providing primers for `trim-alignment` provides different output vs explicit column positions for the same primers. #175

Open mikerobeson opened 4 months ago

mikerobeson commented 4 months ago

While working on the new qiime rescript trim-alignment tutorial, I started to realize that the output of trim-alignment does not seem to return expected results when using primer sequences for extracting the amplicon region, vs providing the amplicon positions directly.

When I run qiime alignment mafft-add --p-addfragments --i-sequences my_primers.qza ... and look at the exported output, I do observe what I expect. That is, I see the primer aligned where it should, and it maps to the explicit positions. In fact, I used this approach to help me map to these positions. However, when I run qiime rescript trim-alignment... the resulting extracted data appears spurious. It appears, in some cases that the sequences change / are different in the output, or the wrong region is extracted. Also, if I use 10 sequences vs 5k sequences, very different regions are extracted when using the primer search option. I am still not sure if the issue is myself, or not, so...

Anyway, I am still investigating this, and will try and come up with a better set of test cases, but perhaps, others can help me look into this. I suggest pulling the SILVA alignment, and subsample it for testing. Then compare the qiime alignment mafft-add --p-addfragments output with the qiime rescript trim-alignment outputs. I've been using the Unipro UGENE software to quickly visualize the SILVA alignments with the added primer sequences, etc.

Extra info and example commands

V4V5 primer info: 515F (Parada) - GTGYCAGCMGCCGCGGTAA 926R (Quince) - CCGYCAATTYMTTTRAGTTT

Expected alignment positions including the primer region from 50k alignment: 11895 - 28464

Expected alignment positions excluding the primer region region from 50k alignment: 13862 - 27655

Running the following will result in a combined alignment with several thousand positions less than there should be. That is: 43,515 alignment positions instead of 50,000. The problem is that the calculated position range, for the above "excluding the primer region" changes from 13862 - 27655 (50k alignment) to 13862 - 21844 (43k alignment).

qiime alignment mafft-add \
    --i-alignment silva-138-1-nr99-aln-dna-ss01.qza \
    --i-sequences V4V5_515F_926R_primers.qza \
    --p-addfragments \
    --o-expanded-alignment silva-138-1-nr99-aln-dna-ss01-v4v5.qza

Can confirm this by exporting to FASTA and view in your favorite alignment editor.

The following command , i.e. maintains the 50k alignment positions, and returns the proper alignment ranges. If you re-run the command without --keeplength you should obtain the same result as the qiime alignment mafft-add command.

mafft --preservecase --inputorder --thread 8 --keeplength   \
    --addfragments V4V5_515F_926R_primers.fasta  \
    silva-138-1-nr99-aln-dna-ss01-export/aligned-dna-sequences.fasta   \
    > aligned-dna-sequences-v4v5-pq.fasta

Compare to the outputs of:

qiime rescript trim-alignment \
    --i-aligned-sequences silva-138-1-nr99-aln-dna-ss01.qza \
    --p-primer-fwd GTGYCAGCMGCCGCGGTAA \
    --p-primer-rev CCGYCAATTYMTTTRAGTTT \
    --o-trimmed-sequences silva-138-1-nr99-aln-dna-ss01-v4v5-primer-trimmed.qza

and

qiime rescript trim-alignment \
    --i-aligned-sequences silva-138-1-nr99-aln-dna-ss01.qza \
    --p-position-start 11895 \
    --p-position-end 28464 \
    --o-trimmed-sequences silva-138-1-nr99-aln-dna-ss01-v4v5-pos-trimmed.qza

EDIT: It appears that the issue can be partially resolved by adding mafft's --keeplength parameter to qiime alignment mafft-add for our specific purpose here. Otherwise the final combined alignment length can change. Which is what is happening in my case. That is the alignment is shrinking throwing off alignment positions when using mafft-add. I've opened an issue here.