While working on the new qiime rescript trim-alignment tutorial, I started to realize that the output of trim-alignment does not seem to return expected results when using primer sequences for extracting the amplicon region, vs providing the amplicon positions directly.
When I run qiime alignment mafft-add --p-addfragments --i-sequences my_primers.qza ... and look at the exported output, I do observe what I expect. That is, I see the primer aligned where it should, and it maps to the explicit positions. In fact, I used this approach to help me map to these positions. However, when I run qiime rescript trim-alignment... the resulting extracted data appears spurious. It appears, in some cases that the sequences change / are different in the output, or the wrong region is extracted. Also, if I use 10 sequences vs 5k sequences, very different regions are extracted when using the primer search option. I am still not sure if the issue is myself, or not, so...
Anyway, I am still investigating this, and will try and come up with a better set of test cases, but perhaps, others can help me look into this. I suggest pulling the SILVA alignment, and subsample it for testing. Then compare the qiime alignment mafft-add --p-addfragments output with the qiime rescript trim-alignment outputs. I've been using the Unipro UGENE software to quickly visualize the SILVA alignments with the added primer sequences, etc.
Expected alignment positions including the primer region from 50k alignment:
11895 - 28464
Expected alignment positions excluding the primer region region from 50k alignment:
13862 - 27655
Running the following will result in a combined alignment with several thousand positions less than there should be. That is: 43,515 alignment positions instead of 50,000. The problem is that the calculated position range, for the above "excluding the primer region" changes from 13862 - 27655 (50k alignment) to 13862 - 21844 (43k alignment).
Can confirm this by exporting to FASTA and view in your favorite alignment editor.
The following command , i.e. maintains the 50k alignment positions, and returns the proper alignment ranges. If you re-run the command without --keeplength you should obtain the same result as the qiime alignment mafft-add command.
EDIT:
It appears that the issue can be partially resolved by adding mafft's --keeplength parameter to qiime alignment mafft-add for our specific purpose here. Otherwise the final combined alignment length can change. Which is what is happening in my case. That is the alignment is shrinking throwing off alignment positions when using mafft-add. I've opened an issue here.
While working on the new
qiime rescript trim-alignment
tutorial, I started to realize that the output oftrim-alignment
does not seem to return expected results when using primer sequences for extracting the amplicon region, vs providing the amplicon positions directly.When I run
qiime alignment mafft-add --p-addfragments --i-sequences my_primers.qza ...
and look at the exported output, I do observe what I expect. That is, I see the primer aligned where it should, and it maps to the explicit positions. In fact, I used this approach to help me map to these positions. However, when I runqiime rescript trim-alignment...
the resulting extracted data appears spurious. It appears, in some cases that the sequences change / are different in the output, or the wrong region is extracted. Also, if I use 10 sequences vs 5k sequences, very different regions are extracted when using the primer search option. I am still not sure if the issue is myself, or not, so...Anyway, I am still investigating this, and will try and come up with a better set of test cases, but perhaps, others can help me look into this. I suggest pulling the SILVA alignment, and subsample it for testing. Then compare the
qiime alignment mafft-add --p-addfragments
output with theqiime rescript trim-alignment
outputs. I've been using the Unipro UGENE software to quickly visualize the SILVA alignments with the added primer sequences, etc.Extra info and example commands
V4V5 primer info: 515F (Parada) - GTGYCAGCMGCCGCGGTAA 926R (Quince) - CCGYCAATTYMTTTRAGTTT
Expected alignment positions including the primer region from 50k alignment: 11895 - 28464
Expected alignment positions excluding the primer region region from 50k alignment: 13862 - 27655
Running the following will result in a combined alignment with several thousand positions less than there should be. That is: 43,515 alignment positions instead of 50,000. The problem is that the calculated position range, for the above "excluding the primer region" changes from 13862 - 27655 (50k alignment) to 13862 - 21844 (43k alignment).
Can confirm this by exporting to FASTA and view in your favorite alignment editor.
The following command , i.e. maintains the 50k alignment positions, and returns the proper alignment ranges. If you re-run the command without
--keeplength
you should obtain the same result as theqiime alignment mafft-add
command.Compare to the outputs of:
and
EDIT: It appears that the issue can be partially resolved by adding mafft's --keeplength parameter to
qiime alignment mafft-add
for our specific purpose here. Otherwise the final combined alignment length can change. Which is what is happening in my case. That is the alignment is shrinking throwing off alignment positions when usingmafft-add
. I've opened an issue here.