Closed gregcaporaso closed 9 years ago
I'm currently working on a fix for this. With PyNAST-aligned versions of the 13_8 85% OTUs, I'm still seeing slightly lower percent id matches for the V2 sequences (data here, though the results are much better now than when comparing against qiime-default-reference 0.1.1
. However... not all of the 85% OTUs were align-able with PyNAST (I think) as I'm finding that filtering as follows results in fewer aligned than unaligned sequences:
filter_fasta.py -f $HOME/data/gg_13_8_otus/rep_set_aligned/gg_13_5_pynast.fasta -o $HOME/data/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta -a $HOME/data/gg_13_8_otus/rep_set/85_otus.fasta
and that there are still sometimes very minor differences in the sequences if I unalign the aligned sequence and compare them against the corresponding sequences in gg_13_8/rep_set/85_otus.fasta
.
I'm continuing to look into this, and am considering reverting to the core set for the time-being in qiime-default-reference 0.1.2
(if the licensing allows that, which I'll be looking into).
In the meantime, the work-around for QIIME users working with non-515F/806R 16S primers will be to align against the core set, which is available here. This file path can be added as the value for See updated work-around below.pynast_template_alignment_fp
in qiime_config
, as described here.
@gregcaporaso, the following should do it (pynast aligned here)
gunzip gg_13_5_pynast.fasta.gz
filter_fasta.py -f gg_13_5_pynast.fasta -o $HOME/data/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta -a $HOME/data/gg_13_8_otus/rep_set/85_otus.fasta
Is that different than the command I posted above?
Not sure, is that input file the same as the file I linked? It isn't part of the otus package hence the confusion
A length check may not be sufficient if pynast can trim 5' or 3' On Apr 14, 2015 17:53, "Greg Caporaso" notifications@github.com wrote:
Is that different than the command I posted above?
— Reply to this email directly or view it on GitHub https://github.com/biocore/qiime-default-reference/issues/14#issuecomment-93126479 .
Also, there were about 1500 seqs that didn't align with pynast as youve noted On Apr 14, 2015 17:57, "Daniel T. McDonald" Daniel.Mcdonald@colorado.edu wrote:
Not sure, is that input file the same as the file I linked? It isn't part of the otus package hence the confusion
A length check may not be sufficient if pynast can trim 5' or 3' On Apr 14, 2015 17:53, "Greg Caporaso" notifications@github.com wrote:
Is that different than the command I posted above?
— Reply to this email directly or view it on GitHub https://github.com/biocore/qiime-default-reference/issues/14#issuecomment-93126479 .
I think you forgot the link, but I pulled my file from secondgenome's download page, and the md5 is correct:
MD5 (gg_13_5_pynast.fasta.gz) = 24720085027c35132de0588fee518065
Ok, given that it's known that some of the sequences in 13_5 don't align with PyNAST (which would explain why I have fewer), and that the differences that I notice in the sequences are small and (based on spot-checks) in the ends of sequences, I think we're good with using this PyNAST 85% OTU alignment as the default template alignment. I'll work on building and testing this release, then testing with QIIME.
Updated work-around: In the meantime, the work-around for QIIME users working with non-515F/806R 16S primers will be to align against the Greengenes 13_5 PyNAST 85% OTU alignment, which is available here. After downloading and unzipping that file, it's absolute file path can be added as the value for pynast_template_alignment_fp
in qiime_config
, as described here. This will achieve the same result as when we release qiime-default-reference 0.1.2
.
Apparently github silently doesn't render FTP links, sorry about that. Thanks for getting this together, and yet more motivation for another GG release
In an additional test of PyNAST alignment of V2 sequences with the core set versus the PyNAST 85% OTU alignment added in #15, I find that only slightly more sequences fail to align with the PyNAST 85% OTU alignment than with the old core set. Details of my test follow. Since we don't have an inherent way to say which is better (i.e., it's possible that we want those additional sequences to fail), and we have complete understanding of how the PyNAST 85% OTU alignment was generated (which is not true for the core set), we'll move forward with the plan to use the PyNAST 85% OTU alignment as the default reference template alignment.
My first test was performed with the V2 sequences deposited under study id 449
in the QIIME database (the Costello Whole Body) dataset. I ran the following command to pick OTUs and perform PyNAST alignment against the PyNAST 85% OTU alignment:
$ pick_open_reference_otus.py -i /home/caporaso/analysis/whole-body/study_449_split_library_seqs.fna --suppress_taxonomy_assignment -aO 29 -o /home/caporaso/analysis/whole-body/or-otus
$ count_seqs.py -i "pynast_aligned_seqs/rep_set_*fasta"
204 : pynast_aligned_seqs/rep_set_failures.fasta (Sequence lengths (mean +/- std): 181.9706 +/- 56.2609)
10200 : pynast_aligned_seqs/rep_set_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
10404 : Total
This tells us that 204 sequences failed to align with PyNAST out of 10664 (there is one header line).
Then, I re-aligned the representative sequences against the core set
$ parallel_align_seqs_pynast.py -i /home/caporaso/analysis/whole-body/or-otus/rep_set.fna -o /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs_core_set --jobs_to_start 30 -t /data/qiime_software/core_set_aligned.fasta.imputed
$ count_seqs.py -i "pynast_aligned_seqs_core_set/rep_set_*fasta"
171 : pynast_aligned_seqs_core_set/rep_set_failures.fasta (Sequence lengths (mean +/- std): 166.0292 +/- 46.7550)
10233 : pynast_aligned_seqs_core_set/rep_set_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
10404 : Total
Also, it's important to note that different sequences failed to align with the two data sets (i.e., the sequences that fail to align against the core set are not a subset of the sequences that fail to align against the 85% PyNAST-aligned OTUs).
# find the sequences that didn't align against the 85% PyNAST-aligned OTUs
# but did align against the core set
$ filter_fasta.py -f pynast_aligned_seqs/rep_set_failures.fasta -o pynast_aligned_seqs/failures-that-aligned-v-core-set.fasta -a pynast_aligned_seqs_core_set/rep_set_aligned.fasta
caporaso@bacon 08:07:58 or-otus$ count_seqs.py -i pynast_aligned_seqs/failures-that-aligned-v-core-set.fasta
47 : pynast_aligned_seqs/failures-that-aligned-v-core-set.fasta (Sequence lengths (mean +/- std): 256.3404 +/- 16.3086)
47 : Total
# find the sequences that didn't align against the core set
# but did align against the 85% PyNAST-aligned OTUs
$ filter_fasta.py -f pynast_aligned_seqs_core_set/rep_set_failures.fasta -o pynast_aligned_seqs_core_set/failures-that-aligned-v-pyn-85-otus.fasta -a pynast_aligned_seqs/rep_set_aligned.fasta
$ count_seqs.py -i pynast_aligned_seqs/failures-that-aligned-v-pyn-85-otus.fasta
14 : pynast_aligned_seqs/failures-that-aligned-v-pyn-85-otus.fasta (Sequence lengths (mean +/- std): 236.9286 +/- 11.0677)
14 : Total
I'm now investigating what those hit, and will follow up on this issue when I have time to analyze.
$ parallel_blast.py -i /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs_core_set/failures-that-aligned-v-pyn-85-otus.fasta -o /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs_core_set/failures-that-aligned-v-pyn-85-otus-blast-results/ -b /data/blast/nt -O 10
$ grep -v '^#' pynast_aligned_seqs/failures-that-aligned-v-core-set-blast-results/failures-that-aligned-v-core-set_blast_out.txt
New.CleanUp.ReferenceOTU10449 gi|238404362|gb|GQ002078.1| 100.00 150 0 0 1 150 317 168 3e-77 297
New.CleanUp.ReferenceOTU10796 gi|296967670|gb|HM274075.1| 100.00 132 0 0 1 132 317 186 1e-66 262
New.CleanUp.ReferenceOTU11004 gi|322225548|gb|JF240143.1| 100.00 148 0 0 1 148 317 170 4e-76 293
New.CleanUp.ReferenceOTU11548 gi|344263683|gb|JN382529.1| 96.95 164 1 3 1 164 304 145 3e-67 264
New.CleanUp.ReferenceOTU12392 gi|322226988|gb|JF241583.1| 100.00 158 0 0 1 158 317 160 4e-82 313
New.CleanUp.ReferenceOTU12511 gi|322128498|gb|JF143092.1| 99.36 156 0 1 1 155 317 162 4e-76 293
New.CleanUp.ReferenceOTU12532 gi|297354760|gb|HM084636.1| 98.94 188 2 0 4 191 202 15 4e-95 357
New.CleanUp.ReferenceOTU13882 gi|326833420|gb|JF487288.1| 96.91 194 1 4 1 194 201 13 3e-80 307
New.CleanUp.ReferenceOTU14202 gi|460693648|gb|JX812596.1| 98.79 165 0 1 1 165 317 155 1e-79 305
New.CleanUp.ReferenceOTU15614 gi|399187736|gb|JQ461075.1| 100.00 141 0 0 1 141 330 190 6e-72 280
New.CleanUp.ReferenceOTU16729 gi|322187448|gb|JF202043.1| 100.00 182 0 0 1 182 305 124 2e-96 361
New.CleanUp.ReferenceOTU17453 gi|297367873|gb|HM097749.1| 99.31 145 1 0 1 145 245 101 6e-72 280
New.CleanUp.ReferenceOTU18041 gi|545339281|gb|KF502513.1| 99.04 209 0 2 1 209 174 380 6e-103 383
New.CleanUp.ReferenceOTU18339 gi|297368394|gb|HM098270.1| 100.00 184 0 0 1 184 237 54 2e-97 365
New.CleanUp.ReferenceOTU18724 gi|385889132|gb|JQ726783.1| 99.44 179 0 1 1 178 312 134 8e-90 339
New.CleanUp.ReferenceOTU1891 gi|399204424|gb|JQ477763.1| 96.24 186 2 3 1 186 348 168 4e-76 293
New.CleanUp.ReferenceOTU2098 gi|290733018|gb|GU536833.1| 100.00 162 0 0 1 162 326 165 2e-84 321
New.CleanUp.ReferenceOTU21175 gi|545342750|gb|KF505982.1| 100.00 141 0 0 1 141 182 322 6e-72 280
New.CleanUp.ReferenceOTU21766 gi|322194586|gb|JF209181.1| 100.00 180 0 0 1 180 308 129 4e-95 357
New.CleanUp.ReferenceOTU21849 gi|238268948|gb|GQ037048.1| 100.00 155 0 0 1 155 316 162 3e-80 307
New.CleanUp.ReferenceOTU22455 gi|270087661|gb|GQ949123.1| 98.45 193 0 2 1 193 229 40 1e-91 345
New.CleanUp.ReferenceOTU22842 gi|238283618|gb|GQ063996.1| 98.63 146 0 2 1 146 298 155 2e-65 258
New.CleanUp.ReferenceOTU22953 gi|214017914|gb|FJ363527.1| 100.00 181 0 0 1 181 340 160 9e-96 359
New.CleanUp.ReferenceOTU2306 gi|297036657|gb|HM343062.1| 98.80 166 0 2 1 166 326 163 3e-77 297
New.CleanUp.ReferenceOTU23446 gi|326814557|gb|JF468425.1| 99.33 150 1 0 1 150 212 63 6e-75 289
New.CleanUp.ReferenceOTU24551 gi|402535884|gb|JX515809.1| 97.95 195 1 3 1 195 192 1 2e-87 331
New.CleanUp.ReferenceOTU24622 gi|238305188|gb|GQ069011.1| 100.00 150 0 0 1 150 317 168 3e-77 297
New.CleanUp.ReferenceOTU24747 gi|322226120|gb|JF240715.1| 100.00 148 0 0 1 148 317 170 4e-76 293
New.CleanUp.ReferenceOTU24753 gi|326824061|gb|JF477929.1| 94.25 226 1 8 1 225 221 7 6e-75 289
New.CleanUp.ReferenceOTU303 gi|295983658|gb|HM021376.1| 100.00 185 0 0 1 185 300 116 3e-98 367
New.CleanUp.ReferenceOTU3067 gi|511487215|gb|KC999379.1| 98.47 196 0 3 1 196 265 73 2e-90 341
New.CleanUp.ReferenceOTU3375 gi|478686098|gb|JX520102.1| 100.00 150 0 0 1 150 278 129 3e-77 297
New.CleanUp.ReferenceOTU3384 gi|297361799|gb|HM091675.1| 96.40 222 8 0 1 222 231 10 3e-101 377
New.CleanUp.ReferenceOTU3739 gi|558620814|gb|KF632504.1| 100.00 188 0 0 1 188 198 11 6e-100 373
New.CleanUp.ReferenceOTU4059 gi|559795125|ref|NR_104711.1| 100.00 139 0 0 1 139 298 160 1e-70 276
New.CleanUp.ReferenceOTU4528 gi|375112268|gb|JQ071822.1| 99.31 144 1 0 1 144 260 117 3e-71 278
New.CleanUp.ReferenceOTU5484 gi|545341529|gb|KF504761.1| 100.00 144 0 0 1 144 181 324 1e-73 285
New.CleanUp.ReferenceOTU6173 gi|409109600|gb|JX489915.1| 96.74 184 6 0 1 184 338 155 3e-83 317
New.CleanUp.ReferenceOTU6680 gi|335334350|emb|FN658915.1| 100.00 144 0 0 1 144 335 192 1e-73 285
New.CleanUp.ReferenceOTU720 gi|547220263|dbj|AB810555.1| 100.00 144 0 0 1 144 236 93 1e-73 285
New.CleanUp.ReferenceOTU7237 gi|326814557|gb|JF468425.1| 99.33 150 1 0 1 150 212 63 7e-75 289
New.CleanUp.ReferenceOTU7455 gi|326833420|gb|JF487288.1| 95.15 206 5 5 1 206 201 1 6e-75 289
New.CleanUp.ReferenceOTU7553 gi|290741669|gb|GU545484.1| 96.46 198 6 1 1 198 197 1 7e-87 329
New.CleanUp.ReferenceOTU7914 gi|322226988|gb|JF241583.1| 100.00 158 0 0 1 158 317 160 5e-82 313
New.CleanUp.ReferenceOTU7942 gi|297357653|gb|HM087529.1| 99.39 164 1 0 1 164 214 51 3e-83 317
New.CleanUp.ReferenceOTU8857 gi|459996585|gb|KC510222.1| 98.26 172 3 0 1 172 334 163 3e-83 317
New.CleanUp.ReferenceOTU9075 gi|326815667|gb|JF469535.1| 96.07 178 6 1 1 177 178 1 7e-75 289
$ parallel_blast.py -i /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs/failures-that-aligned-v-core-set.fasta -o /home/caporaso/analysis/whole-body/or-otus/pynast_aligned_seqs/failures-that-aligned-v-core-set-blast-results/ -b /data/blast/nt -O 10
$ grep -v '^#' pynast_aligned_seqs_core_set/failures-that-aligned-v-pyn-85-otus-blast-results/failures-that-aligned-v-pyn-85-otus_blast_out.txt
New.CleanUp.ReferenceOTU11778 gi|7670301|dbj|AB029368.1| 100.00 203 0 0 1 203 503 301 5e-109 402
New.CleanUp.ReferenceOTU14235 gi|379462481|gb|JQ164969.1| 97.87 188 2 1 1 188 317 132 1e-88 335
New.CleanUp.ReferenceOTU14439 gi|341989644|gb|JF554459.1| 99.55 223 1 0 1 223 331 109 2e-118 434
New.CleanUp.ReferenceOTU15749 gi|545342948|gb|KF506180.1| 97.24 181 4 1 1 181 155 334 2e-81 311
New.CleanUp.ReferenceOTU16523 gi|401661910|emb|FR714868.1| 96.93 228 2 4 1 228 275674 275452 3e-98 367
New.CleanUp.ReferenceOTU20112 gi|545343825|gb|KF507057.1| 99.37 159 0 1 1 159 154 311 6e-78 299
New.CleanUp.ReferenceOTU20493 gi|545343825|gb|KF507057.1| 98.78 164 2 0 1 164 154 317 6e-81 309
New.CleanUp.ReferenceOTU2210 gi|154198989|gb|EF706223.1| 98.78 164 2 0 1 164 255 92 7e-81 309
New.CleanUp.ReferenceOTU2265 gi|301286413|gb|HM785615.1| 98.91 183 2 0 1 183 238 56 3e-92 347
New.CleanUp.ReferenceOTU23577 gi|322201028|gb|JF215623.1| 100.00 179 0 0 1 179 311 133 1e-94 355
New.CleanUp.ReferenceOTU24778 gi|319496516|gb|HQ789377.1| 98.35 182 3 0 1 182 308 127 3e-89 337
New.CleanUp.ReferenceOTU2799 gi|443301425|emb|HE613602.1| 93.94 231 12 2 1 230 353 124 2e-87 331
New.CleanUp.ReferenceOTU5311 gi|322201028|gb|JF215623.1| 100.00 179 0 0 1 179 311 133 1e-94 355
New.CleanUp.ReferenceOTU8921 gi|322201028|gb|JF215623.1| 99.44 179 1 0 1 179 311 133 3e-92 347
My next test was performed with the V4 sequences deposited under study id 550
in the QIIME database (the Moving Pictures) dataset. I ran the following command to pick OTUs and perform PyNAST alignment against the PyNAST 85% OTU alignment:
$ parallel_align_seqs_pynast.py -i $PWD/rep_set.fna -o $PWD/pynast_aligned_seqs_85_pyn --jobs_to_start 30
$ count_seqs.py -i "$PWD/pynast_aligned_seqs_85_pyn/rep_set_*fasta"
17246 : /home/caporaso/analysis/moving-pictures/or_otus/pynast_aligned_seqs_85_pyn/rep_set_failures.fasta (Sequence lengths (mean +/- std): 92.2641 +/- 17.1674)
131449 : /home/caporaso/analysis/moving-pictures/or_otus/pynast_aligned_seqs_85_pyn/rep_set_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
148695 : Total
This is 11.6% failing to align against the 85% PyNAST-aligned OTUs.
$ parallel_align_seqs_pynast.py -i $PWD/rep_set.fna -o $PWD/pynast_aligned_seqs_core_set --jobs_to_start 30 -t /data/qiime_software/core_set_aligned.fasta.imputed
$ count_seqs.py -i "pynast_aligned_seqs_core_set/*fasta"
18487 : pynast_aligned_seqs_core_set/rep_set_failures.fasta (Sequence lengths (mean +/- std): 94.8743 +/- 19.5905)
130208 : pynast_aligned_seqs_core_set/rep_set_aligned.fasta (Sequence lengths (mean +/- std): 7682.0000 +/- 0.0000)
148695 : Total
This is 12.4% failing to align against the core set.
A QIIME Forum user working with V2 sequences noticed that many more sequences were failing to be aligned with PyNAST in QIIME 1.9.0 than in QIIME 1.8.0. After some investigation, it turns out that the alignments packaged with Greengenes (and thus in the 0.1.0 and 0.1.1 releases of
qiime-default-reference
) are not in fact the PyNAST aligned sequences, but rather are SSU aligned, and some positions are filtered. We will be fixing this forqiime-default-reference 0.1.2
.This doesn't seem to be an issue when working with sequences generated from the 515F/806R primers (which is what I tested with when I brought the question of switching from the old core set to the 85% aligned OTUs), but it does appear to be a problem with V2 sequences (presumably from the 27F primer).
For users working with non-515F/806R sequences, this means that PyNAST alignments will need to be regenerated. If you're using OTU tables that were generated with
pick_open_reference_otus.py
, those should be regenerated, as theno_pynast_failures
table will have many OTUs filtered from it. You'll need to first upgrade toqiime-default-reference 0.1.2
. Once that goes live, you can do that as follows:We'll follow up on this issue, the QIIME blog, and the QIIME Forum when this is ready.