benjjneb / dada2

Accurate sample inference from amplicon data with single nucleotide resolution
http://benjjneb.github.io/dada2/
GNU Lesser General Public License v3.0

Greengenes ID for Picrust #168

Closed chadsmith123 closed 7 years ago

chadsmith123 commented 7 years ago

I was wondering if it is possible to get the Greengenes reference sequence ID used to assign taxonomy to the sequence variants inferred by dada2. This information is useful for other pipelines such as PICRUSt.

Cheers, Chad

vmaffei commented 7 years ago

Do you have the original Greengenes database files already? (ftp://greengenes.microbio.me/greengenes_release/gg_13_5/gg_13_8_otus.tar.gz) Try assigning taxonomy with an original Greengenes _otus.fasta file found in the "rep_set/" folder (replace with the desired cluster percentage: 97, 99, etc.). The _otus.fasta headers contain OTUIDs instead of the OTUID-free Greengenes taxonomy strings used in the DADA2-formatted Greengenes database. If interested, do a quick less on both files to see the difference. After assignment, you can use read.table() to read in the matching _otu_taxonomy.txt file found in "taxonomy/" if you want to add the taxonomy back in alongside the OTUIDs for PICRUSt or other pipelines. Happy to add example code for that if interested.
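(Editor's sketch of the read.table()/lookup idea in shell terms, using toy stand-ins for taxonomy/97_otu_taxonomy.txt and for OTUIDs returned by assignTaxonomy; file names and contents here are made up for illustration.)

```shell
# Toy stand-in for taxonomy/97_otu_taxonomy.txt (tab-separated: OTUID <tab> taxonomy)
printf '1111883\tk__Bacteria; p__Gemmatimonadetes; c__Gemm-1; o__; f__; g__; s__\n' > otu_taxonomy.txt
printf '813217\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Klebsiella; s__\n' >> otu_taxonomy.txt
# Toy stand-in for OTUIDs assigned against rep_set/97_otus.fasta
printf '813217\n' > assigned_ids.txt
# Join each assigned OTUID back to its Greengenes taxonomy string
awk -F'\t' 'NR==FNR { tax[$1] = $2; next } { print $1 "\t" tax[$1] }' \
    otu_taxonomy.txt assigned_ids.txt > assigned_with_taxonomy.txt
cat assigned_with_taxonomy.txt
```

The same join is a one-liner merge() in R; the point is only that the OTUID is the key shared by the assignment output and the _otu_taxonomy.txt table.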

chadsmith123 commented 7 years ago

I ran assignTaxonomy with the Greengenes database from my QIIME install but got a memory allocation failure. Using the DADA2-optimized Greengenes database works. Thanks!

packageVersion("dada2")
[1] ‘1.2.0’
taxa.gg13_8.id <- assignTaxonomy(seqtab.nochim, "~/work/qiime/gg/dada/gg_13_8_train_set_97_wids.fa.gz")
Error in C_assign_taxonomy(seqs, refs, ref.to.genus, tax.mat.int, verbose) :
  Memory allocation failed.
system("gunzip -c ~/work/qiime/gg/dada/gg_13_8_train_set_97_wids.fa.gz | grep '>' | head -n4")
1111883
1111882
1111879
1111874


vmaffei commented 7 years ago

Try passing in the seqs directly (if you aren't already):

taxa.gg13_8.id <- assignTaxonomy(rownames(seqtab.nochim[1:5,]),

chadsmith123 commented 7 years ago

Thanks for your quick reply. Same error - I assume you mean 'colnames'? rownames returns the sample IDs.

rownames(seqtab.nochim[1:5,])
[1] "125-Plate-2-E4" "126-Plate-2-F4" "127-Plate-2-G4" "128-Plate-2-H4" "131-Plate-2-C5"

taxa.gg13_8.id <- assignTaxonomy(colnames(seqtab.nochim[1:5,]), "~/work/qiime/gg/dada/gg_13_8_train_set_97_wids.fa.gz")
Error in C_assign_taxonomy(seqs, refs, ref.to.genus, tax.mat.int, verbose) :
  Memory allocation failed.


vmaffei commented 7 years ago

Whoops, yes! Wherever your seqs are (colnames or rownames). For seqs in the columns, change seqtab.nochim[1:5,] to seqtab.nochim[,1:5] and pass in colnames(seqtab.nochim[,1:5]). I wanted to see if passing in just a few sequences (not samples) also produces the memory error.

assignTaxonomy(colnames(seqtab.nochim[,1:5]),"/rep_set/97_otus.fasta.gz",minBoot=80)

returns for me:

CCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGAAACC  NA
CCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACC  NA
CCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACC  "1111294"
CCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCATGTAGGCGGGCTTTTAAGTCTGACGTGAAAATGCGGGGCTTAACCCCGTATGGCGTTGGATACTGGAAGTCTTGAGTGCAGGAGAGGAAAGGGGAATTCCCAGTGTAGCGGTGAAATGCGTAGATATTGGGAGGAACACCAGTGGCGAAGGCGCCTTTCTGGACTGTGTCTGACGCTGAGATGCGAAAGCCAGGGTAGCAAACGGGATTAGAAACC  "185659"
CCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCGCGCAGGCGGCATCGCAAGTCGGTCTTAAAAGTGCGGGGCTTAACCCCGTGAGGGGACCGAAACTGTGAAGCTCGAGTGTCGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAAGCGGCTTTCTGGACGACAACTGACGCTGAGGCGCGAAAGCCAGGGGAGCAAACGGGATTAGAAACC  NA

chadsmith123 commented 7 years ago

It fails when I try it. I tried the gg_13_5 set and got the same error. I also tried your sequences: they work with the DADA2 training set but not with the gg_13_8 training set that has the IDs in the header. Any advice on how to proceed?

colnames(seqtab.nochim[,1:5])
[1] "ATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTTATTAGGTCTGATGTGAAAGCCCTCGGCTCAACCGAGGAAGTGCATCGGAAACCGGTAAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGTAGCAAACAGG"
[2] "ATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGAAACTGGCAGGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG"
[3] "ATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTCTTTTAAGTCTAATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG"
[4] "ATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCAAACAGG"
[5] "ATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGAAACTGGCAAGCTAGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG"

assignTaxonomy(colnames(seqtab.nochim[,1:5]), "~/qiime/gg/dada/gg_13_8_train_set_97_wids.fa.gz")
Error in C_assign_taxonomy(seqs, refs, ref.to.genus, tax.mat.int, verbose) :
  Memory allocation failed.


vmaffei commented 7 years ago

Interesting! Not quite sure what's wrong. Are both gzipped files relatively the same size (29M for gg_13_8_wids and 30M for DADA training set)? Any differences other than the headers? When I do

cp 97_otus.fasta.gz test_ids.gz
cp gg_13_8_train_set_97.fa.gz test_noids.gz
gunzip test*
diff test_ids test_noids

the only differences I have are in the header names (taxa in the DADA2 train set, OTUIDs in 97_otus.fasta)

EX:

1c1
< >1111883
---
> >k__Bacteria;p__Gemmatimonadetes;c__Gemm-1;o__;f__;g__;s__;
3c3
< >1111882
---
> >k__Bacteria;p__Bacteroidetes;c__Flavobacteriia;o__Flavobacteriales;f__Flavobacteriaceae;g__Flavobacterium;s__;
5c5
< >1111879
---
> >k__Bacteria;p__Chloroflexi;c__Chloroflexi;o__[Roseiflexales];f__[Roseiflexaceae];g__Roseiflexus;s__;

chadsmith123 commented 7 years ago

I tried it on another computer and it worked. I'm not sure what is up with my specific installation.

I merged the OTUID assignments with the run that used the gg DADA2 taxonomy database - most of the sequences do not have Greengenes IDs - any idea why this could happen? Perhaps linked to the lack of taxonomy structure in the FASTA headers of the Greengenes ID database file?

p01.tax.ggid[c("Family","Genus","Species","ggid")]
               Family         Genus      Species   ggid
1  Enterobacteriaceae    Klebsiella               813217
2  Enterobacteriaceae
3  Enterobacteriaceae       Pantoea  agglomerans
4    Streptococcaceae   Lactococcus     garvieae
5    Lactobacillaceae   Pediococcus               773251
6    Lactobacillaceae   Pediococcus acidilactici  102222
7    Lactobacillaceae   Pediococcus acidilactici  102222
8    Lactobacillaceae Lactobacillus
9    Lactobacillaceae                             301921
10   Lactobacillaceae
11   Lactobacillaceae


chadsmith123 commented 7 years ago

Here are the specific sequences used as input into assignTaxonomy.

ATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG  "813217"
ATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTGGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG  NA
ATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGAAACTGGCAGGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGG  NA
ATACGTAGGTCCCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGTGGTTTCTTAAGTCTGATGTAAAAGGCAGTGGCTCAACCATTGTGTGCATTGGAAACTGGGAGACTTGAGTGCAGGAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGAGGCGAAAGCGGCTCTCTGGCCTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAACAGG  NA
ATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTCTTTTAAGTCTAATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGCAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG  "773251"
ATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTCTTTTAAGTCTAATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG  "102222"
ATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGCGCAGGTGGTCTTTTAAGTCTAATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCGAACAGG  "102222"
ATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGCGAGTGCAGGCGGTTTATTAGGTCTGATGTGAAAGCCCTCGGCTCAACCGAGGAAGTGCATCGGAAACCGGTAAACTTGAGTGCAGAAGAGGAGAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGTAGCAAACAGG  NA
ATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCAAACAGG  "301921"
ATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTCAACCGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCAAACAGG  NA
ATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGATGTGAAAGCCTTCGGCTCAACTGAAGAAGTGCATCGGAAACTGGGAAACTTGAGTGCAGAAGAGGACAGTGGAACTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTGTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCATGGGTAGCAAACAGG  NA


vmaffei commented 7 years ago

You may want to wait for a dada2 contributor/author to comment before spending time on this, but here's my take on it (for what it's worth).

They are NA because they didn't meet the minimum bootstrap threshold (my example above, quoted below again, has a few NAs due to bootstrap values < 80, which I set with minBoot). You may have to relax the bootstrap threshold when using rep_set/97_otus.fasta to get a similar number of successful assignments as you get with gg_dada2_training.

Long story short, the taxonomic hierarchy in the gg_dada2_training db gives rise to higher bootstrap values than the absent hierarchy info in rep_set/97_otus.fasta (RDP is "trained" to infer sequence taxonomy from level-wise grouping of the taxonomic information in the db). Since the train set and rep_set db are otherwise identical (excluding headers), another working solution would be to append the matching OTUID as a final "taxonomic level" on each header in a new, third database (k;p;c;o;f;g;s;**id**). If we go this route, we may want to rerun assignTaxonomy on the new db with minBoot = 0 and outputBootstrap = TRUE (and taxLevels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "id")) to help decide whether or not to keep an assigned OTUID based on its bootstrapped confidence.

Gonna give this a try real quick...will report back!

CCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGAAACC  NA
CCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGAATTCGTGGTGTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCCTGCTAAGCTGCAACTGACATTGAGGCTCGAAAGTGTGGGTATCAAACAGGATTAGATACC  NA
CCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGAAACC  "1111294"
CCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCATGTAGGCGGGCTTTTAAGTCTGACGTGAAAATGCGGGGCTTAACCCCGTATGGCGTTGGATACTGGAAGTCTTGAGTGCAGGAGAGGAAAGGGGAATTCCCAGTGTAGCGGTGAAATGCGTAGATATTGGGAGGAACACCAGTGGCGAAGGCGCCTTTCTGGACTGTGTCTGACGCTGAGATGCGAAAGCCAGGGTAGCAAACGGGATTAGAAACC  "185659"
CCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCGCGCAGGCGGCATCGCAAGTCGGTCTTAAAAGTGCGGGGCTTAACCCCGTGAGGGGACCGAAACTGTGAAGCTCGAGTGTCGGAGAGGAAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAAGCGGCTTTCTGGACGACAACTGACGCTGAGGCGCGAAAGCCAGGGGAGCAAACGGGATTAGAAACC  NA

note: outputBootstrap is available in the GitHub install of dada2, if you don't have it already: library(devtools); install_github("benjjneb/dada2")

benjjneb commented 7 years ago

Two answers:

First, to get the best k-mer hit, follow @vmaffei's instructions above, but set minBoot=0. That will give you whichever GGID has the highest k-mer overlap with the query sequence.

HOWEVER, understand that this method of assigning GGIDs is NOT the one used in the paper introducing and benchmarking Picrust, and as far as I know we do NOT know how well it will perform. The Picrust paper uses 97% closed reference OTU clustering to assign to GGIDs. Such clustering is not implemented in the dada2 package, and so we cannot provide a native implementation of the benchmarked Picrust workflow. As a result, we do not officially support such Picrust workflows.

ps: If anyone knows of benchmarking of kmer-based assignment of GG-IDs for use in Picrust, I'd love to see it. This is perhaps the most commonly requested feature on our forum here, so if we could provide official guidance we'd be happy to.

pps: Thanks to @vmaffei for being so helpful here, that is awesome.

vmaffei commented 7 years ago

anytime, @benjjneb ! @chadsmith123 just to follow through, I added OTUIDs from 97_otus.fasta.gz to the header ends in gg_13_8_train_set_97.fa.gz using the following:

# pull the taxonomy headers from the DADA2 train set
gunzip -c gg_13_8_train_set_97.fa.gz | grep '>' > train_head
# pull the matching OTUIDs from the rep_set headers (sequence lines contain no digits)
gunzip -c 97_otus.fasta.gz | grep -o '[0-9]*' > rep_set_head
# append an "id__" rank to the end of each taxonomy header
sed 's/$/id__/' train_head > train_id
# join each taxonomy header to its OTUID ('\0' is the portable empty delimiter)
paste -d '\0' train_id rep_set_head > train_ids
# close the final rank with a semicolon
sed 's/$/;/' train_ids > train_ids_head
# pull the sequence lines (all-caps) from the rep set
gunzip -c 97_otus.fasta.gz | grep -o '[A-Z]*' > seqs_no_head
# interleave headers and sequences into the new fasta
paste -d '\n' train_ids_head seqs_no_head > gg_13_8_train_ids_97.fasta
gzip gg_13_8_train_ids_97.fasta

Using gg_13_8_train_ids_97.fasta.gz in assignTaxonomy will assign the OTUID along with the paired taxonomy, so you don't have to add it back later. FYI, this "third" database performs exactly the same as rep_set/97_otus.fasta under RDP since the IDs are now factored into the taxonomy.
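(Editor's sketch: the same header merge run on toy two-record stand-ins for the real gg_13_8 pair, as a quick sanity check that the resulting headers take the k__...;s__;id__NNN; shape. File names and contents here are made up.)

```shell
# Toy stand-ins: DADA2 train set (taxonomy headers) and rep_set (OTUID headers),
# records in the same order, as in the real gg_13_8 files
printf '>k__Bacteria;p__Gemmatimonadetes;c__Gemm-1;o__;f__;g__;s__;\nACGTACGT\n' > train.fa
printf '>1111883\nACGTACGT\n' > rep.fa
# taxonomy headers with a trailing "id__" rank
grep '>' train.fa | sed 's/$/id__/' > train_id
# matching OTUIDs
grep '>' rep.fa | grep -o '[0-9]*' > rep_ids
# join header + OTUID ('\0' = empty delimiter), close the rank with ';'
paste -d '\0' train_id rep_ids | sed 's/$/;/' > headers
# sequence lines only
grep -v '>' rep.fa > seqs
# interleave into the merged fasta
paste -d '\n' headers seqs > train_ids.fa
head -n 1 train_ids.fa
# -> >k__Bacteria;p__Gemmatimonadetes;c__Gemm-1;o__;f__;g__;s__;id__1111883;
```

This assumes the two FASTAs list records in the same order, which holds for the gg_13_8 train set and rep_set pair discussed above.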

vmaffei commented 7 years ago

@benjjneb I've been thinking about a DADA2 -> PICRUSt pipeline that might be useful (and possibly more accurate than a k-mer-based OTUID-assignment PICRUSt pipeline).

Punchline: why not perform KEGG genome predictions on the denoised amplicon sequences themselves through the PICRUSt-provided genome prediction tutorial?

The PICRUSt precalculated files that contain otuid-to-KEGG predicted genetic contents were (I believe) generated by building a 16S phylogenetic tree of IMG-genome-sequenced and IMG-genome-unsequenced organisms and using the information in the sequenced tips of the tree to infer the genetic content of the unsequenced tips. The scripts used to reproduce this workflow are provided in the PICRUSt documentation under the “genome prediction tutorial” https://picrust.github.io/picrust/tutorials/genome_prediction.html. The tutorial walks through genome prediction for all 13_5 greengenes otuids from a subset of the db (the subset with IMG-annotated sequencing/KEGG information). Instead of predicting the genomes of the remaining entries in the 13_5 db, any thoughts on simply swapping them with the denoised amplicon 16S sequences and subsequently predicting genomes for our sequence variants off the IMG-annotated subset?

I put together a rough pipeline to test this: 1) align 16S sequences with known/sequenced KEGG gene content and those with unknown KEGG gene content 2) build a tree on the alignments 3) run PICRUSt genome prediction tutorial https://picrust.github.io/picrust/tutorials/genome_prediction.html

For 1) I subsetted the 13_5 greengenes db for otuids that match/map to IMG genome identifiers in the IMG KEGG counts file (IMG_ko_counts.tab) using the gg-to-IMGv400 (not v350) mapping file available from gg http://greengenes.secondgenome.com/downloads/database/13_5. I added denoised 16S amplicon sequences as entries to the db (250 bp, V4, in-house data) and aligned with pynast with options -e 200 and -i 0.1 (6-7 gg 13_5 sequences fail to align with more stringent default parameters).

For 2) I ran Fasttree -nt -gamma -fastest -no2nd -spr 4 (taken from greengenes 13_5 README on tree building methods, see above link)

For 3) I used the same gg-to-IMGv400 mapping file as in 1).

Both the denoised amplicon sequences and the pre-sequenced otuid sequences were assigned KEGG gene contents. On gross examination (histogram of rowSums on amplicon vs. KEGG table), the denoised amplicon sequence KEGG count distribution seems to match that of the pre-sequenced otuid KEGG counts. I compared the KEGG assignments and counts of the pre-sequenced otuids directly to the KEGG assignments provided in the official PICRUSt ko_13_5_precalculated.tab file. 97% of the pre-sequenced otuid KEGG assignments and counts matched the ko_13_5_precalculated data after intersecting both datasets for matching KEGG identifiers (100 or so KEGG IDs were not found in both sets…not sure why…~6700 KEGG IDs remained). This may be due to gg-to-IMGv400 vs. v350 differences.
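(Editor's sketch of the KEGG-ID intersection step described above, on hypothetical two-row stand-ins for the precalculated tables; real files are in the ko_13_5_precalculated.tab style, with KO identifiers as columns.)

```shell
# Toy stand-ins for two precalculated KEGG tables with partially overlapping KO columns
printf 'OTU_IDs\tK00001\tK00002\tK00003\n111\t1\t0\t2\n' > mine.tab
printf 'OTU_IDs\tK00002\tK00003\tK00004\n111\t0\t2\t5\n' > official.tab
# list each table's KO columns, one per line, sorted (drop the leading ID column)
head -n 1 mine.tab     | tr '\t' '\n' | tail -n +2 | sort > kos_mine
head -n 1 official.tab | tr '\t' '\n' | tail -n +2 | sort > kos_official
# KOs present in both tables: compare counts only over this shared set
comm -12 kos_mine kos_official
# -> K00002
#    K00003
```

Restricting both tables to the shared KO set before comparing counts is what makes the 97%-agreement figure above well-defined despite the ~100 KEGG IDs missing from one side or the other.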

This is a pretty rough test, but all-in-all it may point to a DADA2 -> PICRUSt pipeline that leverages the denoising capabilities of DADA2. Any thoughts on this approach/how to better test the results? Happy to post code if interested.

benjjneb commented 7 years ago

@vmaffei

I think that is a great idea. Although I doubt the difference will be large, it seems very likely that making use of the higher resolution of dada2, and the density of sequenced genomes in a number of important taxa, can only improve functional prediction.

Providing a workflow that motivated folks could follow would be very helpful, there is definitely interest in this application.

Depending on how far you wanted to take things, a benchmark of the standard Picrust workflow (97% closed-ref OTUs) vs. the customized genome prediction and sequence variant approach could be very compelling. I.e. showing how well each correlates with shotgun functional gene measurements on some dataset with both shotgun metagenomics and 16S data (as was done in the paper describing Picrust). At some point that might become a paper in and of itself.

vmaffei commented 7 years ago

Sounds great. I've drafted a rough, working pipeline below for anyone interested in this particular approach to DADA2 -> PICRUSt. Multiple lines in part 3 of this workflow were taken from the official picrust genome prediction tutorial https://picrust.github.io/picrust/tutorials/genome_prediction.html

Inputs: http://greengenes.secondgenome.com/downloads/database/13_5

Outputs:

Points of improvement:

Part 1: Start with R

1) Make study sequence db; subset gg_13_5.fasta for sequences with matching IMG kegg data (subsetting to the essentials greatly speeds up run-time)

2) Output sample count sequences for predict_metagenomes.py later

note: make sure to modify file locations and directories as you like them; this pipeline as written assumes the input files and folders (listed above) are in a folder named "genome_prediction" which is within your current directory.

# Dependencies
library(ShortRead)
library(biom) # use Joey's biom latest dev version; library(devtools); install_github("joey711/biom")
library(dada2)
# 1) Make study db
# read in gg 13_5 db
ref <- readFasta("genome_prediction/gg_13_5.fasta")
# read in IMG to ko
img_ko <- read.table(file = "genome_prediction/picrust_starting_files/IMG_ko_counts.tab", sep = "\t", row.names = 1, header = TRUE, comment.char = '')
# read in gg id to IMG
id_img <- read.table(file = "genome_prediction/gg_13_5_img.txt", sep = "\t", header = TRUE, comment.char = '')
# intersect gg id and ko list
img_ko_sub <- img_ko[rownames(img_ko) %in% id_img$img_genome_id, ]
id_img_sub <- id_img[id_img$img_genome_id %in% rownames(img_ko_sub), ]
# pull db ids
ids <- id(ref)
# subset db for gg w/ ko
ref_sub <- ref[as.character(ids) %in% id_img_sub$X.gg_id, ]
# build study db
ids_db <- as.character(id(ref_sub))
seqs_db <- as.character(sread(ref_sub))
# grab study seqs
load(file = "seqtab.nochim.robj")
seqs_study <- rownames(seqtab.nochim)
ids_study <- paste("study", 1:nrow(seqtab.nochim), sep = "_")
# merge db and study seqs
ids_out <- c(ids_db, ids_study)
seqs_out <- c(seqs_db, seqs_study)
fasta <- ShortRead(sread = DNAStringSet(seqs_out), id = BStringSet(ids_out))
writeFasta(fasta, file = "genome_prediction/gg_13_5_study_db.fasta")
# (optional) output gg id-to-IMG mapping info; modify this to filter later for predicted traits (predict_traits.py -l )
write.table(as.character(id(fasta)), file = "genome_prediction/traits_sample_filter.txt", sep = "\t", quote = FALSE, col.names = FALSE, row.names = FALSE)
# 2) output seq variant count data as biom;
# note: use Joey's biom latest dev version; library(devtools); install_github("joey711/biom")
seqtab_biom <- seqtab.nochim
rownames(seqtab_biom) <- ids_study
biom_object <- biom::make_biom(data = seqtab_biom)
biom::write_biom(biom_object, biom_file = "genome_prediction/sample_counts.biom")

Part 2: Align seqs and build tree

note: I chose pynast to align and fasttree to build the tree since they are quick, but any alignment / tree building combo is fair game for optimization

# align w/ pynast using QIIME scripts; the included options lessen alignment restrictions to prevent alignment failure
align_seqs.py -e 90 -p 0.1 -i ./genome_prediction/gg_13_5_study_db.fasta -o ./genome_prediction/gg_13_5_study_db.fasta.aligned
# build tree with fasttree; options are taken from greengenes 13_5 readme notes
FastTree -nt -gamma -fastest -no2nd -spr 4 ./genome_prediction/gg_13_5_study_db.fasta.aligned > ./genome_prediction/study_tree.tree

Part 3: Create new precalculated files

Adapted from https://picrust.github.io/picrust/tutorials/genome_prediction.html

# format 16S copy number data
format_tree_and_trait_table.py -t ./genome_prediction/study_tree.tree -i ./genome_prediction/picrust_starting_files/IMG_16S_counts.tab -m ./genome_prediction/gg_13_5_img.txt -o ./genome_prediction/format/16S/
# format kegg IMG data
format_tree_and_trait_table.py -t ./genome_prediction/study_tree.tree -i ./genome_prediction/picrust_starting_files/IMG_ko_counts.tab -m ./genome_prediction/gg_13_5_img.txt -o ./genome_prediction/format/KEGG/
# perform ancestral state reconstruction
ancestral_state_reconstruction.py -i ./genome_prediction/format/16S/trait_table.tab -t ./genome_prediction/format/16S/pruned_tree.newick -o ./genome_prediction/asr/16S_asr_counts.tab
ancestral_state_reconstruction.py -i ./genome_prediction/format/KEGG/trait_table.tab -t ./genome_prediction/format/KEGG/pruned_tree.newick -o ./genome_prediction/asr/KEGG_asr_counts.tab
# predict traits
predict_traits.py -i ./genome_prediction/format/16S/trait_table.tab -t ./genome_prediction/format/16S/reference_tree.newick -r ./genome_prediction/asr/16S_asr_counts.tab -o ./genome_prediction/predict_traits/16S_precalculated.tab
predict_traits.py -i ./genome_prediction/format/KEGG/trait_table.tab -t ./genome_prediction/format/KEGG/reference_tree.newick -r ./genome_prediction/asr/KEGG_asr_counts.tab -o ./genome_prediction/predict_traits/ko_precalculated.tab
# copy/move precalculated file to replace PICRUSt precalculated data files
cp ./genome_prediction/predict_traits/16S_precalculated.tab ./genome_prediction/predict_traits/16S_13_5_precalculated.tab
cp ./genome_prediction/predict_traits/ko_precalculated.tab ./genome_prediction/predict_traits/ko_13_5_precalculated.tab
# gzip files
gzip ./genome_prediction/predict_traits/16S_13_5_precalculated.tab
gzip ./genome_prediction/predict_traits/ko_13_5_precalculated.tab

important note on the last step: ensure the final gzipped files are named exactly as above so they can drop in and replace the official PICRUSt precalculated files. On that note, be sure to back up your original PICRUSt precalculated files in a safe/convenient place. If you replace them by accident, no worries; just re-download them from https://picrust.github.io/picrust/picrust_precalculated_files.html.

From here go ahead and move or copy the new .tab.gz files to your PICRUSt precalculated files directory.

cp ./genome_prediction/predict_traits/16S_13_5_precalculated.tab.gz picrust/data/
cp ./genome_prediction/predict_traits/ko_13_5_precalculated.tab.gz picrust/data/
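Before the copies above overwrite anything, it may be worth scripting the backup mentioned earlier. A sketch on stand-in files (the `picrust/data/` path follows the commands above; the backup directory name and the `touch` stand-ins are mine):

```shell
# create the data dir plus empty stand-ins for the stock precalculated files
mkdir -p picrust/data picrust/data_backup
touch picrust/data/16S_13_5_precalculated.tab.gz
touch picrust/data/ko_13_5_precalculated.tab.gz
# back up the stock files before the pipeline's outputs replace them
cp -p picrust/data/16S_13_5_precalculated.tab.gz \
      picrust/data/ko_13_5_precalculated.tab.gz \
      picrust/data_backup/
```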

Part 4: Run PICRUSt (finally!)

# run PICRUSt
normalize_by_copy_number.py -i ./genome_prediction/sample_counts.biom -o ./picrust/norm_counts.biom
predict_metagenomes.py -i ./picrust/norm_counts.biom -o ./picrust/meta_counts.biom

final note: the last 2 key missing pieces are 1) the KEGG BRITE pathway information for each KEGG ID and 2) the prediction confidence intervals.

I have a script ready to go for 1) that completes this step via the KEGG API, although I'm reluctant to post it until they get back to me regarding their policy on "bulk data downloads" and this particular use. Anyone with a KEGG license, though, has free access to these data through the KEGG FTP. Regarding 2), I'm not sure how NSTI values are calculated, but I think they would be pretty helpful for validating this workflow.

Please let me know if I should make edits! I ran this without errors not too long ago. Happy to discuss any part of this / incorporate suggested modifications from anyone interested.

yugi1001 commented 7 years ago

@vmaffei, is your seqtab.nochim.robj file in the same format as the output of DADA2? I had an error as shown below reading the fasta

fasta <- ShortRead(sread = DNAStringSet(seqs_out), id = BStringSet(ids_out))
Error in .Call2("new_XStringSet_from_CHARACTER", ans_class, ans_elementType, :
  key 117 (char 'u') not in lookup table

vmaffei commented 7 years ago

Hi @yugi1001 ,

I think mine is transposed, actually. Thanks for catching that!

try:

# grab study seqs
load(file = "seqtab.nochim.robj")
seqs_study <- colnames(seqtab.nochim)
ids_study <- paste("study", 1:ncol(seqtab.nochim), sep = "_")

and then further down:

# 2) output seq variant count data as biom;
# note: use Joey's biom latest dev version; library(devtools); install_github("joey711/biom")
seqtab_biom <- t(seqtab.nochim)

The snippet replaces rownames and nrow with colnames and ncol, respectively, and then transposes the seqtab for biom compatibility. If that works, let me know, and I'll update the main code above!
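The orientation matters because the biom table expects observations (here, sequence variants) as rows and samples as columns, while DADA2's seqtab has samples as rows. A toy shell illustration of the same transpose (the awk one-liner and `counts.txt` are mine, purely illustrative, not part of the pipeline):

```shell
# samples-as-rows table (DADA2's seqtab orientation): 2 samples x 3 variants
printf 'sampleA 5 0 2\nsampleB 1 3 0\n' > counts.txt
# transpose: accumulate each column, then print columns as rows,
# so the sequence variants end up as rows (biom's orientation)
awk '{for (i = 1; i <= NF; i++) col[i] = col[i] (NR > 1 ? " " : "") $i}
     END {for (i = 1; i <= NF; i++) print col[i]}' counts.txt > counts_t.txt
head -n 1 counts_t.txt   # → sampleA sampleB
```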

vmaffei commented 7 years ago

Added the pipeline to a new repo: https://github.com/vmaffei/dada2_to_picrust

yugi1001 commented 7 years ago

This works well. Thanks @vmaffei.

benjjneb commented 7 years ago

Added the pipeline to a new repo: https://github.com/vmaffei/dada2_to_picrust

That is pretty cool. Do you mind if I add a link to this to the dada2 website @vmaffei ?

vmaffei commented 7 years ago

@benjjneb I don't mind at all! Let me know if you get any useful feedback/criticism. I'm more than happy to develop this further. FYI, I'm currently in the process of benchmarking the method (as you suggested above), so at the moment, I don't quite know how well it performs compared to the official picrust pipeline yet. All the same, please share.

ChikiChickey commented 7 years ago

Hi, I am a beginner at these types of analyses, so please let me ask some elementary questions. I completed Part 1, skipped Parts 2 & 3, and tried to run Part 4, because I thought Parts 2 & 3 were optional (right?). However, after running the first line of Part 4, the error message below was displayed.

ValueError: No OTUs match identifiers in precalculated file. PICRUSt requires an OTU table reference/closed picked against GreenGenes. Example of the first 5 OTU ids from your table: study_128, study_129, study_124, study_125, study_126

I expected the OTU ids to be the Greengenes IDs. What should I do?

vmaffei commented 7 years ago

Hi @ChikiChickey ! Go ahead and give all 4 parts a try in the order listed. The only optional part of the pipeline is one small step within Part 1 that may speed up the analysis a little bit.

edit: to answer the rest of your question: after completing Parts 1-4 you should expect both Greengenes IDs and study_#'s, which are the IDs for your individual denoised sequences that now have functional prediction entries in the precalculated files. Keep in mind that the study_#'s generated for one project will not match the study_#'s from other projects run through the dada2_to_picrust pipeline. Feel free to replace the "study" header with a header of your choosing in Part 1 to help clarify later which precalculated files came from which study.

Ex:

ids_study <- paste("study", 1:ncol(seqtab.nochim), sep = "_")

may also be:

ids_study <- paste("16S_project_A", 1:ncol(seqtab.nochim), sep = "_")

This is especially helpful for distinguishing precalculated files produced by this pipeline from the official PICRUSt ones. Look inside the precalculated files with gunzip -c file.tab.gz | less -S and inspect for "16S_project_A", or run gunzip -c file.tab.gz | grep "16S_project_A" and note any matches.
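The inspection commands above can be sketched end-to-end on toy data (the file `toy_precalc.tab` and its two tagged rows are stand-ins for a real precalculated table):

```shell
# toy precalculated table whose rows carry a custom study prefix
printf '16S_project_A_1\t1\n16S_project_A_2\t2\n' > toy_precalc.tab
gzip -f toy_precalc.tab
# inspect the gzipped table without uncompressing it on disk
gunzip -c toy_precalc.tab.gz | grep -c '16S_project_A'   # → 2
```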

ChikiChickey commented 7 years ago

Thank you for your comment, @vmaffei. Sorry, I misunderstood Parts 2 & 3 as optional because Part 4 seems to need only the file generated by Part 1. OK, I'll try Parts 2 and 3 and then 4. And thank you for your clear advice about the ID meanings; now I understand.

ChikiChickey commented 7 years ago

Hi @vmaffei

I ran the second line of Part 2, FastTree -nt -gamma -fastest -no2nd -spr 4 ./genome_prediction/gg_13_5_study_db.fasta.aligned > ./genome_prediction/study_tree.tree, but the output file was 0 bytes, and the message below was displayed:

FastTree Version 2.1.3 SSE3
Alignment: ./genome_prediction/gg_13_5_study_db.fasta.aligned
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Fastest +NNI +SPR (4 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.50
ML Model: Jukes-Cantor, CAT approximation with 20 rate categories
Error reading header line

Sorry, I could not understand the meaning. I already installed FastTree in the current directory following http://www.microbesonline.org/fasttree/#Install. Could you give me any advice?

vmaffei commented 7 years ago

Hey @ChikiChickey, no problem. I usually get the "Error reading header line" error whenever FastTree cannot find the aligned fasta file itself. In your case, your input is the alignment folder gg_13_5_study_db.fasta.aligned/, not the actual aligned fasta gg_13_5_study_db.fasta.aligned/gg_13_5_study_db_aligned.fasta. Take a quick look to see if you've directly passed in the fasta.
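A quick way to check is to test whether the path you are about to hand FastTree is a file or a directory. A sketch on a toy directory layout mirroring the pipeline's names (the `mkdir`/`touch` lines only fabricate that layout for illustration):

```shell
# fabricate the layout align_seqs.py produces: a directory whose name
# ends in .aligned, with the actual aligned fasta inside it
mkdir -p gg_13_5_study_db.fasta.aligned
touch gg_13_5_study_db.fasta.aligned/gg_13_5_study_db_aligned.fasta
d=gg_13_5_study_db.fasta.aligned
# a directory here means you passed the folder, not the fasta
[ -d "$d" ] && echo "$d is a directory; pass the fasta inside it"
[ -f "$d/gg_13_5_study_db_aligned.fasta" ] && echo "found aligned fasta"
```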

ChikiChickey commented 7 years ago

Hi @vmaffei, I really thank you for all your help!! I completed all of the steps! But finally, I want to know how to convert the biom file meta_counts.biom into a txt file or something similar. I got errors when running biom convert -i ./picrust/meta_counts.biom -o ./picrust/meta_counts.txt --to-tsv and also biom convert -i ./picrust/meta_counts.biom -o ./picrust/meta_counts.txt -b.

The error messages were

biom convert: error: no such option: --to-tsv

and

pyqi.core.exception.CommandError: Input does not look like a BIOM-formatted file. Did you accidentally specify that a classic table file should be created from a BIOM table file?

respectively.

Could you give me any advice? I suspect this may be a problem with my Mac's biom installation rather than with the pipeline, especially since --to-tsv isn't recognized. I'm sorry for asking so many questions, and thank you so much.

vmaffei commented 7 years ago

@ChikiChickey biom convert -i meta_counts.biom -o meta_counts.tsv --to-tsv works for me! When I run biom --version I get version 2.1.5, so maybe check your version and consider updating if it's old. Or install a fresh copy on another computer and give it a shot. If that doesn't work, let's chat by e-mail to see if we can get you going.

ChikiChickey commented 7 years ago

@vmaffei Thank you so much!! I could convert my file when I used another Mac environment!! But in my converted meta_counts file, the KEGG_Description column was not shown. There were only the KO IDs (ex. K0001, K0002...) and the counts of each in each sample. Is that OK, or am I doing something wrong? And if I want the KEGG_Description information, what should I do? Do you have any ideas?

vmaffei commented 7 years ago

@benjjneb I just posted a new version of the dada2_to_picrust pipeline with the results from a validation study comparing DADA2 -> PICRUSt via ASR as well as DADA2 -> PICRUSt via RDP kmers. In short, both pipelines (in their current form) perform essentially the same as the original PICRUSt pipeline, but at least now we may (more or less) associate predicted KEGG genomes as well as predicted 16S copy number directly to DADA2 sequence variants. If you or @joey711 are interested, it may be neat to add an "import_genome" function to phyloseq that links/bundles DADA2 sequence variants with their predicted 16S and KEGG gene counts for downstream predicted genomes analysis, 16S copy number correction, etc. Happy to work with you all on that if interested...also happy to incorporate improvements/tweaks to the dada2_to_picrust pipeline as well as the validation pipeline if anything comes to mind.

dada2_to_picrust, v2: https://github.com/vmaffei/dada2_to_picrust

benjjneb commented 7 years ago

@vmaffei

Awesome! What do you think @joey711 ? My first thought is that picrust annotation makes more sense in the phyloseq package, as phyloseq does the more complex analysis and annotations while dada2 is more about the raw data analysis. Or would a stand-alone package make the most sense?

SilasK commented 7 years ago

I wanted to suggest the Piphillin server. It is more or less a black box, mapping each sequence to the closest reference genome, but it is more up to date than PICRUSt and can take any dada2 sequence. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0166104

benjjneb commented 7 years ago

Closing as effectively solved by @vmaffei at https://github.com/vmaffei/dada2_to_picrust

(thanks again!)