biocore / q2-greengenes2

A QIIME 2 plugin for interaction with the Greengenes2 database
BSD 3-Clause "New" or "Revised" License

What file(s) on the ftp site can I use as a reference taxonomy for classification? #8

Closed ilnamkang closed 1 year ago

ilnamkang commented 1 year ago

Hi,

What file(s) on the ftp site can I use as a reference taxonomy for classification?

In other words, which file should I pass to --i-reference-taxonomy in the command below? Is that file available on the ftp site?

qiime greengenes2 taxonomy-from-features \
  --i-reference-taxonomy \
  --i-reads <your_FeatureData[Sequence]> \
  --o-classification

I've tried using the four files below, but all failed.

2022.10.taxonomy.id.nwk.qza
2022.10.taxonomy.asv.nwk.qza
2022.10.phylogeny.id.nwk.qza
2022.10.phylogeny.asv.nwk.qza

The error messages were the same.

-----
Plugin error from greengenes2:

No requested tips found

Debug info has been saved to /tmp/qiime2-q2cli-err-4v99y4et.log
-----

Thanks.

wasade commented 1 year ago

Hi @ilnamkang, sorry for the delay in reply. Can you describe how your FeatureData[Sequence] artifact was constructed? If the identifiers are ASVs, then you would want to use 2022.10.phylogeny.asv.nwk.qza. If they are MD5s, you would want to use 2022.10.phylogeny.md5.nwk.qza.
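As a minimal sketch of the two identifier styles mentioned above: this assumes (it is not confirmed in this thread) that an "ASV" identifier is the sequence string itself and an "MD5" identifier is the hex MD5 digest of that sequence. The sequence fragment and the function name md5_identifier are hypothetical, chosen only for illustration.

```python
import hashlib


def md5_identifier(sequence: str) -> str:
    """Return the hex MD5 digest of a nucleotide sequence string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


# Hypothetical sequence fragment, for illustration only.
seq = "TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCG"

asv_style_id = seq                  # assumption: ASV-style ID is the sequence itself
md5_style_id = md5_identifier(seq)  # assumption: MD5-style ID is a 32-character hex digest

print("ASV-style:", asv_style_id)
print("MD5-style:", md5_style_id)
```

If the identifiers in a FeatureData[Sequence] artifact look like neither of these (for example, arbitrary isolate names), an exact lookup by identifier would not be expected to find them.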

ilnamkang commented 1 year ago

Thank you for the reply.

1) I made my FeatureData[Sequence] artifact using the following command.

qiime tools import --input-path test.fasta --output-path test.qza --type 'FeatureData[Sequence]'

The input sequence file (test.fasta) contains 16S rRNA gene sequences from my cultured isolates (not amplicon sequences) like below.

-----
>IMC1234
TTGAACGCTGGCGGCATGCCTAACACATGCAAGTCGA~~
>IMC23451
GGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATG~~
>IMC26040
AACGAACGCTGGCGGCGTGGATAAGACATGCAAGTTG~~
-----

2) Unfortunately, the two files you've suggested did not work, with the same error messages as before.

Thanks.

Ilnam

wasade commented 1 year ago

Oh I see, thanks. The taxonomy-from-features action assumes the records being looked up exist exactly, by sequence identifier, within the Greengenes2 database. What you would need to do here, I believe, is use the non-v4-16s action, which performs a closed-reference assessment against the full-length records in the backbone. To use non-v4-16s, you'll need to first construct a FeatureTable[Frequency] object from your data. Typically, the methods to construct those artifacts assume there are many samples, and each sample has many features. I'm not sure if that's the case for your data, although you may be able to fake it.
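One way the "fake it" idea could look, as a rough sketch only: build a single-sample, tab-separated table that gives each FASTA record a count of 1. The sample name "sample1", the helper names, and the shortened records are all hypothetical, and the "#OTU ID" header assumes the classic tab-separated layout that BIOM tooling reads.

```python
def read_fasta_ids(fasta_text: str) -> list[str]:
    """Collect record identifiers (first token after '>') from FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def single_sample_table(feature_ids: list[str], sample_id: str = "sample1") -> str:
    """Build a tab-separated table with one sample and a count of 1 per feature."""
    rows = [f"#OTU ID\t{sample_id}"]
    rows.extend(f"{fid}\t1" for fid in feature_ids)
    return "\n".join(rows) + "\n"


# Hypothetical, shortened records standing in for test.fasta.
fasta_text = ">IMC1234\nTTGAACGCTGGCGG\n>IMC23451\nGGCTCAGATTGAAC\n"

table_text = single_sample_table(read_fasta_ids(fasta_text))
print(table_text, end="")
```

The resulting text would still need to be converted to BIOM (for example with biom convert) and imported as FeatureTable[Frequency] with qiime tools import. Whether non-v4-16s behaves sensibly on such a one-sample table is untested here.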

Alternatively, if your interest is to compare your sequences against Greengenes2, it would be possible to take your FASTA and use BLAST (or other related methods) against the full-length records in the backbone. If that's the case, I'd be happy to suggest a set of commands.

ilnamkang commented 1 year ago

Thank you for your explanation.

It seems that "taxonomy-from-features" would not serve my purpose.

I think that "feature-classifier classify-sklearn", using the "2022.10.backbone.full-length.nb.qza" file as a pre-trained classifier, would be suitable for my data. (https://github.com/biocore/q2-greengenes2/issues/9)
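A minimal sketch of how that command line might be assembled, assuming the classifier file has been downloaded to the working directory and that test.qza is the FeatureData[Sequence] artifact imported earlier. The output name test-taxonomy.qza and the helper function are hypothetical.

```python
import shlex


def classify_sklearn_command(classifier: str, reads: str, output: str) -> list[str]:
    """Assemble the argument list for qiime feature-classifier classify-sklearn."""
    return [
        "qiime", "feature-classifier", "classify-sklearn",
        "--i-classifier", classifier,
        "--i-reads", reads,
        "--o-classification", output,
    ]


cmd = classify_sklearn_command(
    "2022.10.backbone.full-length.nb.qza",  # pre-trained classifier named above
    "test.qza",                             # FeatureData[Sequence] from qiime tools import
    "test-taxonomy.qza",                    # hypothetical output artifact name
)

# Print the command for review; it could then be run in a QIIME 2 environment,
# for example via subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```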

Ilnam

wasade commented 1 year ago

Ah, yes, good call! I think that makes perfect sense.