bokulich-lab / RESCRIPt

REference Sequence annotation and CuRatIon Pipeline
BSD 3-Clause "New" or "Revised" License

full SILVA LSU download #182

Closed sghignone closed 2 months ago

sghignone commented 2 months ago

Dear all, I need to update the taxonomy of some SILVA identifiers from LSU v. 132 to the newer v. 138.1. I learned how to query SILVA, for example with

qiime rescript get-silva-data \
--p-target LSURef \
--p-version 138.1 \
--p-include-species-labels \
--p-no-rank-propagation \
--p-ranks kingdom phylum class order family genus \
--p-no-download-sequences \
--o-silva-taxonomy silva_138.1_LSURef_taxonomy \
--o-silva-sequences silva_138.1_LSURef_sequences \
--verbose

in order to get the latest LSU Ref v. 138.1.

I have found that some older identifiers are no longer present in LSURef but are still listed in the 'general' database, e.g. AZNG01000057.4678.8336, ASQI01000343.2928.6301, MASP02000184.6051.9536, AGAX01000007.216954.220225, etc.

My question is: how can I use rescript get-silva-data to get the full SILVA LSU database, not only the Ref or RefNR section?

mikerobeson commented 2 months ago

Hi @sghignone, I just pulled the LSU FASTA files for both 132 and 138.1 from here. This is also where RESCRIPt fetches these files from. You are right, those IDs appear to no longer exist within 138.1. Sequences are often added or removed between database versions because of changes in how the data are curated. This may be something you'd need to contact the SILVA team about. Can you clarify what you mean by the 'general' database?
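A check like this can be sketched in shell against a decompressed SILVA FASTA export. The function name `report_ids` and the file names in the usage note are hypothetical, and the sketch assumes each header line starts with `>` followed directly by the accession and then a space:

```shell
# Report, for each ID in a list, whether it occurs as an accession in a FASTA file.
# Usage: report_ids ids.txt sequences.fasta
# Assumption: headers look like ">ACCESSION.start.stop taxonomy string".
report_ids() {
    ids_file=$1
    fasta_file=$2
    awk '
        # First file (FASTA): remember the accession of every header line.
        FILENAME == ARGV[1] {
            if ($0 ~ /^>/) seen[substr($1, 2)] = 1
            next
        }
        # Second file (ID list): one ID per non-empty line.
        NF { print $1 "\t" (($1 in seen) ? "present" : "absent") }
    ' "$fasta_file" "$ids_file"
}
```

For example, `report_ids old_ids.txt silva_138.1_LSURef.fasta` would print one tab-separated line per ID, marked `present` or `absent`.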

sghignone commented 2 months ago

Hi Mike, thanks for checking. By 'general' I mean the part of the database that is included in neither Ref nor RefNR. Let me explain with an example: take the sequence AGAX01000007.216954.220225 and query https://www.arb-silva.de/search/ for it as a sequence entry, without selecting Ref or RefNR. You will get Ganoderma lucidum G.260125-1, which does not appear in the LSU Ref 138.1 I downloaded with this tool, but it is stored... somewhere ('general'). Any idea?

mikerobeson commented 2 months ago

No worries. :-)

RE general: Okay, that is what I figured. I also checked the search tool, as you did earlier. Sadly, the Ref, RefNR99, and Parc files are the only downloadable exports as far as I am aware. In my experience, the downloadable files and the ARB files consist only of what makes it into Ref / RefNR99, which is what we pull from. 🤷

It sounds like you already have QZA files for 132... You can extract those sequences and taxonomy from 132 and then append them to the 138.1 sequence and taxonomy files, using the filtering and merging tools in base QIIME 2.
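A minimal sketch of that approach, under several assumptions: both taxonomy artifacts have been exported to `taxonomy.tsv` files with `qiime tools export`, the function names and all file names are hypothetical, and both sequence artifacts share the same semantic type (RNA sequences may first need `qiime rescript reverse-transcribe`). Note that the 138.1 sequences artifact must actually contain data, so it should be fetched without `--p-no-download-sequences`.

```shell
# List feature IDs present in the old taxonomy TSV but absent from the new one.
# Inputs are exported taxonomy.tsv files (header line, then ID<TAB>taxon).
# The output starts with a "Feature ID" header so it can serve as QIIME 2 metadata.
ids_only_in_old() {
    old_tsv=$1
    new_tsv=$2
    awk -F '\t' '
        # First file (new release): collect its IDs, skipping the header.
        FILENAME == ARGV[1] { if (FNR > 1) in_new[$1] = 1; next }
        # Second file (old release): emit a header, then IDs missing from new.
        FNR == 1 { print "Feature ID"; next }
        NF && !($1 in in_new) { print $1 }
    ' "$new_tsv" "$old_tsv"
}

# Append the old-only records to the new release (hypothetical artifact names).
# Arguments: ids.tsv old_seqs.qza old_tax.qza new_seqs.qza new_tax.qza
append_old_to_new() {
    ids=$1; old_seqs=$2; old_tax=$3; new_seqs=$4; new_tax=$5

    # Keep only the old sequences and taxonomy whose IDs vanished from the new release.
    qiime feature-table filter-seqs \
        --i-data "$old_seqs" \
        --m-metadata-file "$ids" \
        --o-filtered-data old_only_seqs.qza
    qiime rescript filter-taxa \
        --i-taxonomy "$old_tax" \
        --m-ids-to-keep-file "$ids" \
        --o-filtered-taxonomy old_only_tax.qza

    # Merge the rescued records into the new release.
    qiime feature-table merge-seqs \
        --i-data "$new_seqs" \
        --i-data old_only_seqs.qza \
        --o-merged-data merged_seqs.qza
    qiime feature-table merge-taxa \
        --i-data "$new_tax" \
        --i-data old_only_tax.qza \
        --o-merged-data merged_tax.qza
}
```

A run might look like `ids_only_in_old tax132/taxonomy.tsv tax138/taxonomy.tsv > old_only_ids.tsv`, followed by `append_old_to_new old_only_ids.tsv` with the four artifacts. Keep in mind that the rescued records carry 132 taxonomy strings, which may not match the 138.1 naming.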