ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Download fasta file #257

Open MjelleLab opened 2 years ago

MjelleLab commented 2 years ago

Hi, I wonder where I can access the fasta files of the genomes for the RNA-viruses within Serratus.

asl commented 2 years ago

@MjelleLab Serratus did not assemble complete genomes of RNA viruses, only RdRPs, which are part of PalmDB (@ababaian please correct me if I'm wrong). However, there are some assemblies, in particular for all CoVs. See more information at https://github.com/ababaian/serratus/wiki/Assembly-Data

rcedgar commented 2 years ago

Actually we (meaning @rchikhi IIRC) did complete assemblies of hundreds(?) of metagenome libraries generating many contigs with full and partial RNA virus genomes from many different phyla and families. Hopefully the SQL API gives a way to identify the RdRP+ contigs within those assemblies, this would be @ababaian's department. Looks like a gap in the database and/or wiki documentation that we don't explain how to find the RNA virus contigs.

rchikhi commented 2 years ago

For the assemblies we generated, we only specifically looked at extracting CoVs at the time. For other viral families, a reasonable strategy would be to browse the Serratus database to identify which SRA accessions have RdRP+ contigs, intersect that list of accessions with the list of assemblies we generated, and then run a generic viral identification tool (e.g. Virsorter) on those assemblies.
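The intersection step of that strategy could look something like the minimal sketch below. It assumes you already have the two accession lists in hand (one from the Serratus database, one from the assembly listing); the function name `runs_with_assemblies` and the placeholder accessions are hypothetical and not part of Serratus.

```python
def runs_with_assemblies(rdrp_runs, assembled_runs):
    """Return the sorted run accessions that appear in both inputs.

    rdrp_runs:      iterable of SRA run accessions reported as RdRP+
    assembled_runs: iterable of SRA run accessions that have an assembly
    """
    def normalize(accession):
        # Tolerate stray whitespace and lowercase input.
        return accession.strip().upper()

    rdrp = {normalize(a) for a in rdrp_runs if a.strip()}
    assembled = {normalize(a) for a in assembled_runs if a.strip()}
    return sorted(rdrp & assembled)


if __name__ == "__main__":
    # Placeholder accessions for illustration only (not real query results).
    rdrp_positive = ["SRR0000001", "DRR029953", "ERR0000002"]
    with_assembly = ["DRR029953", "SRR0000003"]
    print(runs_with_assemblies(rdrp_positive, with_assembly))
```

The resulting list is the set of assemblies worth passing on to a viral identification tool.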

rcedgar commented 2 years ago

@rchikhi I don't think that's correct, I'm pretty sure we did a macro-micro comparison where we made a large batch of macro-assemblies (complete SRAs) to validate micro-assemblies (diamond hits only) as part of our QC for our protein search methodology. If you don't remember this I can try to dig up backups with my notes; unfortunately they're on a drive that recently got corrupted.

rcedgar commented 2 years ago

We used a couple of methods including Virsorter to classify palmprint+ contigs as viral / other as part of the same exercise. See Ext Data Fig 2(h) in the published paper: "(h) Kingdom predicted by Virsorter2 for RdRP+ contigs (by Palmscan) obtained by full assembly of 880 randomly chosen RdRP+ runs". These 880 runs were the successful assemblies from a list of 1k attempted.

rcedgar commented 2 years ago

We should post+document the RdRP+ contigs from those 880 complete assemblies if this is not already done; for sure something should be added to the Wiki page mentioned earlier in this issue thread.

ababaian commented 2 years ago

Short answer @MjelleLab: I'd lean towards Rayan's strategy. We provide an index of RdRP sequences/barcodes to identify where in the SRA a particular RNA virus (or its relatives) can be found. If this index is sufficient for you, I would suggest either trying palmID with an input RdRP sequence to find which SRA libraries contain potential matches, or searching through the micro-assembly data directly (explained here).

Long answer: as others have said, we have something like 56K assemblies, with like 50K of those being from Coronavirus libraries. You can download a list of SRA libraries with available assemblies with `aws s3 ls s3://lovelywater/assembly/contigs/` and check for a `DRR029953.coronaspades.contigs.fa.mfc` file (note the MFC compression).
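To turn that listing into a list of accessions, something like the sketch below may help. It assumes the output of the `aws s3 ls` command above has been saved as text, and that every assembly follows the `<accession>.coronaspades.contigs.fa.mfc` naming of the one example file; other suffixes may exist in the bucket. The function name and the dates and sizes in the sample listing are made up for illustration.

```python
# Suffix taken from the single example file name mentioned in this thread;
# it is an assumption that all assemblies are named this way.
SUFFIX = ".coronaspades.contigs.fa.mfc"


def accessions_from_listing(listing_text, suffix=SUFFIX):
    """Extract run accessions from saved `aws s3 ls` output.

    Object lines look like `<date> <time> <size> <name>`, and prefix lines
    look like `PRE <name>/`. In both cases the name is the last
    whitespace-separated field, so only that field is inspected.
    """
    accessions = []
    for line in listing_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[-1]
        if name.endswith(suffix):
            accessions.append(name[: -len(suffix)])
    return accessions


if __name__ == "__main__":
    # Invented sample listing: dates and sizes are placeholders.
    sample = (
        "                           PRE some-prefix/\n"
        "2020-01-01 00:00:00       1234 DRR029953.coronaspades.contigs.fa.mfc\n"
        "2020-01-01 00:00:00         99 notes.txt\n"
    )
    print(accessions_from_listing(sample))
```

The returned accessions can then be intersected with a list of RdRP+ runs from the Serratus database to pick which assemblies to download.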

@rcedgar, I think a good SQL interface would be great :) We should slap it on the TODO list and integrate it into the web-UI.

ababaian commented 2 years ago

Maybe an addendum @MjelleLab, could you tell us what YOU would find most useful? We have organized the data internally within the project, but if we better understand use-cases from users we can offer better solutions in how we serve the available data.

rchikhi commented 2 years ago

Regarding https://github.com/ababaian/serratus/issues/257#issuecomment-1039597369: @rcedgar you're right, I had overlooked that experiment! It's "only" a subset of 880 assemblies, but indeed there are some ~potentially~ novel viral contigs in there.

rcedgar commented 2 years ago

Some of these had hundreds of viruses, I believe we found something like 10-20% of novel species in the 880 assemblies. Novel RdRPs are strongly concentrated in large metagenomes/viromes, and these were preferentially chosen by the random selection of the 1k subset.