ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Data Compilation 1: CoV master table #193

Closed ababaian closed 3 years ago

ababaian commented 3 years ago

F1: CoV Phylogenetic tree

Objective: Create a master table for Coronaviridae containing all GenBank records. In practice this will have to be split over several TSV files, which will be stored in s3://serratus-public/seq/cov5/

Tasks

TSVs

cov5 query: "taxid11118[Organism:exp]" on 20/07/11. 43617 entries returned

nido5 query: "txid76804[Organism:exp] NOT txid11118[Organism:exp]" on 20/07/11. 37050 entries returned
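As a rough illustration of how the two queries above could be re-run programmatically, here is a minimal sketch that builds NCBI E-utilities `esearch` URLs for them. It only constructs the URLs and makes no network request. The helper name `esearch_url` and the `QUERIES` dict are hypothetical, the `nuccore` database is an assumption, and the cov5 term is written with the `txid` prefix (the thread writes `taxid11118`), which is also an assumption.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query terms from the issue. Assumption: the cov5 term uses the "txid"
# prefix, matching the nido5 term, rather than "taxid" as typed in the thread.
QUERIES = {
    "cov5": "txid11118[Organism:exp]",
    "nido5": "txid76804[Organism:exp] NOT txid11118[Organism:exp]",
}


def esearch_url(term, db="nuccore", retmax=0):
    """Build an esearch URL; retmax=0 asks for the hit count without IDs."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    for name, term in QUERIES.items():
        print(name, esearch_url(term))
```

Entry counts returned today will differ from the 20/07/11 numbers, since GenBank keeps growing.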

ababaian commented 3 years ago

Currently Proposed Schema

| Field | Notes |
|---|---|
| accession | GenBank accession, or Serratus accession |
| name | Header for the FASTA record |
| exemplar | Accession of the selected "exemplar" of which this sequence is a member |
| taxid | Viral taxonomic ID, species level if possible |
| serratax_id | serratax inferred taxonomic ID |
| serraplace_id | serraplace inferred taxonomic ID |
| taxid_sub-genus | Viral taxonomic ID for sub-genus; priority 1) taxid 2) consensus serratax/serraplace 3) human review. OTU identifier if unplaced |
| taxid_genus | Viral taxonomic ID for genus; priority 1) taxid 2) consensus serratax/serraplace 3) human review. OTU identifier if unplaced |
| sra | For Serratus assemblies, the SRA accession; for GenBank records, NA |
| length | Nucleotides in record |
| refseq_neighbour | Closest RefSeq record; include self accession |
| rs_pctid | Percent nucleotide identity to closest RefSeq record |
| genome_neighbour | Closest GenBank CoV whole genome. For GenBank records, include self accession |
| gn_pctid | Percent nucleotide identity to closest GenBank CoV whole genome |
| fragment_neighbour | Closest GenBank CoV record. For GenBank records, include self accession |
| fr_pctid | Percent nucleotide identity to closest GenBank CoV fragment |
| 5UTR | CV-predicted 5' UTR present. T/F |
| 3UTR | CV-predicted 3' UTR present. T/F |
| RdRP | HMM-predicted RdRP (pol) present. T/F |
| whole_genome | T if 5UTR and 3UTR are both T, i.e. whole genome inferred present |
| host_taxid | Host species taxonomic identifier, when explicitly available |
| host_taxid_inferred | Host species taxonomic identifier, including records inferred via our parsing |
| host_orderid | Order of host, from host_taxid_inferred |

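
Two of the schema rules above are derived rather than measured: `whole_genome` and the priority order used for `taxid_genus` / `taxid_sub-genus`. A minimal sketch of both follows. The `Record` class, the attribute names `utr5` / `utr3`, and the function `resolve_taxid` are hypothetical names chosen for illustration, and mapping an ID up to genus or sub-genus rank would need a taxonomy lookup that is not shown here.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Record:
    """Hypothetical subset of one master-table row."""
    accession: str
    taxid: Optional[int] = None
    serratax_id: Optional[int] = None
    serraplace_id: Optional[int] = None
    utr5: bool = False  # "5UTR" column
    utr3: bool = False  # "3UTR" column


def whole_genome(rec: Record) -> bool:
    """Infer a whole genome when both UTRs are predicted present."""
    return rec.utr5 and rec.utr3


def resolve_taxid(rec: Record,
                  reviewed: Optional[int] = None,
                  otu_id: Optional[str] = None) -> Union[int, str, None]:
    """Apply the priority: 1) taxid, 2) serratax/serraplace consensus,
    3) human review; fall back to the OTU identifier if unplaced."""
    if rec.taxid is not None:
        return rec.taxid
    # Assumption: "consensus" means both classifiers report the same ID.
    if rec.serratax_id is not None and rec.serratax_id == rec.serraplace_id:
        return rec.serratax_id
    if reviewed is not None:
        return reviewed
    return otu_id
```

For example, `resolve_taxid(Record("serr1234", serratax_id=1, serraplace_id=2), otu_id="OTU_7")` falls through to the OTU identifier because the two classifiers disagree and no reviewed ID is given.
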
rcedgar commented 3 years ago

@ababaian Please correct / confirm. "Exemplar" for tree-building: I assume this means one sequence per species or OTU, and suggest we use generically "OTU" for "species or sequence cluster when species name not available". I am constructing OTUs and will deliver to @Pbdas ASAP for tree-building.

ababaian commented 3 years ago

"Exemplar" is what you're saying, one sequence per species or OTU that we define as 'canonical'. This should follow the previous priority order we've established: 1) RefSeq 2) Genbank Whole Genome 3) GenBank fragment 4) Assembly. As such each cluster of sequences will be under the umbrella of a single "Exemplar" that will be named by the exemplar accession.

For example, all SARS-CoV-2 sequences will contain "NC_045512" in this field, as this is the highest-ranking sequence in that OTU. Frank will be "serr1234", as that will be the highest-ranking sequence in its OTU.
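
The exemplar rule described above can be sketched as follows: within each OTU, the sequence from the highest-priority source becomes the exemplar, and every member of the OTU carries that accession. The function name `pick_exemplars`, the source labels, the OTU labels in the usage example, and the tie-break by accession are all assumptions made for illustration; the ranking is passed as a parameter so a different order can be swapped in.

```python
# Source priority as stated in the comment above, highest first.
# The label strings themselves are hypothetical.
PRIORITY = ["refseq", "genbank_whole_genome", "genbank_fragment", "assembly"]


def pick_exemplars(records, priority=PRIORITY):
    """Map each accession to the exemplar accession of its OTU.

    records: iterable of (accession, otu, source) tuples.
    """
    records = list(records)  # iterated twice below
    rank = {source: i for i, source in enumerate(priority)}
    best = {}  # otu -> (rank, accession) of the current best candidate
    for accession, otu, source in records:
        # Assumption: ties within a source are broken by accession order.
        key = (rank[source], accession)
        if otu not in best or key < best[otu]:
            best[otu] = key
    return {accession: best[otu][1] for accession, otu, _ in records}


if __name__ == "__main__":
    table = pick_exemplars([
        ("NC_045512", "sars_cov_2", "refseq"),
        ("MT263399", "sars_cov_2", "genbank_whole_genome"),
        ("serr1234", "frank", "assembly"),
    ])
    print(table)
```
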

rcedgar commented 3 years ago

Actually, GenBank fragments cannot be included in OTUs because there is no way to measure identity of non-overlapping fragments -- they could be in the same species or highly diverged from each other. I doubt any fragment has a species name not assigned to a complete genome.

ababaian commented 3 years ago

Our operational definition of inclusion, as discussed in the call, is the presence of RdRP. Non-RdRP fragments can be listed as 'unclassified' for now.

ababaian commented 3 years ago

I came across the ICTV list of 'exemplar' virus sequences in GenBank. ( s3://serratus-public/seq/cov5/VMR 010520 MSL35.xlsx )

We should certainly include these sequences among our exemplars as well, since they provide a well-annotated name and species designation at different places in the tree.

https://talk.ictvonline.org/taxonomy/vmr/m/vmr-file-repository/9603

rcedgar commented 3 years ago

Suggest updating the priority ranking for exemplars etc. to:

ICTV > RefSeq > GB complete > Serratus assembly

Assuming ICTV is "more official" than RefSeq.

rcedgar commented 3 years ago

We should add a.a. identities to closest known species for taxonomic genes per #195.

rcedgar commented 3 years ago

Suggest adding fields for nt & protein classifier scores as discussed in #197

rchikhi commented 3 years ago

An updated master table, with sequencing technology and also with data from the ~6k BGC-extracted assemblies (not just CheckV): https://serratus-rayan.s3.amazonaws.com/sra_master_table.csv

rchikhi commented 3 years ago

Note: 40 ONT datasets, which all seem 'fine'. They're all SARS-CoV-2. E.g.:

| accession | length | nb_contigs | category | serratax_id | serraplace_id | refseq_neighbour | refseq_pctid | genome_neighbour | genome_pctid | fragment_neighbour | fragment_pctid | platform |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRR11140745 | 29194 | 1 | A | 694009 | | NC_045512.2 | 100 | MT263399.1 | 99.9 | MT703964.1 | 81.2 | OXFORD_NANOPORE |

rcedgar commented 3 years ago

@rchikhi How are you identifying the neighbors? Minimap2? Are you using my python2 script to process the SAM records, or what?

rchikhi commented 3 years ago

Yes, minimap2, your python script, results only extracted from the first hit.
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/master_table/minimap2_contigs.sh

It can be improved in two ways, following your feedback:

1) I'm using the original cov5.fa file as the set of complete genomes (uclust'd 97%). Still fine, or should I move to a newer one like nt_otus.id99.fa?
2) I can get stats instead from the longest contig, not just the first mapping hit.
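
To make the second improvement concrete, here is a minimal sketch of picking the neighbour from the longest contig rather than from the first reported hit. It is not the script linked above: it assumes minimap2 was run with its default PAF output (the thread processes SAM records), and the function name `best_hit_for_longest_contig` and the choice of matches divided by alignment block length as the identity measure are assumptions.

```python
def best_hit_for_longest_contig(paf_lines):
    """Return the neighbour and percent identity for the longest contig.

    PAF columns used (1-based): 1 query name, 2 query length,
    6 target name, 10 matching bases, 11 alignment block length.
    """
    best_key = None
    best_hit = None
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        qname, qlen, tname = fields[0], int(fields[1]), fields[5]
        nmatch, alnlen = int(fields[9]), int(fields[10])
        if alnlen == 0:
            continue
        # Prefer the longest contig; among its hits, the most matching bases.
        key = (qlen, nmatch)
        if best_key is None or key > best_key:
            best_key = key
            best_hit = {
                "contig": qname,
                "length": qlen,
                "neighbour": tname,
                "pctid": round(100.0 * nmatch / alnlen, 1),
            }
    return best_hit
```

A file produced by minimap2 can be passed directly, e.g. `best_hit_for_longest_contig(open("contigs.paf"))`, where `contigs.paf` is a placeholder file name.
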

rcedgar commented 3 years ago
  1. cov5.fa is not complete genomes; it's all CoV records from GB, including probable FPs. We can't use all GB complete genomes because some of them are identical or have only a couple of SNPs. I think we should use the subset of ~800 "complete genomes" from the 99% OTUs https://serratus-public.s3.amazonaws.com/seq/cov5/nt_otus.id99.fa. This subset is not posted on S3 but is trivial to extract by selecting deflines with "complete genome".
  2. Sounds reasonable.
rchikhi commented 3 years ago
  1. I hadn't written enough info: I meant that the original cov5.fa was taken, but further split into complete/fragments and also clustered at 97%, according to https://github.com/ababaian/serratus/issues/196#issuecomment-658907089. But if you'd rather I rerun alignment to nt_otus.id99, it's def possible. EDIT: OK, I just noticed above that you thought we should use nt_otus.id99; will rerun using that. How about the fragments? I'm assuming I should take all of nt_otus.id99 that's not a complete genome?
  2. Ok will modify to have that hit instead.
rcedgar commented 3 years ago
  1. Align to (a) nt_otus.id99 "complete genomes" and separately to (b) nt_otus.id99 fragments (=not "complete genome").
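
The split into (a) and (b) by defline can be sketched as below. This is an illustration only, not the script used in the thread: the function name `split_by_defline` is hypothetical, and matching the keyword case-insensitively is an assumption.

```python
def split_by_defline(fasta_lines, keyword="complete genome"):
    """Split FASTA lines into (complete, fragments) by defline keyword.

    A record goes to `complete` when its ">" header contains the keyword;
    every other record goes to `fragments`.
    """
    complete, fragments = [], []
    target = None  # list receiving the current record's lines
    needle = keyword.lower()
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the keyword match ignores letter case.
            target = complete if needle in line.lower() else fragments
        if target is not None:  # ignore any text before the first header
            target.append(line)
    return complete, fragments


if __name__ == "__main__":
    demo = [
        ">X1 Some coronavirus, complete genome\n", "ACGT\n",
        ">X2 Some coronavirus RdRP gene, partial cds\n", "GGCC\n",
    ]
    complete, fragments = split_by_defline(demo)
    print(len(complete), len(fragments))
```

Writing the two returned lists to separate files gives the two alignment targets.
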
rchikhi commented 3 years ago

Done, using updated scripts at:
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/master_table/minimap2_contigs.sh
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/master_table/master_table.py#L62

nt_otus.id99 complete/fragments at:
https://serratus-rayan.s3.amazonaws.com/cov5/nt_otus.id99.complete.fa
https://serratus-rayan.s3.amazonaws.com/cov5/nt_otus.id99.frag.fa

Updated master table at the same location:
https://serratus-rayan.s3.amazonaws.com/sra_master_table.csv