Data Compilation 1: CoV master table

ababaian commented 3 years ago

F1: CoV Phylogenetic tree

Objective: Create a master table for Coronaviridae containing all GenBank records. In practice this will have to be split over several TSV files, which will be stored in s3://serratus-public/seq/cov5/

Tasks

[x] Create "Master Table" google doc. Link
[x] Propose version1 table schema (Sheet 2 in doc). Discuss below.
[ ] Download cov5 global list. Fasta / GenBank / GFF records (Artem In progress)
[ ] Create ML phylogenetic tree for exemplar members of Nidovirales. Deliver tree files. (Pierre: 07/14. RCE EDIT: the 14th is unrealistic; we need all PSI-Serratus assemblies completed and final OTU assignments before Pierre can make the tree, see #194)

TSVs

cov5 query: "txid11118[Organism:exp]" on 20/07/11. 43617 entries returned.

nido5 query: "txid76804[Organism:exp] NOT txid11118[Organism:exp]" on 20/07/11. 37050 entries returned.
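For anyone re-running the counts, here is a minimal sketch of how such a query could be turned into an NCBI E-utilities `esearch` request. The `nuccore` database and the helper name `esearch_url` are assumptions, not something stated in this thread; the sketch only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities esearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# nido5 query string as written above.
NIDO5_QUERY = "txid76804[Organism:exp] NOT txid11118[Organism:exp]"


def esearch_url(term, db="nuccore", retmax=0):
    """Build an esearch URL for `term`.

    db="nuccore" is an assumption (GenBank nucleotide records).
    retmax=0 asks esearch for the hit count only, without returning UIDs.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


nido5_url = esearch_url(NIDO5_QUERY)
```

Counts will drift from the figures above as GenBank grows, so the query date should be recorded alongside any new number.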

Master Table - SRA Run Info
- [x] All SRA RunInfo tables available at s3://lovelywater/sra/
- [x] sra.taxid.tsv.gz : Taxid annotation for SRA files processed in Serratus link
Master Table - Coronaviridae
- [x] cov5.acc : GenBank Accession list link
- [x] cov5.fa : Complete fasta records for Coronaviridae. link
- [x] cov5.fa.fai : Fasta index records link
- [x] cov5.gb : GenBank records link
- [ ] cov5.tax : NCBI taxonomic classifications. Complete taxid hierarchy + scientific name
- [ ] cov5.host : NCBI taxonomic classification of virus host. Complete hierarchy where available + scientific name.
- [ ] cov5.stax : Serraplace + Serratax classification for each GenBank record
- [ ] cov5.aln : Closest global alignment of each record to: RefSeq, GenBank Whole Genome, GenBank all.
- [ ] cov5.ann : TSV of binary vector for the presence/absence of each major annotation feature.
Master Table - Nidovirales *
- [x] nido5.acc : GenBank Accession list link
- [x] nido5.fa : Complete fasta records for Nidovirales. link
- [ ] nido5.fa.fai : Fasta index records
- [x] nido5.gb : GenBank records link
- [ ] nido5.tax : NCBI taxonomic classifications. Complete taxid hierarchy + scientific name
- [ ] nido5.host : NCBI taxonomic classification of virus host. Complete hierarchy where available + scientific name.
- [ ] nido5.stax : Serraplace + Serratax classification for each GenBank record
- [ ] nido5.aln : Closest global alignment of each record to: RefSeq, GenBank Whole Genome, GenBank all.
- [ ] nido5.ann : TSV of binary vector for the presence/absence of each major annotation feature.
Select Tables
- [x] toro5_cg.fa : Toroviridae - Complete Genomes (Outgroup) [link](https://serratus-public.s3.amazonaws.com/seq/cov5/toro5_cg.fa)

ababaian commented 3 years ago

Currently Proposed Schema

Field	Notes
accession	GenBank Accession; OR Serratus Accession
name	Header for fasta record
exemplar	Accession of the selected "exemplar" of which this sequence is a member.
taxid	viral taxonomic ID, species level if possible
serratax_id	serratax inferred taxonomic ID
serraplace_id	serraplace inferred taxonomic ID
taxid_sub-genus	viral taxonomic ID for sub-genus; priority 1) taxid 2) consensus serratax/serraplace 3) human review. OTU identifier if unplaced.
taxid_genus	viral taxonomic ID for genus; priority 1) taxid 2) consensus serratax/serraplace 3) human review. OTU identifier if unplaced.
sra	Serratus assemblies, SRA accession; GenBank, NA
length	nucleotides in record
refseq_neighbour	Closest RefSeq record. For RefSeq records include self accession.
rs_pctid	Percent nucleotide identity to closest RefSeq
genome_neighbour	Closest GenBank CoV whole-genome. For GenBank records include self accession.
gn_pctid	Percent nucleotide identity to closest CoV genbank whole-genome
fragment_neighbour	Closest GenBank CoV record. For GenBank records include self accession.
fr_pctid	Percent nucleotide identity to closest genbank CoV fragment
5UTR	CV predicted 5' UTR present. T/F
3UTR	CV predicted 3' UTR present. T/F
RdRP	HMM predicted RdRP (pol) present. T/F
whole_genome	If 5UTR and 3UTR == T. Infer whole genome present.
host_taxid	Species of host taxonomic identifier when explicitly available
host_taxid_inferred	Host species taxonomic identifier including inferred records via our parsing
host_orderid	Order of host from host_taxid_inferred
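
A rough sketch of how a few of the derived fields could be computed, in case it helps the schema discussion. The class and function names are hypothetical, only a subset of columns is shown, and "consensus" is assumed here to mean that serratax and serraplace return the same ID.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class MasterRow:
    """Subset of the proposed schema (hypothetical class, not project code)."""
    accession: str
    length: int
    taxid: Optional[int] = None
    serratax_id: Optional[int] = None
    serraplace_id: Optional[int] = None
    utr5: bool = False  # "5UTR" column
    utr3: bool = False  # "3UTR" column
    rdrp: bool = False  # "RdRP" column

    @property
    def whole_genome(self) -> bool:
        # Schema rule: whole genome is inferred when both UTRs are present.
        return self.utr5 and self.utr3


def resolve_rank_taxid(ncbi: Optional[int],
                       serratax: Optional[int],
                       serraplace: Optional[int],
                       reviewed: Optional[int] = None,
                       otu_id: Optional[str] = None) -> Union[int, str, None]:
    """Apply the stated priority for taxid_genus / taxid_sub-genus.

    1) NCBI taxid, 2) serratax/serraplace consensus (assumed: both agree),
    3) human review, else fall back to the OTU identifier.
    """
    if ncbi is not None:
        return ncbi
    if serratax is not None and serratax == serraplace:
        return serratax
    if reviewed is not None:
        return reviewed
    return otu_id


row = MasterRow("SRR11140745", 29194, serratax_id=694009, utr5=True, utr3=True)
```

When the two classifiers disagree and no human review exists, the function returns the OTU identifier, which matches the "OTU identifier if unplaced" note in the schema.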

rcedgar commented 3 years ago

@ababaian Please correct / confirm. "Exemplar" for tree-building: I assume this means one sequence per species or OTU, and suggest we use generically "OTU" for "species or sequence cluster when species name not available". I am constructing OTUs and will deliver to @Pbdas ASAP for tree-building.

ababaian commented 3 years ago

"Exemplar" is what you're saying, one sequence per species or OTU that we define as 'canonical'. This should follow the previous priority order we've established: 1) RefSeq 2) Genbank Whole Genome 3) GenBank fragment 4) Assembly. As such each cluster of sequences will be under the umbrella of a single "Exemplar" that will be named by the exemplar accession.

E.g., all SARS-CoV-2 sequences will contain "NC_045512" in this field as this is the highest ranking sequence in that OTU. Frank will be "serr1234" as that will be the highest ranking sequence in the OTU.

rcedgar commented 3 years ago

Actually, Genbank fragments cannot be included in OTUs because there is no way to measure identity of non-overlapping fragments -- they could be in the same species or highly diverged from each other. I doubt any fragment has a species name not assigned to a complete genome.

ababaian commented 3 years ago

Our operational definition of inclusion, as discussed in the call, is the presence of RdRP. Non-RdRP fragments can be listed as 'unclassified' for now.

ababaian commented 3 years ago

I came across the ICTV list of 'exemplar' virus sequences in GenBank. ( s3://serratus-public/seq/cov5/VMR 010520 MSL35.xlsx )

We should certainly include these sequences as our 'exemplars' as well, since each one provides a well-annotated 'name' and species designation at different places in the tree.

https://talk.ictvonline.org/taxonomy/vmr/m/vmr-file-repository/9603

rcedgar commented 3 years ago

Suggest updating the priority ranking for exemplars etc. to:

ICTV > RefSeq > GB complete > Serratus assembly

Assuming ICTV is "more official" than RefSeq.
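
To make the ranking concrete, here is a minimal sketch of picking one exemplar per OTU under this priority order. The source labels, OTU names and the grouping of records into OTUs are illustrative assumptions; the accessions are ones mentioned elsewhere in this thread.

```python
# Lower rank wins: ICTV > RefSeq > GB complete > Serratus assembly.
SOURCE_RANK = {"ICTV": 0, "RefSeq": 1, "GB_complete": 2, "Serratus": 3}


def pick_exemplars(records):
    """Map each OTU to the accession of its highest-priority member.

    `records` is an iterable of (accession, otu, source) tuples.
    Ties within the same source keep the first record seen.
    """
    best = {}
    for accession, otu, source in records:
        rank = SOURCE_RANK[source]
        if otu not in best or rank < best[otu][0]:
            best[otu] = (rank, accession)
    return {otu: accession for otu, (rank, accession) in best.items()}


# Illustrative input only; OTU labels are made up.
records = [
    ("MT263399.1", "otu_sars2", "GB_complete"),
    ("NC_045512", "otu_sars2", "RefSeq"),
    ("serr1234", "otu_frank", "Serratus"),
]
exemplars = pick_exemplars(records)
```

With this input the SARS-CoV-2 OTU is named by NC_045512 and the Frank OTU by serr1234, as described earlier in the thread.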

rcedgar commented 3 years ago

We should add a.a. identities to closest known species for taxonomic genes per #195.

rcedgar commented 3 years ago

Suggest adding fields for nt & protein classifier scores as discussed in #197

rchikhi commented 3 years ago

an updated master table, with seq technology, also with data from the ~6k BGC-extracted assemblies (not just CheckV) https://serratus-rayan.s3.amazonaws.com/sra_master_table.csv

rchikhi commented 3 years ago

Note: 40 ONT datasets, which all seem 'fine'. They're all SARS-CoV-2. E.g.:

accession	length	nb_contigs	category	serratax_id	serraplace_id	refseq_neighbour	refseq_pctid	genome_neighbour	genome_pctid	fragment_neighbour	fragment_pctid	platform
SRR11140745	29194	1	A	694009		NC_045512.2	100	MT263399.1	99.9	MT703964.1	81.2	OXFORD_NANOPORE
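
For reference, a small sketch of how rows for one platform could be pulled out of a table with these columns. It reads the single example row above from an in-memory string; the helper name is hypothetical, and the delimiter is a parameter because the posted file is a `.csv` while the row shown here is tab-separated.

```python
import csv
import io

# Header and example row as shown above (tab-separated).
SAMPLE = (
    "accession\tlength\tnb_contigs\tcategory\tserratax_id\tserraplace_id\t"
    "refseq_neighbour\trefseq_pctid\tgenome_neighbour\tgenome_pctid\t"
    "fragment_neighbour\tfragment_pctid\tplatform\n"
    "SRR11140745\t29194\t1\tA\t694009\t\tNC_045512.2\t100\tMT263399.1\t99.9\t"
    "MT703964.1\t81.2\tOXFORD_NANOPORE\n"
)


def rows_for_platform(handle, platform, delimiter="\t"):
    """Return all rows whose `platform` column equals `platform`."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [row for row in reader if row["platform"] == platform]


ont_rows = rows_for_platform(io.StringIO(SAMPLE), "OXFORD_NANOPORE")
```

All values come back as strings, so numeric columns such as `refseq_pctid` need an explicit `float()` before any comparison.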

rcedgar commented 3 years ago

@rchikhi How are you identifying the neighbors? Minimap2? Are you using my python2 script to process the SAM records, or what?

rchikhi commented 3 years ago

Yes, minimap2, your python script, results only extracted from the first hit. https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/master_table/minimap2_contigs.sh

It can be improved in two ways, following your feedback:
1) I'm using the original cov5.fa file as the set of complete genomes (uclust'd 97%). Still fine, or should I move to a newer one like nt_otus.id99.fa?
2) I can get stats instead from the longest contig, not just the first mapping hit.

rcedgar commented 3 years ago

cov5.fa is not complete genomes, it's all CoV records from GB, including probable FPs. We can't use all GB complete genomes because some of them are identical or have only a couple of SNPs. I think we should use the subset of ~800 "complete genomes" from the 99% OTUs https://serratus-public.s3.amazonaws.com/seq/cov5/nt_otus.id99.fa. This subset is not posted on S3 but is trivial to extract by selecting deflines with "complete genome".
Sounds reasonable.

rchikhi commented 3 years ago

I hadn't written enough info: I meant that the original cov5.fa was taken, but further split into complete/fragments and also clustered at 97%, according to https://github.com/ababaian/serratus/issues/196#issuecomment-658907089. But if you'd rather I rerun alignment to nt_otus.id99, it's def possible. EDIT: OK, I just noticed above that you thought we should use nt_otus.id99, will rerun using that. How about the fragments? Should I take all of nt_otus.id99 that's not a complete genome, I'm assuming?
Ok will modify to have that hit instead.

rcedgar commented 3 years ago

Align to (a) nt_otus.id99 "complete genomes" and separately to (b) nt_otus.id99 fragments (=not "complete genome").
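
A minimal sketch of the split described here, selecting records by whether the defline contains "complete genome". The function name is hypothetical, matching is assumed to be case-insensitive, and the sequences in the toy input are placeholders rather than real data.

```python
def split_fasta_by_defline(lines, marker="complete genome"):
    """Split FASTA lines into (complete, fragments) by defline content.

    A record goes to `complete` when its header contains `marker`
    (case-insensitive, an assumption); everything else goes to `fragments`.
    Lines before the first header are ignored.
    """
    complete, fragments = [], []
    target = None
    for line in lines:
        if line.startswith(">"):
            target = complete if marker in line.lower() else fragments
        if target is not None:
            target.append(line)
    return complete, fragments


# Toy input: the second accession and both sequences are placeholders.
TOY_FASTA = [
    ">NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome",
    "ATTAAAGGTT",
    ">FRAG0001 hypothetical partial RdRP record",
    "ACGTACGT",
]
complete, fragments = split_fasta_by_defline(TOY_FASTA)
```

For the real file the same function can be fed an open file handle for nt_otus.id99.fa, with each returned list written back out with `writelines`.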

rchikhi commented 3 years ago

Done, using updated scripts at:
- https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/master_table/minimap2_contigs.sh
- https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/master_table/master_table.py#L62

nt_otus.id99 complete/fragments at:
- https://serratus-rayan.s3.amazonaws.com/cov5/nt_otus.id99.complete.fa
- https://serratus-rayan.s3.amazonaws.com/cov5/nt_otus.id99.frag.fa

Updated master table at the same location: https://serratus-rayan.s3.amazonaws.com/sra_master_table.csv
