ababaian closed this issue 3 years ago
Currently Proposed Schema
Field | Notes |
---|---|
accession | GenBank Accession; OR Serratus Accession |
name | Header for fasta record |
exemplar | Accession of the selected "exemplar" of which this sequence is a member. |
taxid | viral taxonomic ID, species level if possible |
serratax_id | serratax inferred taxonomic ID |
serraplace_id | serraplace inferred taxonomic ID |
taxid_sub-genus | viral taxonomic ID for sub-genus; priority 1) taxid 2) consensus serratax/serraplace 3) human review. OTU identifier if unplaced. |
taxid_genus | viral taxonomic ID for genus; priority 1) taxid 2) consensus serratax/serraplace 3) human review. OTU identifier if unplaced. |
sra | SRA accession for Serratus assemblies; NA for GenBank records |
length | nucleotides in record |
refseq_neighbour | Closest RefSeq record. For RefSeq records include self accession. |
rs_pctid | Percent nucleotide identity to closest RefSeq record |
genome_neighbour | Closest GenBank CoV whole genome. For GenBank records include self accession. |
gn_pctid | Percent nucleotide identity to closest GenBank CoV whole genome |
fragment_neighbour | Closest GenBank CoV record. For GenBank records include self accession. |
fr_pctid | Percent nucleotide identity to closest GenBank CoV fragment |
5UTR | CV predicted 5' UTR present. T/F |
3UTR | CV predicted 3' UTR present. T/F |
RdRP | HMM predicted RdRP (pol) present. T/F |
whole_genome | Whole genome inferred present if both 5UTR and 3UTR == T. |
host_taxid | Species-level taxonomic identifier of the host, when explicitly available |
host_taxid_inferred | Host species taxonomic identifier, including records inferred via our parsing |
host_orderid | Order of host from host_taxid_inferred |
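
The two derived fields in the table (whole_genome, and the priority rule shared by taxid_sub-genus and taxid_genus) can be sketched in a few lines. This is only an illustration under assumptions: the function names and arguments are hypothetical, every candidate ID is assumed to be already mapped to the target rank, and "consensus" is read as serratax and serraplace agreeing exactly.

```python
from typing import Optional


def infer_whole_genome(utr5: bool, utr3: bool) -> bool:
    """whole_genome is T only when both the 5' and 3' UTR are predicted present."""
    return utr5 and utr3


def resolve_genus_taxid(
    taxid: Optional[str],
    serratax_id: Optional[str],
    serraplace_id: Optional[str],
    reviewed_id: Optional[str],
    otu_id: str,
) -> str:
    """Apply the priority from the schema notes (hypothetical helper).

    1) NCBI taxid, 2) consensus of serratax/serraplace, 3) human review,
    otherwise fall back to the OTU identifier for unplaced sequences.
    All inputs are assumed to be genus-level already.
    """
    if taxid is not None:
        return taxid
    if serratax_id is not None and serratax_id == serraplace_id:
        return serratax_id
    if reviewed_id is not None:
        return reviewed_id
    return otu_id
```

If a looser notion of consensus is wanted (for example, agreement at a higher rank), only the second branch would need to change.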
@ababaian Please correct / confirm. "Exemplar" for tree-building: I assume this means one sequence per species or OTU, and I suggest we use "OTU" generically for "species, or sequence cluster when a species name is not available". I am constructing OTUs and will deliver them to @Pbdas ASAP for tree-building.
"Exemplar" is what you're saying: one sequence per species or OTU that we define as 'canonical'. This should follow the priority order we've previously established: 1) RefSeq 2) GenBank Whole Genome 3) GenBank fragment 4) Assembly. As such, each cluster of sequences will sit under the umbrella of a single "Exemplar", and the cluster will be named by the exemplar accession.
E.g. all SARS-CoV-2 sequences will contain "NC_045512" in this field, as this is the highest-ranking sequence in that OTU. Frank will be "serr1234", as that will be the highest-ranking sequence in its OTU.
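
A minimal sketch of the exemplar rule described above: within each OTU, the highest-ranking sequence becomes the exemplar, and every member records that accession. The source labels, the record tuple layout, the OTU names, and the accession `hypothetical_gb_1` are assumptions made for illustration; only NC_045512 and serr1234 come from the discussion.

```python
# Priority order from the discussion: lower index = higher rank.
PRIORITY = ("refseq", "genbank_whole_genome", "genbank_fragment", "assembly")


def assign_exemplars(records, priority=PRIORITY):
    """Map each accession to the exemplar accession of its OTU.

    records: list of (accession, otu, source) tuples, where source is one
    of the labels in `priority`. Ties keep the first record seen.
    """
    rank = {source: i for i, source in enumerate(priority)}
    best = {}  # otu -> (accession, source) of the current best candidate
    for accession, otu, source in records:
        current = best.get(otu)
        if current is None or rank[source] < rank[current[1]]:
            best[otu] = (accession, source)
    return {accession: best[otu][0] for accession, otu, _ in records}


records = [
    ("hypothetical_gb_1", "otu_sars_cov_2", "genbank_whole_genome"),
    ("NC_045512", "otu_sars_cov_2", "refseq"),
    ("serr1234", "otu_frank", "assembly"),
]
exemplars = assign_exemplars(records)
```

Because the ranking is passed in as a tuple, a revised order can be tried by supplying a different `priority` argument without touching the function.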
Actually, GenBank fragments cannot be included in OTUs because there is no way to measure the identity of non-overlapping fragments: they could be in the same species or highly diverged from each other. I doubt any fragment has a species name not assigned to a complete genome.
Our operational definition of inclusion, as discussed in the call, is the presence of RdRP. Non-RdRP fragments can be listed as 'unclassified' for now.
I came across the ICTV list of 'exemplar' virus sequences in GenBank. ( s3://serratus-public/seq/cov5/VMR 010520 MSL35.xlsx )
We should certainly include these sequences among our 'exemplars' as well, since each carries a well-annotated 'name' and species designation, and they sit in different places in the tree.
https://talk.ictvonline.org/taxonomy/vmr/m/vmr-file-repository/9603
Suggest updating the priority ranking for exemplars etc. to:
ICTV > RefSeq > GB complete > Serratus assembly
This assumes ICTV is "more official" than RefSeq.
We should add a.a. identities to closest known species for taxonomic genes per #195.
Suggest adding fields for nt & protein classifier scores as discussed in #197
An updated master table, with sequencing technology, and also with data from the ~6k BGC-extracted assemblies (not just CheckV): https://serratus-rayan.s3.amazonaws.com/sra_master_table.csv
Note: 40 ONT datasets, which all seem 'fine'. They're all SARS-CoV-2. E.g.:

accession | length | nb_contigs | category | serratax_id | serraplace_id | refseq_neighbour | refseq_pctid | genome_neighbour | genome_pctid | fragment_neighbour | fragment_pctid | platform |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SRR11140745 | 29194 | 1 | A | 694009 | NC_045512.2 | 100 | MT263399.1 | 99.9 | MT703964.1 | 81.2 | OXFORD_NANOPORE |
@rchikhi How are you identifying the neighbors? Minimap2? Are you using my python2 script to process the SAM records, or what?
Yes, minimap2, your python script, results only extracted from the first hit.
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/master_table/minimap2_contigs.sh
It can be improved in two ways, following your feedback:
1) I'm using the original cov5.fa file as the set of complete genomes (uclust'd 97%). Still fine, or should I move to a newer one like nt_otus.id99.fa?
2) I can get stats instead from the longest contig, not just the first mapping hit.
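
To make the difference between "first mapping hit" and "longest contig" concrete, here is a rough sketch that reads minimap2 alignments and reports percent identity for either choice. It assumes PAF output (minimap2's default format) rather than the SAM records handled by the script mentioned above, and it computes identity as residue matches divided by alignment block length (PAF columns 10 and 11), which is only approximate unless base-level alignment is enabled. The function names are hypothetical.

```python
def parse_paf_line(line):
    """Extract the PAF columns needed for neighbour stats."""
    f = line.rstrip("\n").split("\t")
    return {
        "query": f[0],          # contig name
        "qlen": int(f[1]),      # contig length
        "target": f[5],         # reference accession
        "matches": int(f[9]),   # number of residue matches
        "block_len": int(f[10]),  # alignment block length
    }


def pct_identity(hit):
    """Percent identity as matches over alignment block length."""
    return 100.0 * hit["matches"] / hit["block_len"]


def first_hit(lines):
    """Current behaviour: take whatever alignment appears first."""
    for line in lines:
        if line.strip():
            return parse_paf_line(line)
    return None


def longest_contig_hit(lines):
    """Proposed behaviour: take the best alignment of the longest contig."""
    hits = [parse_paf_line(line) for line in lines if line.strip()]
    if not hits:
        return None
    # Longest contig first; among its alignments, prefer the most matches.
    return max(hits, key=lambda h: (h["qlen"], h["matches"]))
```

With a multi-contig assembly, the two functions can return different neighbours and identities whenever a short contig happens to be listed before the main one.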
cov5.fa was taken, but further split into complete/fragments and also clustered at 97%, according to https://github.com/ababaian/serratus/issues/196#issuecomment-658907089. But if you'd rather I rerun alignment to nt_otus.id99, it's def possible.
EDIT: OK I just noticed above that you thought we should use nt_otus.id99, will rerun using that. How about the fragments? Should I take all of nt_otus.id99 that's not a complete genome, I'm assuming?
Done, using updated scripts at:
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/master_table/minimap2_contigs.sh
https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/master_table/master_table.py#L62
nt_otus.id99 complete/fragments at:
https://serratus-rayan.s3.amazonaws.com/cov5/nt_otus.id99.complete.fa
https://serratus-rayan.s3.amazonaws.com/cov5/nt_otus.id99.frag.fa
Updated master table at same location: https://serratus-rayan.s3.amazonaws.com/sra_master_table.csv
F1: CoV Phylogenetic tree
Objective: Create a master table for Coronaviridae containing all GenBank records. This will in essence have to be split over several tsv files, which will be stored in s3://serratus-public/seq/cov5/
Tasks
- cov5 global list. Fasta / GenBank / GFF records (Artem: in progress)
- TSVs

cov5 query: "txid11118[Organism:exp]" on 20/07/11. 43617 entries returned
nido5 query: "txid76804[Organism:exp] NOT txid11118[Organism:exp]" on 20/07/11. 37050 entries returned

Master Table - SRA Run Info
s3://lovelywater/sra/
- sra.taxid.tsv.gz: Taxid annotation for SRA files processed in Serratus (link)

Master Table - Coronaviridae
- cov5.acc: GenBank Accession list (link)
- cov5.fa: Complete fasta records for Coronaviridae. (link)
- cov5.fa.fai: Fasta index records (link)
- cov5.gb: GenBank records (link)
- cov5.tax: NCBI taxonomic classifications. Complete taxid hierarchy + scientific name
- cov5.host: NCBI taxonomic classification of virus host. Complete hierarchy where available + scientific name.
- cov5.stax: Serraplace + Serratax classification for each GenBank record
- cov5.aln: Closest global alignment of each record to: RefSeq, GenBank Whole Genome, GenBank all.
- cov5.ann: TSV of binary vector for the presence/absence of each major annotation feature.

Master Table - Nidovirales *
- nido5.acc: GenBank Accession list (link)
- nido5.fa: Complete fasta records for Nidovirales. (link)
- nido5.fa.fai: Fasta index records
- nido5.gb: GenBank records (link)
- nido5.tax: NCBI taxonomic classifications. Complete taxid hierarchy + scientific name
- nido5.host: NCBI taxonomic classification of virus host. Complete hierarchy where available + scientific name.
- nido5.stax: Serraplace + Serratax classification for each GenBank record
- nido5.aln: Closest global alignment of each record to: RefSeq, GenBank Whole Genome, GenBank all.
- nido5.ann: TSV of binary vector for the presence/absence of each major annotation feature.

Select Tables
- toro5_cg.fa: Toroviridae - Complete Genomes (Outgroup): https://serratus-public.s3.amazonaws.com/seq/cov5/toro5_cg.fa
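
For reproducibility, the two Entrez queries listed under Tasks could be re-run through NCBI E-utilities. The sketch below only builds the ESearch request URL and does not contact NCBI; the choice of the `nuccore` database and the `esearch_url` helper are assumptions, since the issue does not say how the queries were originally submitted.

```python
from urllib.parse import parse_qs, urlencode, urlparse

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query terms as recorded in the issue.
COV5_TERM = "txid11118[Organism:exp]"
NIDO5_TERM = "txid76804[Organism:exp] NOT txid11118[Organism:exp]"


def esearch_url(term, db="nuccore", retmax=0):
    """Build an ESearch URL; retmax=0 asks only for the hit count."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


cov5_url = esearch_url(COV5_TERM)
nido5_url = esearch_url(NIDO5_TERM)
```

Record counts returned today will differ from the 20/07/11 figures, since GenBank has grown since the queries were first run.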