Open rchikhi opened 3 years ago
All gene_clusters concatenated into a single file: s3://serratus-public/assemblies/dicistro/analysis/all_gene_clusters.fa.gz
Diamond of all gene_clusters assemblies versus rdrp0_q_d: s3://serratus-public/assemblies/dicistro/analysis/all_gene_clusters.fa.diamond_vs_rdrp0_q_d.fmt6.gz
Diamond of all gene_clusters assemblies versus dicistro.protref.aa: s3://serratus-public/assemblies/dicistro/analysis/all_gene_clusters.fa.diamond_vs_dicistro_protref.fmt6.gz
My analysis of the above data is posted on S3 here:
s3://serratus-public/rce/quenya_analysis/ s3://serratus-public/rce/dicistro_analysis/
Selection criteria are: contig length >300nt, RdRp/ORF1 id in range 50..75%.
Many assemblies have hits to multiple contigs, but many of these look like assembly problems where the same virus is assembled multiple times. Therefore, I select only the first matching hit from each SRA. These results may therefore underestimate the correct number in cases where there are multiple viruses in one SRA.
Each directory contains the following files:
first_hit_id50to75.fa -- this is the aligned query segment from the diamond output.
first_hit_id50to75_clustered_id70.fa -- This is first_hit_id50to75.fa clustered at 70% nt identity.
first_hit_id50to75.tsv -- tabbed file with fields 1. SRA, 2. contig, 3. contig_length, 4. pctid, 5. database_label.
Pathracer-seq-fs
analysis :
all files are in s3://serratus-public/assemblies/dicistro/analysis/
(1) the a.a. translations of RdRp according to PR
(PR applied to all_gene_clusters.fa
and the concatenation of RdRP_1+RdRP_2+RdRP_3+RdRP_4+RdRP_quenya)
all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.fa
(2) those RdRp a.a. sequences clustered at 97%id
all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.centroids.fa
(3) diamond search of the clustered sequences vs. dicistro.protref.aa
all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.centroids.diamond_vs_dicistro.protref.fmt6
(4) diamond search of the clustered sequences vs. rdrp0.
all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.centroids.diamond_vs_rdrp0_r1.fmt6
cmdlines used: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/cluster.sh
what did you mean by "I don't see any of these" in (4) above?
Ah sorry :) was a copy-paste of your message that I forgot to erase!
My analysis uploaded to: s3://serratus-public/rce/dicistro_analysis/
The RdRp a.a. sequences provided by Rayan above were clustered at 90% id ("species") and classified according to %id to rdrp0 and disistro,protref.aa.
Files are:
known_species.fa
FASTA containing sequences with >=90% match to any known RdRp.
novel_species.fa
FASTA containing sequences with no match >=90% to any known RdRp.
otutable.tsv
Tabbed file with one line per species giving top hit to rdrp0 and/or dicistro.protref with %identities.
type_counts.tsv
Number of species in each category:
11100 NovelFamily # species is <50% to any known RdRp
5980 NovelGenus_Other # 50..75% to known RdRp from non-Dicistro family
2196 KnownSpecies # >=90% to a known RdRp
1751 NovelSpecies_Other # 75..90% to known RdRp from non-Dicistro family
586 NovelGenus_Dicistro # 50..75% to known RdRp from Dicistro
197 Novel_NoMatch # Self-explanatory. Is this really an RdRp?
98 NovelSpecies_Dicistro # 75..90% to known Dicistro RdRp
Pathracer analysis versus all assembly graphs (not the gene_clusters this time):
all files are in s3://serratus-public/assemblies/dicistro/analysis/
(1) the a.a. translations of RdRp according to PR
(PR applied to each assembly graph and the concatenation of RdRP_1+RdRP_2+RdRP_3+RdRP_4+RdRP_quenya)
all_assembly_graphs.RdRP_1234q.seqs.fa.gz
(2) those RdRp a.a. sequences clustered at 97%id
all_assembly_graphs.RdRP_1234q.seqs.centroids.fa.gz
(3) diamond search of the clustered sequences vs. dicistro.protref.aa
all_assembly_graphs.RdRP_1234q.seqs.centroids.diamond_vs_dicistro.protref.fmt6.gz̀
(4) diamond search of the clustered sequences vs. rdrp0.
all_assembly_graphs.RdRP_1234q.seqs.centroids.diamond_vs_rdrp0_r1.fmt6.gz̀
scripts used: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_run.sh https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_organize.sh
Results uploaded to s3://serratus-public/rce/dicistro_analysis/pr_graphs/
, same filenames as in my comment above. If this analysis is correct, there are 59k new species. Summary totals:
21886 NovelGenus_Other
20666 NovelFamily
10955 NovelSpecies_Other
6439 KnownSpecies
4310 Novel_NoMatch
1340 NovelGenus_Dicistro
99 NovelSpecies_Dicistro
Monkey work checking rce/novel_species.fa
file.
From: novel_species.fa
285 >SRR6400004;Type=Novel_NoMatch;Rdrp0=none;Dicistro=none;
286 KHTQTLKHTHTLKHTHKHTHKHTLKHSNTLKHTHTHTLKHTLKHTHTQTHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTN
287 THTHTHTHTHTHTNTHTHTHTHTQTHTQTHTHSNTHTHTLKHTLKHTHSNTHSNTHTHTHTHTHTHTHTQTHTHTHTHTH
288 KHTLKHTHTQTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHKHTHTHTHTHTHTHKHTHTHTHTHTHTHSNTHSNTHTQTH
289 TQTHTHTHTHTHTHTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTHTHTHTHTQTH
290 TQTHTHSNTHTHTLKHTHTHTHTHTHTHTHTHTLKHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTLKHTLKHTHSNTHSN
291 THTHTHTHTHTHTHTQTHTHTHKHTHTHTHTHTNTHTNTHTHKHTHTHTQTHTQTHTHTHTHTHTHTQTHTHTHTHTHTH
292 TLKHTLKHTHSNTHSNTHTHTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHTHTHTHTHSNTHSNTHTHTHTHSNT
Diamond finds no similarity of this in either rdrp0 or dicistro databases. This has no dicistro/rdrp0 match yet the HMM models hit. Traceback on the hit in pathracer, Anton working on this.
Yes, I found these a few days ago; there is a discussion with Anton on the slack . These are filtered out by the diamond E-value, but we should keep this issue in mind. PR does not report an E-value, so any long enough ORF will induce an alignment.
From: rce/novel_species.fa
Line: 5573 >SRR1036663;Type=NovelGenus_Other;rdrp0=rdrp2.Picornaviridae.Human_poliovirus_1:CAA24445.1(70.4%);Dicistro=none;
EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQLMSLDINTEQMCLEDAMYGTD
GLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYV
AMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTK
VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDASLSPAWFE
ALKMGLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNSHHLYKNKTYCVKGGMPSGCSGTSI
FNSMINNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK
RQSGKDYGLTMTPADKSATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRWTKDPRNTQDHVRSLCLLAWH
NGEEEYNKFLAKIRSVPI
Blastp hit: 70.93% from polio
Select seq gb|AFQ38633.1| RNA polymerase 3D [Human poliovirus 1] Human poliovirus 1 781 781 100% 0.0 70.93% 461 AFQ38633.1
Query 1 EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL 60
EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL
Sbjct 26 EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL 85
Query 61 MSLDINTEQMCLEDAMYGTDGLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLS 120
MSLDINTEQMCLEDAMYGTDGLEALDLSTS
Sbjct 86 MSLDINTEQMCLEDAMYGTDGLEALDLSTS------------------------------ 115
Query 121 TSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYVAMGKKKRDILNKQTRDTKEM 180
AGYPYVA+GKKKRDI+NKQTRDTKE M
Sbjct 116 --AGYPYVALGKKKRDIMNKQTRDTKE--------------------------------M 141
Query 181 QKLLDTYGINLPLVTYVKDELRSKNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTK 240
Q+LLDTYGINLPLVTYVKDEL RS+TK
Sbjct 142 QRLLDTYGINLPLVTYVKDEL----------------------------------RSRTK 167
Query 241 VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEE 300
VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGV+TGSAVGCDPDLFWSKIPVLMEE
Sbjct 168 VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVVTGSAVGCDPDLFWSKIPVLMEE 227
Query 301 KLFAFDYTGYDASLSPAWFEALKMGLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP 360
KLFAFDYTGYDASLSPAWFEALKM LEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP
Sbjct 228 KLFAFDYTGYDASLSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP 287
Query 361 SGCSGTSIFNSMINNSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNLIIRTLLLKTYKG 420
SGCSGTSIFNSM INNLIIRTLLLKTYKG
Sbjct 288 SGCSGTSIFNSM--------------------------------INNLIIRTLLLKTYKG 315
Query 421 IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK 480
IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVT
Sbjct 316 IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVT--- 372
Query 481 RQSGKDYGLTMTPADKSATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRW 540
FLKRFFRADEKYPFLIHPVMPMKEIHESIRW
Sbjct 373 -----------------------------FLKRFFRADEKYPFLIHPVMPMKEIHESIRW 403
Query 541 TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI 578
TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI
Sbjct 404 TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI 441
That library contains a good clean hit to polio, and the hit above is from "another assembly" I am guessing.
From rc/all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.fa
>Score=79.0939|Bitscore=161.231|PartialBitscore=161.231|Seq=SRR1036663.NODE_1_length_6046_cluster_1_candidate_1_domains_9|Position=4995|Frameshifts=0|Alignment=29M3D12M1D9M1D21M6D39M2D15M8D56M1I7M1D12M3D28M1D10M4D13M1D46M10D16M2D21M12D10M57D5M
From: .rc/all_gene_clusters.fa.diamond_vs_rdrp0_q_d.fmt6
SRR1036663.NODE_1_length_6046_cluster_1_candidate_1_domains_9 rdrp2.Picornaviridae.Human_poliovirus_1:CAA24445.1 99.7 339 1 0 5029 6045 1 339 1.5e-195 680.2 6046 425 GTGAAGGAACCAGCAGTCCTCACTAAAAACGATCCCAGGCTTAAGACAGACTTTGAGGAGGCAATTTTCTCCAAGTACGTGGGTAACAAAATTACTGAAGTGGATGAGTACATGAAAGAGGCAGTAGACCACTATGCTGGCCAGCTCATGTCACTAGACATCAACACAGAACAAATGTGCTTGGAGGATGCCATGTATGGCACTGATGGTCTAGAAGCACTTGATTTGTCCACCAGTGCTGGCTACCCTTATGTAGCAATGGGAAAGAAGAAGAGAGACATCTTGAACAAACAAACCAGAGACACTAAGGAAATGCAAAAACTGCTCGACACATATGGAATCAACCTCCCACTGGTGACTTATGTAAAGGATGAACTTAGATCCAAAACAAAGGTTGAGCAGGGGAAATCCAGATTAATTGAAGCTTCTAGTTTGAATGACTCAGTGGCAATGAGAATGGCTTTTGGGAACCTATATGCTGCTTTTCACAAAAACCCAGGAGTGATAACAGGTTCAGCAGTGGGGTGCGATCCAGATTTGTTTTGGAGCAAAATTCCGGTATTGATGGAAGAGAAGCTGTTTGCTTTTGACTACACAGGGTATGATGCATCTCTCAGCCCTGCTTGGTTCGAGGCACTAAAGATGGTGCTTGAGAAAATCGGATTCGGAGACAGAGTTGACTACATCGACTACCTAAACCACTCACACCACCTGTACAAGAATAAAACATACTGTGTCAAGGGCGGTATGCCATCTGGCTGCTCAGGCACTTCAATTTTTAACTCAATGATTAACAACTTGATTATCAGGACACTCTTACTGAAAACCTACAAGGGCATAGATTTAGACCACCTAAAAATGATTGCCTATGGTGATGATGTAATTGCTTCCTACCCCCATGAAGTTGACGCTAGTCTCCTAGCCCAATCAGGAAAAGACTATGGACTAACTATGACTCCAGCTGACAAATCAGCTACATTTGAAACAGTCACATGGGAGAATGTAACATTCTTGAAG VKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQLMSLDINIEQMCLEDAMYGTDGLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTKVEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDASLSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK
I've uploaded PR hits to s3://serratus-public/assemblies/dicistro/analysis/
split into two groups:
all_assembly_graphs.RdRP_1234q.scaffold_seqs.fa.gz
– all hits supported by scaffoldsall_assembly_graphs.RdRP_1234q.non_scaffold_seqs.fa.gz
– all other hitsFurther uploaded:
all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.fa.gz
all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.diamond_vs_rdrp0_r1.fmt6.gz
all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.diamond_vs_dicistro.protref.fmt6.gz
Here is the assembly graph from @asl for poliovirus:
The issue is when within a single assembly-graph there are multiple paths which provide a RdRp match, each one is being returned, this is yielding indels as seen above in slippage-prone viruses like polio (or other RNA viruses).
Proposed Solution: 1. For each RdRp output graph, allow each rdrp-containing edge (red highlight above) to be used at most ONCE per reported sequences. The "top hit" will be selected by percent-identity / top-score to a known virus (thus assuming the hit is not novel). The risk is if there is a novel virus in the same library as a known virus and they share homology over an edge of say 50 amino acids, then the novel virus would be excluded as the known virus takes priority. The benefit is this will reduce intra-sample viral variants to the most conservative.
One more caveat, assume rdrp is the red region in the graph above and the viral genome is green. If a sub-graph (blue) has a higher identity match then the longest match (green), but does not contain an end-to-end RdRp domain, the longest match with variants should take priority.
Analysis of high-trust scaffold sequences uploaded to:
s3://serratus-public/rce/dicistro_analysis/scaffolds/
15538 NovelFamily
9746 NovelGenus_Other
2252 NovelSpecies_Other
1713 KnownSpecies
877 Novel_NoMatch
481 NovelGenus_Dicistro
94 NovelSpecies_Dicistro
I extracted the assembled scaffolds that contain all of Anton's centroids (all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.fa.gz
from https://github.com/ababaian/serratus/issues/239#issuecomment-751520032).
Results are in: all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.scaffold_nucls.fa
in s3://serratus-public/assemblies/dicistro/analysis/
generated using: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_get_scaffolds.py note: the script failed to get a scaffold for 2% (760) of the centroids, let me know if you need to recover them all.
Coronaspades' gene_clusters file are here:
s3://serratus-public/assemblies/dicistro/gene_clusters/
Other coronaspades files:s3://serratus-public/assemblies/dicistro/other/
4921 assemblies made out of 5442 accessions:
s3://serratus-public/assemblies/dicistro/analysis/list_assembled_dicistro.txt