Dicistro assemblies - Githubissues

rchikhi commented 3 years ago

Coronaspades' gene_clusters file are here: s3://serratus-public/assemblies/dicistro/gene_clusters/ Other coronaspades files: s3://serratus-public/assemblies/dicistro/other/

4921 assemblies made out of 5442 accessions: s3://serratus-public/assemblies/dicistro/analysis/list_assembled_dicistro.txt

rchikhi commented 3 years ago

All gene_clusters concatenated into a single file: s3://serratus-public/assemblies/dicistro/analysis/all_gene_clusters.fa.gz

Diamond of all gene_clusters assemblies versus rdrp0_q_d: s3://serratus-public/assemblies/dicistro/analysis/all_gene_clusters.fa.diamond_vs_rdrp0_q_d.fmt6.gz

Diamond of all gene_clusters assemblies versus dicistro.protref.aa: s3://serratus-public/assemblies/dicistro/analysis/all_gene_clusters.fa.diamond_vs_dicistro_protref.fmt6.gz

rcedgar commented 3 years ago

My analysis of the above data is posted on S3 here:

s3://serratus-public/rce/quenya_analysis/ s3://serratus-public/rce/dicistro_analysis/

Selection criteria are: contig length >300nt, RdRp/ORF1 id in range 50..75%.

Many assemblies have hits to multiple contigs, but many of these look like assembly problems where the same virus is assembled multiple times. Therefore, I select only the first matching hit from each SRA. These results may therefore underestimate the correct number in cases where there are multiple viruses in one SRA.

Each directory contains the following files:

first_hit_id50to75.fa -- this is the aligned query segment from the diamond output.

first_hit_id50to75_clustered_id70.fa -- This is first_hit_id50to75.fa clustered at 70% nt identity.

first_hit_id50to75.tsv -- tabbed file with fields 1. SRA, 2. contig, 3. contig_length, 4. pctid, 5. database_label.

rchikhi commented 3 years ago

Pathracer-seq-fs analysis :

all files are in s3://serratus-public/assemblies/dicistro/analysis/

(1) the a.a. translations of RdRp according to PR

(PR applied to all_gene_clusters.fa and the concatenation of RdRP_1+RdRP_2+RdRP_3+RdRP_4+RdRP_quenya)

all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.fa

(2) those RdRp a.a. sequences clustered at 97%id

all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.centroids.fa

(3) diamond search of the clustered sequences vs. dicistro.protref.aa

all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.centroids.diamond_vs_dicistro.protref.fmt6

(4) diamond search of the clustered sequences vs. rdrp0.

all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.centroids.diamond_vs_rdrp0_r1.fmt6

cmdlines used: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/cluster.sh

rcedgar commented 3 years ago

what did you mean by "I don't see any of these" in (4) above?

rchikhi commented 3 years ago

Ah sorry :) was a copy-paste of your message that I forgot to erase!

rcedgar commented 3 years ago

My analysis uploaded to: s3://serratus-public/rce/dicistro_analysis/

The RdRp a.a. sequences provided by Rayan above were clustered at 90% id ("species") and classified according to %id to rdrp0 and disistro,protref.aa.

Files are:

known_species.fa FASTA containing sequences with >=90% match to any known RdRp.

novel_species.fa FASTA containing sequences with no match >=90% to any known RdRp.

otutable.tsv Tabbed file with one line per species giving top hit to rdrp0 and/or dicistro.protref with %identities.

type_counts.tsv Number of species in each category:

11100   NovelFamily # species is <50% to any known RdRp
5980    NovelGenus_Other # 50..75% to known RdRp from non-Dicistro family
2196    KnownSpecies # >=90% to a known RdRp
1751    NovelSpecies_Other # 75..90% to known RdRp from non-Dicistro family
586     NovelGenus_Dicistro # 50..75% to known RdRp from Dicistro
197     Novel_NoMatch # Self-explanatory. Is this really an RdRp?
98      NovelSpecies_Dicistro # 75..90% to known Dicistro RdRp

rchikhi commented 3 years ago

Pathracer analysis versus all assembly graphs (not the gene_clusters this time):

all files are in s3://serratus-public/assemblies/dicistro/analysis/

(1) the a.a. translations of RdRp according to PR

(PR applied to each assembly graph and the concatenation of RdRP_1+RdRP_2+RdRP_3+RdRP_4+RdRP_quenya)

all_assembly_graphs.RdRP_1234q.seqs.fa.gz

(2) those RdRp a.a. sequences clustered at 97%id

all_assembly_graphs.RdRP_1234q.seqs.centroids.fa.gz

(3) diamond search of the clustered sequences vs. dicistro.protref.aa

all_assembly_graphs.RdRP_1234q.seqs.centroids.diamond_vs_dicistro.protref.fmt6.gz̀

(4) diamond search of the clustered sequences vs. rdrp0.

all_assembly_graphs.RdRP_1234q.seqs.centroids.diamond_vs_rdrp0_r1.fmt6.gz̀

scripts used: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_run.sh https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_organize.sh

rcedgar commented 3 years ago

Results uploaded to s3://serratus-public/rce/dicistro_analysis/pr_graphs/, same filenames as in my comment above. If this analysis is correct, there are 59k new species. Summary totals:

21886   NovelGenus_Other
20666   NovelFamily
10955   NovelSpecies_Other
6439    KnownSpecies
4310    Novel_NoMatch
1340    NovelGenus_Dicistro
99      NovelSpecies_Dicistro

ababaian commented 3 years ago

Monkey work checking rce/novel_species.fa file.

Error Type 1: Low complexity sequences

From: novel_species.fa

    285 >SRR6400004;Type=Novel_NoMatch;Rdrp0=none;Dicistro=none;
    286 KHTQTLKHTHTLKHTHKHTHKHTLKHSNTLKHTHTHTLKHTLKHTHTQTHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTN
    287 THTHTHTHTHTHTNTHTHTHTHTQTHTQTHTHSNTHTHTLKHTLKHTHSNTHSNTHTHTHTHTHTHTHTQTHTHTHTHTH
    288 KHTLKHTHTQTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHKHTHTHTHTHTHTHKHTHTHTHTHTHTHSNTHSNTHTQTH
    289 TQTHTHTHTHTHTHTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTHTHTHTHTQTH
    290 TQTHTHSNTHTHTLKHTHTHTHTHTHTHTHTHTLKHTHTHTQTHTQTHTLKHTLKHTHTHTHTHTLKHTLKHTHSNTHSN
    291 THTHTHTHTHTHTHTQTHTHTHKHTHTHTHTHTNTHTNTHTHKHTHTHTQTHTQTHTHTHTHTHTHTQTHTHTHTHTHTH
    292 TLKHTLKHTHSNTHSNTHTHTHTHTHSNTHSNTHTQTHTQTHTHTHTHTHTHTHTHTHSNTHSNTHTHTHTHSNT

Diamond finds no similarity of this in either rdrp0 or dicistro databases. This has no dicistro/rdrp0 match yet the HMM models hit. Traceback on the hit in pathracer, Anton working on this.

rcedgar commented 3 years ago

Yes, I found these a few days ago; there is a discussion with Anton on the slack . These are filtered out by the diamond E-value, but we should keep this issue in mind. PR does not report an E-value, so any long enough ORF will induce an alignment.

ababaian commented 3 years ago

Putative Error 2: Known polio, likely misassembly

From: rce/novel_species.fa

Line: 5573 >SRR1036663;Type=NovelGenus_Other;rdrp0=rdrp2.Picornaviridae.Human_poliovirus_1:CAA24445.1(70.4%);Dicistro=none;
EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQLMSLDINTEQMCLEDAMYGTD
GLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYV
AMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTK
VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDASLSPAWFE
ALKMGLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNSHHLYKNKTYCVKGGMPSGCSGTSI
FNSMINNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK
RQSGKDYGLTMTPADKSATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRWTKDPRNTQDHVRSLCLLAWH
NGEEEYNKFLAKIRSVPI

Taken from SRR1036663: WT poliovirus serially passaged through HelaS3 - sequenced using CirSeq - Acevedo et al. Nature 2013

Blastp hit: 70.93% from polio

Select seq gb|AFQ38633.1|   RNA polymerase 3D [Human poliovirus 1]  Human poliovirus 1  781     781     100%    0.0     70.93%  461     AFQ38633.1

Query  1    EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL  60
            EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL
Sbjct  26   EPSAFHYVFEGVKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQL  85

Query  61   MSLDINTEQMCLEDAMYGTDGLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQNLS  120
            MSLDINTEQMCLEDAMYGTDGLEALDLSTS                              
Sbjct  86   MSLDINTEQMCLEDAMYGTDGLEALDLSTS------------------------------  115

Query  121  TSAGYPYVAMGKKKRDILNKQTRDTKEMQNLSTSAGYPYVAMGKKKRDILNKQTRDTKEM  180
              AGYPYVA+GKKKRDI+NKQTRDTKE                                M
Sbjct  116  --AGYPYVALGKKKRDIMNKQTRDTKE--------------------------------M  141

Query  181  QKLLDTYGINLPLVTYVKDELRSKNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTK  240
            Q+LLDTYGINLPLVTYVKDEL                                  RS+TK
Sbjct  142  QRLLDTYGINLPLVTYVKDEL----------------------------------RSRTK  167

Query  241  VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEE  300
            VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGV+TGSAVGCDPDLFWSKIPVLMEE
Sbjct  168  VEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVVTGSAVGCDPDLFWSKIPVLMEE  227

Query  301  KLFAFDYTGYDASLSPAWFEALKMGLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP  360
            KLFAFDYTGYDASLSPAWFEALKM LEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP
Sbjct  228  KLFAFDYTGYDASLSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMP  287

Query  361  SGCSGTSIFNSMINNSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNLIIRTLLLKTYKG  420
            SGCSGTSIFNSM                                INNLIIRTLLLKTYKG
Sbjct  288  SGCSGTSIFNSM--------------------------------INNLIIRTLLLKTYKG  315

Query  421  IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK  480
            IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVT   
Sbjct  316  IDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVT---  372

Query  481  RQSGKDYGLTMTPADKSATFETVTWENVTFLKRFFRADEKYPFLIHPVMPMKEIHESIRW  540
                                         FLKRFFRADEKYPFLIHPVMPMKEIHESIRW
Sbjct  373  -----------------------------FLKRFFRADEKYPFLIHPVMPMKEIHESIRW  403

Query  541  TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI  578
            TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI
Sbjct  404  TKDPRNTQDHVRSLCLLAWHNGEEEYNKFLAKIRSVPI  441

That library contains a good clean hit to polio, and the hit above is from "another assembly" I am guessing.

From rc/all_gene_clusters.pathracer_vs_RdRP_1234q.seqs.fa

>Score=79.0939|Bitscore=161.231|PartialBitscore=161.231|Seq=SRR1036663.NODE_1_length_6046_cluster_1_candidate_1_domains_9|Position=4995|Frameshifts=0|Alignment=29M3D12M1D9M1D21M6D39M2D15M8D56M1I7M1D12M3D28M1D10M4D13M1D46M10D16M2D21M12D10M57D5M

From: .rc/all_gene_clusters.fa.diamond_vs_rdrp0_q_d.fmt6

SRR1036663.NODE_1_length_6046_cluster_1_candidate_1_domains_9  rdrp2.Picornaviridae.Human_poliovirus_1:CAA24445.1      99.7    339     1       0       5029    6045    1       339     1.5e-195        680.2   6046    425     GTGAAGGAACCAGCAGTCCTCACTAAAAACGATCCCAGGCTTAAGACAGACTTTGAGGAGGCAATTTTCTCCAAGTACGTGGGTAACAAAATTACTGAAGTGGATGAGTACATGAAAGAGGCAGTAGACCACTATGCTGGCCAGCTCATGTCACTAGACATCAACACAGAACAAATGTGCTTGGAGGATGCCATGTATGGCACTGATGGTCTAGAAGCACTTGATTTGTCCACCAGTGCTGGCTACCCTTATGTAGCAATGGGAAAGAAGAAGAGAGACATCTTGAACAAACAAACCAGAGACACTAAGGAAATGCAAAAACTGCTCGACACATATGGAATCAACCTCCCACTGGTGACTTATGTAAAGGATGAACTTAGATCCAAAACAAAGGTTGAGCAGGGGAAATCCAGATTAATTGAAGCTTCTAGTTTGAATGACTCAGTGGCAATGAGAATGGCTTTTGGGAACCTATATGCTGCTTTTCACAAAAACCCAGGAGTGATAACAGGTTCAGCAGTGGGGTGCGATCCAGATTTGTTTTGGAGCAAAATTCCGGTATTGATGGAAGAGAAGCTGTTTGCTTTTGACTACACAGGGTATGATGCATCTCTCAGCCCTGCTTGGTTCGAGGCACTAAAGATGGTGCTTGAGAAAATCGGATTCGGAGACAGAGTTGACTACATCGACTACCTAAACCACTCACACCACCTGTACAAGAATAAAACATACTGTGTCAAGGGCGGTATGCCATCTGGCTGCTCAGGCACTTCAATTTTTAACTCAATGATTAACAACTTGATTATCAGGACACTCTTACTGAAAACCTACAAGGGCATAGATTTAGACCACCTAAAAATGATTGCCTATGGTGATGATGTAATTGCTTCCTACCCCCATGAAGTTGACGCTAGTCTCCTAGCCCAATCAGGAAAAGACTATGGACTAACTATGACTCCAGCTGACAAATCAGCTACATTTGAAACAGTCACATGGGAGAATGTAACATTCTTGAAG       VKEPAVLTKNDPRLKTDFEEAIFSKYVGNKITEVDEYMKEAVDHYAGQLMSLDINIEQMCLEDAMYGTDGLEALDLSTSAGYPYVAMGKKKRDILNKQTRDTKEMQKLLDTYGINLPLVTYVKDELRSKTKVEQGKSRLIEASSLNDSVAMRMAFGNLYAAFHKNPGVITGSAVGCDPDLFWSKIPVLMEEKLFAFDYTGYDASLSPAWFEALKMVLEKIGFGDRVDYIDYLNHSHHLYKNKTYCVKGGMPSGCSGTSIFNSMINNLIIRTLLLKTYKGIDLDHLKMIAYGDDVIASYPHEVDASLLAQSGKDYGLTMTPADKSATFETVTWENVTFLK

asl commented 3 years ago

I've uploaded PR hits to s3://serratus-public/assemblies/dicistro/analysis/ split into two groups:

all_assembly_graphs.RdRP_1234q.scaffold_seqs.fa.gz – all hits supported by scaffolds
all_assembly_graphs.RdRP_1234q.non_scaffold_seqs.fa.gz – all other hits

asl commented 3 years ago

Further uploaded:

all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.fa.gz    
all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.diamond_vs_rdrp0_r1.fmt6.gz
all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.diamond_vs_dicistro.protref.fmt6.gz

ababaian commented 3 years ago

Here is the assembly graph from @asl for poliovirus: poliovirus_exclusion

The issue is when within a single assembly-graph there are multiple paths which provide a RdRp match, each one is being returned, this is yielding indels as seen above in slippage-prone viruses like polio (or other RNA viruses).

Proposed Solution: 1. For each RdRp output graph, allow each rdrp-containing edge (red highlight above) to be used at most ONCE per reported sequences. The "top hit" will be selected by percent-identity / top-score to a known virus (thus assuming the hit is not novel). The risk is if there is a novel virus in the same library as a known virus and they share homology over an edge of say 50 amino acids, then the novel virus would be excluded as the known virus takes priority. The benefit is this will reduce intra-sample viral variants to the most conservative.

One more caveat, assume rdrp is the red region in the graph above and the viral genome is green. If a sub-graph (blue) has a higher identity match then the longest match (green), but does not contain an end-to-end RdRp domain, the longest match with variants should take priority.

rcedgar commented 3 years ago

Analysis of high-trust scaffold sequences uploaded to:

s3://serratus-public/rce/dicistro_analysis/scaffolds/

15538 NovelFamily 9746 NovelGenus_Other 2252 NovelSpecies_Other 1713 KnownSpecies 877 Novel_NoMatch 481 NovelGenus_Dicistro 94 NovelSpecies_Dicistro

rchikhi commented 3 years ago

I extracted the assembled scaffolds that contain all of Anton's centroids (all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.fa.gz from https://github.com/ababaian/serratus/issues/239#issuecomment-751520032).

Results are in: all_assembly_graphs.RdRP_1234q.scaffold_seqs.centroids.scaffold_nucls.fa in s3://serratus-public/assemblies/dicistro/analysis/

generated using: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_get_scaffolds.py note: the script failed to get a scaffold for 2% (760) of the centroids, let me know if you need to recover them all.

ababaian / serratus

Dicistro assemblies #239

Error Type 1: Low complexity sequences

Putative Error 2: Known polio, likely misassembly