ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
251 stars 33 forks source link

Coronaspades mis-assembly problems & assembly QC #177

Closed rcedgar closed 3 years ago

rcedgar commented 4 years ago
  1. This Cat A assembly is badly messed up: ERR4164903.coronaspades.NODE_3_length_32040_cluster_3_candidate_1_domains_43

I can't quite figure out what is going on; there is a ~1,500nt segment in the contig that matches Cov-2 with 100% id, this segment appears multiple times in the contig. If you search for TGGGGCCGACGTTGTTTTGATCGCGCCCCACT you will find >10 copies of this motif. Repeated traversal of loops in the assembly graph is my guess.

@taltman what does your annotation pipeline generate for this one? Is it able to filter cases like this, or do we need a different QC check?

  1. This Cat A contig is mis-assembled: ERR4181727.coronaspades.NODE_1_length_29749_cluster_1_candidate_1_domains_42, it has two long segments of Cov-2 in the wrong order.
taltman commented 4 years ago

I'll take a quick look in a minute.

Could this be a case of various splice-forms mistakenly merged into a single contig?

taltman commented 4 years ago

Check out the annotation here: s3://serratus-taltman/annotations/test/

taltman commented 4 years ago

So, in the .alc file, there's this warning:

5     discontn  yes      DISCONTINUOUS_SIMILARITY       sequence      1     1  not all hits are in the same order in the sequence and the homology model

I think this catches the major rearrangement issue that you are concerned about.

4     dupregin  yes      DUPLICATE_REGIONS              sequence      1     1  similarity to a model region occurs more than once

This indicates that it sees a duplication, but you need to check out the *.alt file to figure out how big (and numerous) are the duplications. The answer is is yes, it notes many long duplications.

taltman commented 4 years ago

The issue is that VADR outputs a lot of diagnostic information, as it's designed to assist the GB indexers as they triage incoming viral GB submissions. It marks many CoV genomes as "fail", because they are too divergent, even though a close inspection shows that it is mostly OK.

So for automating this, we'll need to spot check a bunch of assemblies and decide which fails are considered significant (like this case), and which ones can be explained away as us dealing with a divergent genome.

rcedgar commented 4 years ago

@taltman Thanks for checking this! Looks like VADR is not quite right for filtering these problems, so I made QC script serratus_assembly_minimap2_qc.py (PR 178).

asl commented 4 years ago

@rcedgar Will you please tag myself and @1dayac for the assembly problems?

asl commented 4 years ago

Both looks like real misassemblies. We will take a look.

asl commented 4 years ago

ERR4164903 should be ignored. It's ONT data and likely contain hairpin turn which was not removed during adapter clipping.

asl commented 4 years ago

The table from @rchikhi:

worst alignment scores
    Assembly  Length  BestAlignmentScore BestReference
 SRR10873981   26508               11004   NC_011550.1
 SRR10873977   26576               11018   NC_011550.1
  ERR4181727   29749               30480    MT374101.1
  SRR7623977   27342               33322    LC260045.1
  SRR7239363   26091               37880    MT025058.1
 SRR11771957   29877               38720   NC_045512.2
  ERR4182199   30660               42274   NC_045512.2

The table from @rcedgar:

SRR9214144      partial_loclip  NC_001846       92.2
SRR1522982      partial_clipped EU186072        92.1
SRR10834677     partial_loclip  NC_001846       92.0
SRR7239363      partial_loclip  LC364344        91.9
SRR7716793      partial_loclip  NC_001846       91.8
SRR7716792      partial_loclip  NC_001846       91.7
SRR5040897      partial_loclip  NC_009021       91.0
SRR10873981     misassembly     NC_011550       82.2
SRR10873977     misassembly     NC_011550       81.9
SRR5384553      misassembly     JN874562        80.8
asl commented 4 years ago

SRR10873981 is 'brambling27'. @rcedgar analyzed is previously. Likely novel species. SRR10873977 is 'brambling28'. aligns to FJ647223.1 with 92% IDY and 99% coverage. Looks like no structural differences / split alignments. ERR4181727 looks like a real misassembly (split alignment) SRR7623977 might be a misassembly (split alignment) SRR7239363 aligns to LC364344.1 with 92% IDY and 100% coverage. Looks like no structural differences / split alignments SRR11771957 mighr be a misassembly (split alignment) ERR4182199 is ONT

rchikhi commented 4 years ago

based on SRA metadata, those 71 catA's are not Illumina nor BGISEQ: https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA.tech.exclusion.txt

rcedgar commented 4 years ago

Added @asl, tried to assign @1dayac also but not in list of contributors and don't see how to add.

rchikhi commented 4 years ago

I ran blastn (megablast) over all catA-v3 assemblies. For each assembly, I selected 1 reference: the one with the lowest e-value (-> disputable choice). I flagged assemblies where all Blast alignments to the reference either collectively:

The idea was to cast a wide net and then we can further refine criteria to see if these assemblies are misassembled or novel.

That QC procedure flagged 114 datasets (out of 2.6k). It reported all of the datasets analyzed above (https://github.com/ababaian/serratus/issues/177#issuecomment-650822900):

In Blast fmt6 format:

qseqid                                                                          sseqid          pident length mismatch gapopen qstart   qend  sstart send evalue bitscore
ERR4181727.coronaspades.NODE_1_length_29749_cluster_1_candidate_1_domains_42    MT428554.1      99.993  15246   1       0       1       15246   15247   2       0.0     28149
ERR4181727.coronaspades.NODE_1_length_29749_cluster_1_candidate_1_domains_42    MT428554.1      99.993  14507   1       0       15243   29749   29664   15158   0.0     26784
SRR7623977.coronaspades.NODE_1_length_27342_cluster_1_candidate_1_domains_26    KY364365.1      99.737  17094   45      0       10245   27338   8329    25422   0.0     31318
SRR7623977.coronaspades.NODE_1_length_27342_cluster_1_candidate_1_domains_26    KY364365.1      99.843  10217   16      0       1       10217   19      10235   0.0     18779

SRR7239363 : flagged as it is weird. The best reference has only 1 alignment which covers 72% of the assembly at 100% identity.

SRR7239363.coronaspades.NODE_1_length_26091_cluster_1_candidate_1_domains_24    MT025058.1      100.000 18940   0       0       1       18940   5       18944   0.0     34976

Yet, the second best Blast hit covers much more of the assembly, but at lower IDY.

SRR7239363.coronaspades.NODE_1_length_26091_cluster_1_candidate_1_domains_24    LC364344.1      93.441  19530   1276    5       1       19527   20      19547   0.0     28963
SRR7239363.coronaspades.NODE_1_length_26091_cluster_1_candidate_1_domains_24    LC364344.1      89.865  6147    543     42      19998   26091   20021   26140   0.0     7827
SRR11771957.coronaspades.NODE_1_length_29877_cluster_1_candidate_1_domains_46   MT482127.1      100.000 19372   0       0       10506   29877   3       19374   0.0     35774
SRR11771957.coronaspades.NODE_1_length_29877_cluster_1_candidate_1_domains_46   MT482127.1      100.000 10498   0       0       1       10498   19362   29859   0.0     19387

ERR4182199: flagged, ONT, would have been a fine assembly except that it has a split alignment..

ERR4182199.coronaspades.NODE_1_length_30660_cluster_1_candidate_1_domains_44    MT627426.1      99.986  21157   1       2       9224    30379   10      21165   0.0     39052
ERR4182199.coronaspades.NODE_1_length_30660_cluster_1_candidate_1_domains_44    MT627426.1      99.918  8554    4       3       444     8994    29847   21294   0.0     15754

Full results, and explanation of the table:

Assembly: accession name AssemblyLen: length of assembly Tech: sequencing technology BestRef : identifier of reference with lowest e-value hit BestRefLen : length of BestRef LongestHit: longest hit to BestRef QueryCov : percentage of assembly covered by Blast hits to BestRef RefCov : percentage of BestRef covered by Balst hits QueryDup : number of assembly bases which are covered by >= 2 hits RefDup : same but reference bases IDYmean : mean identity over all the hits (where 1 hit = 1 datapoint. NOT 1 basepair = 1 datapoint) DYstdev : corresponding stdev

Assembly AssemblyLen Tech BestRef BestRefLen LongestHit QueryCov RefCov QueryDup RefDup IDYmean IDYstdev
0 ERR1301465 31358 ILLUMINA MH687969.1 31352 31234 0.998246 0.998979 394 357 98.4641 4.33473
1 SRR11140744 30087 ILLUMINA MT039887.1 29879 29867 0.999867 0.999565 5 222 100 0
2 SRR2146117 26537 ILLUMINA AY597011.2 29926 26537 0.999962 0.899519 1508 1125 94.4134 2.47587
3 SRR11954068 28303 ILLUMINA MT520382.1 29866 28303 0.999965 0.947633 0 0 100 0
4 SRR5872115 25054 ILLUMINA KY967357.1 27242 25054 0.99996 0.919646 0 0 100 0
5 SRR1191668 33011 ILLUMINA JX869059.2 30119 30098 0.911726 0.99927 0 0 99.987 0
6 SRR11777734 28026 ILLUMINA MT079851.1 30018 28016 0.999608 0.937071 113 0 99.563 0.612354
7 SRR1193924 31708 ILLUMINA JX869059.2 30119 30097 0.949161 0.999236 0 0 99.987 0
8 SRR5871990 26501 ILLUMINA KY674943.1 29908 26501 0.999962 0.890665 183 45 100 0
9 SRR11597202 25232 ILLUMINA MT499184.1 29893 25231 0.999921 0.84401 0 0 99.984 0
10 SRR11593364 29998 ILLUMINA MT079852.1 29891 29891 0.999633 0.999967 161 258 99.7308 0.529869
11 ERR963004 32325 ILLUMINA MK039552.1 30123 29926 0.929899 0.993427 0 134 99.623 0.514774
12 SRR11577861 25220 ILLUMINA MT628081.1 29872 25220 0.99996 0.844235 0 0 99.996 0
13 SRR10742607 25497 ILLUMINA MN566147.1 27631 25470 0.998863 0.921755 0 0 99.996 0
14 SRR11777736 28045 ILLUMINA MT079851.1 30018 28020 0.999073 0.937204 113 0 99.5615 0.610233
15 SRR10357373 32359 ILLUMINA JX869059.2 30119 30127 0.930869 0.999934 0 0 99.924 0
16 SRR11597205 25229 ILLUMINA MT270112.1 29858 25229 0.99996 0.844933 0 0 100 0
17 SRR11939545 26901 ILLUMINA MT293186.1 29865 26824 0.996989 0.898142 0 0 99.981 0
18 SRR10357367 26054 ILLUMINA JX869059.2 30119 26030 0.999962 0.863907 6 34 98.531 1.96293
19 SRR1190504 31944 ILLUMINA JX869059.2 30119 30099 0.942211 0.999303 0 0 99.987 0
20 ERR4164784 51507 OXFORD_NANOPORE MT499184.1 29893 27778 0.997068 0.99709 0 21524 99.0312 0.769228
21 ERR1301828 28501 ILLUMINA MH687944.1 28219 28227 0.998281 0.999787 30 272 97.806 3.06186
22 SRR11578300 28002 ILLUMINA MT419814.1 29743 27975 0.999 0.940524 0 0 99.996 0
23 SRR11826862 25150 ILLUMINA MT520419.1 29833 25150 0.99996 0.842993 0 0 99.992 0
24 ERR2819915 29054 ILLUMINA AB354579.1 31038 28995 0.9979 0.934113 0 0 99.672 0
25 ERR1301554 31380 ILLUMINA MH687969.1 31352 31232 0.999267 0.998916 440 461 98.3753 4.12254
26 ERR4164855 31837 OXFORD_NANOPORE MT627426.1 29903 29520 0.999183 0.991941 4 2259 94.629 4.33393
27 SRR11771957 29877 ILLUMINA MT482127.1 29901 19372 0.999699 0.998495 0 12 100 0
28 SRR1030058 32413 ILLUMINA FJ882957.1 29720 29713 0.918119 0.999731 0 47 99.9965 0.00494975
29 SRR11772243 30990 ILLUMINA MT450890.1 29865 22387 0.999839 0.998493 0 1165 100 0
30 ERR1424897 26516 ILLUMINA KF686344.1 29695 26486 0.998831 0.899377 337 115 98.5174 0.889548
31 ERR4164797 30903 OXFORD_NANOPORE MT627422.1 29871 29808 0.999612 0.998761 0 1057 97.4345 3.51361
32 SRR6713714 28120 ILLUMINA KU982977.1 28003 27982 0.999858 0.999214 62 201 95.1403 8.35232
33 ERR4182199 30660 OXFORD_NANOPORE MT627426.1 29903 21157 0.988193 0.993479 1 593 99.1502 1.12096
34 SRR6713776 28113 ILLUMINA MK071627.1 28017 27979 0.999964 0.998608 63 201 95.145 8.35633
35 SRR1195790 32562 ILLUMINA AY278741.1 29727 29702 0.912137 0.999125 0 0 99.997 0
36 ERR4181727 29749 ILLUMINA MT428554.1 29903 15246 0.999966 0.991941 3 89 99.993 0
37 SRR6713741 28070 ILLUMINA KJ645665.1 27998 27959 0.999964 0.998214 153 287 97.255 3.32014
38 SRR1030193 31374 ILLUMINA FJ882957.1 29720 29720 0.947249 0.999966 0 0 100 0
39 SRR1195366 32879 ILLUMINA AY278741.1 29727 29701 0.903312 0.999092 0 0 99.997 0
40 SRR11494499 25216 ILLUMINA MT520215.1 29849 25216 0.99996 0.844752 0 0 99.992 0
41 SRR11578190 25234 ILLUMINA MT470114.1 29903 25217 0.999287 0.84326 0 0 100 0
42 ERR4182183 27534 OXFORD_NANOPORE MT628262.1 29900 27293 0.999891 0.912676 0 251 97.045 4.10688
43 SRR11777735 28047 ILLUMINA MT079851.1 30018 28016 0.998859 0.937071 113 0 99.563 0.612354
44 ERR3179284 29326 ILLUMINA MH687972.1 31274 29326 0.999966 0.93768 0 0 99.997 0
45 SRR11578204 25216 ILLUMINA MT628262.1 29900 25216 0.99996 0.843311 0 0 99.992 0
46 ERR4164796 30423 OXFORD_NANOPORE MT627426.1 29903 29820 0.994379 0.997191 1 436 98.022 1.71026
47 ERR4182432 30345 OXFORD_NANOPORE MT520546.1 29866 29838 0.997034 0.999029 0 457 95.6105 6.11718
48 SRR1193955 34150 ILLUMINA JX869059.2 30119 30099 0.881347 0.999303 0 0 99.987 0
49 SRR11494528 25213 ILLUMINA MT628262.1 29900 25213 0.99996 0.843211 0 0 99.992 0
50 ERR4182439 30298 OXFORD_NANOPORE MT535497.1 29850 29692 0.999967 0.99464 6 625 97.9965 2.42265
51 SRR9430938 31509 CAPILLARY FJ425188.1 30995 30995 0.990162 0.999968 0 207 96.644 4.72772
52 ERR4181768 26418 ILLUMINA MT325599.1 29882 26412 0.999735 0.88381 0 0 99.996 0
53 SRR6713767 28129 ILLUMINA KF650373.1 28038 27988 0.999964 0.998181 62 207 95.074 8.37754
54 SRR11494457 25301 ILLUMINA MT451603.1 29816 25206 0.998933 0.845351 2 71 99.998 0.00282843
55 ERR4182443 30141 OXFORD_NANOPORE MT627426.1 29903 24131 0.999967 0.998796 1 291 99.9065 0.0148492
56 SRR11859139 26806 ILLUMINA MT614525.1 29782 26752 0.999963 0.899201 7 32 99.9945 0.00777817
57 SRR11550039 32184 ILLUMINA FJ429166.1 29747 29709 0.923067 0.998689 0 0 99.997 0
58 SRR6713905 28123 ILLUMINA KR265814.1 27991 27986 0.999893 0.999786 62 201 94.8763 8.78454
59 SRR10873977 26576 ILLUMINA FJ376622.1 26552 14581 0.602386 0.602215 31 61 85.3877 8.58582
60 SRR11593359 30105 ILLUMINA MT358641.1 29903 29861 0.9999 0.998562 0 242 99.999 0.00173205
61 SRR11494750 25220 ILLUMINA MT628262.1 29900 25220 0.99996 0.843445 0 0 99.984 0
62 SRR11593358 25158 ILLUMINA MT499184.1 29893 25036 0.999921 0.837487 0 121 99.994 0.00848528
63 SRR8369129 26534 ILLUMINA DQ415911.1 29902 26528 0.999736 0.898602 1616 1306 95.1598 1.75425
64 SRR11140746 30137 ILLUMINA MT039887.1 29879 29875 0.99917 0.999833 3 241 100 0
65 SRR8189298 25031 ILLUMINA MK204411.1 28482 23419 0.998562 0.885507 331 90 94.6004 3.77351
66 SRR11857972 27110 ILLUMINA MT628262.1 29900 27110 0.999963 0.906656 0 0 100 0
67 SRR1190467 32181 ILLUMINA JX869059.2 30119 30101 0.936795 0.999369 0 47 99.9915 0.0120208
68 ERR4164805 26748 OXFORD_NANOPORE MT627422.1 29871 26582 0.999963 0.889425 0 178 99.344 0.634982
69 SRR1190466 33133 ILLUMINA JX869059.2 30119 30098 0.908369 0.99927 0 0 99.987 0
70 ERR3013343 28355 ILLUMINA LT898439.1 28030 28021 0.999612 0.999643 6 330 98.5505 1.67409
71 SRR6713902 28120 ILLUMINA KR265814.1 27991 27986 0.999964 0.999786 63 201 94.4063 8.40116
72 SRR11140748 30108 ILLUMINA MT039887.1 29879 29872 0.999934 0.999732 10 245 99.7333 0.46188
73 ERR3179763 28445 ILLUMINA MH687966.1 28389 28378 0.997434 0.999613 304 303 97.2037 4.79498
74 SRR7239363 26091 ILLUMINA MT025058.1 18944 18940 0.725882 0.999736 0 0 100 0
75 SRR6425313 26182 ILLUMINA KY994645.1 30684 26181 0.999924 0.853181 0 0 99.939 0
76 SRR5872097 26518 ILLUMINA KY674943.1 29908 26516 0.999887 0.891166 183 45 100 0
77 ERR4182427 30352 OXFORD_NANOPORE MT374105.1 29900 29863 0.999934 0.998696 1 502 98.2703 1.49561
78 SRR7623977 27342 ILLUMINA KY364365.1 25422 17094 0.998793 0.999253 0 1906 99.79 0.0749533
79 SRR6713880 28097 ILLUMINA KJ645704.1 27989 27964 0.999964 0.999071 69 206 94.7037 7.9971
80 ERR4181748 30021 ILLUMINA MT079845.1 29955 29937 0.9999 0.999967 273 333 97.1092 3.34857
81 SRR11494725 25279 ILLUMINA MT520521.1 29872 25217 0.99996 0.844403 3 58 99.996 0.00565685
82 SRR7623978 25432 ILLUMINA KY364365.1 25422 15970 0.999961 0.997246 3 82 99.784 0.066468
83 SRR10873981 26508 ILLUMINA FJ376622.1 26552 14582 0.602799 0.602215 0 0 83.2568 7.64135
84 ERR963027 28162 ILLUMINA KJ813439.1 30123 28042 0.998225 0.930883 0 71 98.5985 1.94666
85 SRR11494532 25237 ILLUMINA MT627426.1 29903 25217 0.999168 0.84326 0 0 99.996 0
86 SRR6713893 28113 ILLUMINA KJ645691.1 27998 27982 0.999964 0.999393 66 201 95.1093 8.32581
87 SRR10915173 38489 ILLUMINA KJ473822.1 30247 30256 0.786069 0.999967 0 0 98.38 0
88 SRR1192073 32442 ILLUMINA JX869059.2 30119 30098 0.927717 0.99927 0 0 99.987 0
89 SRR11393704 30190 ILLUMINA MT079851.1 30018 29884 0.998112 0.9993 115 252 99.3942 0.750928
90 SRR11621903 25821 ILLUMINA MT520215.1 29849 25824 0.999961 0.865088 0 0 99.969 0
91 SRR6893872 25471 ILLUMINA KY364365.1 25422 16001 0.999961 0.998781 3 82 99.781 0.0707107
92 SRR10829954 25269 ILLUMINA KY420075.1 27967 25212 0.997151 0.901455 0 0 99.917 0
93 SRR11621817 29932 ILLUMINA MT627434.1 29882 15955 0.999365 0.99585 8 163 99.9934 0.00920869
94 SRR11578301 27986 ILLUMINA MT628262.1 29900 27986 0.999964 0.935953 0 0 99.986 0
95 SRR11826984 25450 ILLUMINA MT520215.1 29849 25431 0.999214 0.851955 0 0 99.996 0
96 ERR1301491 28279 ILLUMINA MH687944.1 28219 28067 0.999965 0.994578 5 217 99.6527 0.601599
97 ERR2738384 28445 ILLUMINA MH687966.1 28389 28378 0.997434 0.999613 304 303 97.2037 4.79498
98 SRR11857956 26400 ILLUMINA MT558673.1 29817 26400 0.999962 0.885367 0 0 100 0
99 SRR11622083 26308 ILLUMINA MT627426.1 29903 26287 0.999164 0.878975 0 0 99.992 0
100 SRR11968904 28295 BGISEQ MT568634.1 29861 28295 0.999965 0.947524 0 0 100 0
101 ERR4182437 30333 OXFORD_NANOPORE MT627422.1 29871 29809 0.999967 0.997891 14 546 97.0783 3.43648
102 SRR11597213 30136 ILLUMINA MT499184.1 29893 29835 0.997379 0.998026 0 223 97.0965 4.10193
103 SRR1033849 25183 ION_TORRENT EU418976.1 27636 25068 0.999563 0.904653 8 122 97.461 2.15809
104 SRR6713743 28153 ILLUMINA KJ645704.1 27989 26015 0.998899 0.995069 40 313 98.7857 1.24156
105 SRR11939551 29914 ILLUMINA MT628262.1 29900 29683 0.999933 0.992709 1 231 100 0
106 SRR11597148 25229 ILLUMINA MT259230.1 29866 25229 0.99996 0.844706 0 0 99.996 0
107 ERR4182440 26986 OXFORD_NANOPORE MT627426.1 29903 26881 0.999963 0.89884 4 125 99.9535 0.0657609
108 SRR12003638 30104 ILLUMINA MT535488.1 29903 29854 0.999801 0.998294 14 259 99.763 0.465374
109 SRR11916011 26805 ILLUMINA MT627426.1 29903 26716 0.999217 0.893389 6 74 99.329 0.912168
110 ERR4164903 32040 OXFORD_NANOPORE MT520274.1 29749 1479 0.987797 0.0496823 0 30168 99.6514 1.10076
111 SRR11494506 25214 ILLUMINA MT628262.1 29900 25210 0.999802 0.84311 0 0 99.996 0
112 SRR11140749 31251 OXFORD_NANOPORE MT121215.1 29945 29911 0.995648 0.998831 8 1206 97.5272 1.59119
113 SRR1191915 32532 ILLUMINA JX869059.2 30119 30094 0.925028 0.999137 0 0 99.987 0

The table above: https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v3.blastn-QC.txt Complete BLAST results for all catA-v3"s: https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v3.megablast-nt-fmt6.gz Script used to generate all those results: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/stats/megablast_alignment_stats.ipynb

rchikhi commented 4 years ago

Also I had a look at one of those flagged by @rcedgar : SRR7716793 partial_loclip NC_001846 91.8:

I see a 91.8% IDY near 100% coverage alignment:

qseqid                                                                          sseqid          pident length mismatch gapopen qstart   qend  sstart send evalue bitscore
SRR7716793.coronaspades.NODE_1_length_31483_cluster_1_candidate_1_domains_34    MF416379.1      91.849  31467   2429    97      18      31411   11      31414   0.0     43809

So, this one is probably fine?

rcedgar commented 4 years ago

That looks good to me; if there is one long alignment covering most of the query contig and most of a full-length genome, then it has to be a pretty good assembly. 92% id is unusually low, borderline novel species, though "species" is a pretty arbitrary concept with viruses.

asl commented 4 years ago

I believe the majority of the issues were addressed in the last coronaSPAdes snapshot.