Closed rcedgar closed 3 years ago
I'll take a quick look in a minute.
Could this be a case of various splice-forms mistakenly merged into a single contig?
Check out the annotation here: s3://serratus-taltman/annotations/test/
So, in the .alc file, there's this warning:
5 discontn yes DISCONTINUOUS_SIMILARITY sequence 1 1 not all hits are in the same order in the sequence and the homology model
I think this catches the major rearrangement issue that you are concerned about.
4 dupregin yes DUPLICATE_REGIONS sequence 1 1 similarity to a model region occurs more than once
This indicates that it sees a duplication, but you need to check out the *.alt file to figure out how big (and numerous) are the duplications. The answer is is yes, it notes many long duplications.
The issue is that VADR outputs a lot of diagnostic information, as it's designed to assist the GB indexers as they triage incoming viral GB submissions. It marks many CoV genomes as "fail", because they are too divergent, even though a close inspection shows that it is mostly OK.
So for automating this, we'll need to spot check a bunch of assemblies and decide which fails are considered significant (like this case), and which ones can be explained away as us dealing with a divergent genome.
@taltman Thanks for checking this! Looks like VADR is not quite right for filtering these problems, so I made QC script serratus_assembly_minimap2_qc.py (PR 178).
@rcedgar Will you please tag myself and @1dayac for the assembly problems?
Both looks like real misassemblies. We will take a look.
ERR4164903 should be ignored. It's ONT data and likely contain hairpin turn which was not removed during adapter clipping.
The table from @rchikhi:
worst alignment scores
Assembly Length BestAlignmentScore BestReference
SRR10873981 26508 11004 NC_011550.1
SRR10873977 26576 11018 NC_011550.1
ERR4181727 29749 30480 MT374101.1
SRR7623977 27342 33322 LC260045.1
SRR7239363 26091 37880 MT025058.1
SRR11771957 29877 38720 NC_045512.2
ERR4182199 30660 42274 NC_045512.2
The table from @rcedgar:
SRR9214144 partial_loclip NC_001846 92.2
SRR1522982 partial_clipped EU186072 92.1
SRR10834677 partial_loclip NC_001846 92.0
SRR7239363 partial_loclip LC364344 91.9
SRR7716793 partial_loclip NC_001846 91.8
SRR7716792 partial_loclip NC_001846 91.7
SRR5040897 partial_loclip NC_009021 91.0
SRR10873981 misassembly NC_011550 82.2
SRR10873977 misassembly NC_011550 81.9
SRR5384553 misassembly JN874562 80.8
SRR10873981 is 'brambling27'. @rcedgar analyzed is previously. Likely novel species. SRR10873977 is 'brambling28'. aligns to FJ647223.1 with 92% IDY and 99% coverage. Looks like no structural differences / split alignments. ERR4181727 looks like a real misassembly (split alignment) SRR7623977 might be a misassembly (split alignment) SRR7239363 aligns to LC364344.1 with 92% IDY and 100% coverage. Looks like no structural differences / split alignments SRR11771957 mighr be a misassembly (split alignment) ERR4182199 is ONT
based on SRA metadata, those 71 catA's are not Illumina nor BGISEQ: https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA.tech.exclusion.txt
Added @asl, tried to assign @1dayac also but not in list of contributors and don't see how to add.
I ran blastn (megablast) over all catA-v3 assemblies. For each assembly, I selected 1 reference: the one with the lowest e-value (-> disputable choice). I flagged assemblies where all Blast alignments to the reference either collectively:
The idea was to cast a wide net and then we can further refine criteria to see if these assemblies are misassembled or novel.
That QC procedure flagged 114 datasets (out of 2.6k). It reported all of the datasets analyzed above (https://github.com/ababaian/serratus/issues/177#issuecomment-650822900):
SRR10873981: previously 'likely novel species', flagged by this QC. 60% of BestRef covered, 83% IDY: maybe Megablast didn't catch lower-identity parts.
SRR10873977 : exactly same as SRR10873981
ERR4181727: flagged due to short longest hit (split alignment):
In Blast fmt6 format:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
ERR4181727.coronaspades.NODE_1_length_29749_cluster_1_candidate_1_domains_42 MT428554.1 99.993 15246 1 0 1 15246 15247 2 0.0 28149
ERR4181727.coronaspades.NODE_1_length_29749_cluster_1_candidate_1_domains_42 MT428554.1 99.993 14507 1 0 15243 29749 29664 15158 0.0 26784
SRR7623977.coronaspades.NODE_1_length_27342_cluster_1_candidate_1_domains_26 KY364365.1 99.737 17094 45 0 10245 27338 8329 25422 0.0 31318
SRR7623977.coronaspades.NODE_1_length_27342_cluster_1_candidate_1_domains_26 KY364365.1 99.843 10217 16 0 1 10217 19 10235 0.0 18779
SRR7239363 : flagged as it is weird. The best reference has only 1 alignment which covers 72% of the assembly at 100% identity.
SRR7239363.coronaspades.NODE_1_length_26091_cluster_1_candidate_1_domains_24 MT025058.1 100.000 18940 0 0 1 18940 5 18944 0.0 34976
Yet, the second best Blast hit covers much more of the assembly, but at lower IDY.
SRR7239363.coronaspades.NODE_1_length_26091_cluster_1_candidate_1_domains_24 LC364344.1 93.441 19530 1276 5 1 19527 20 19547 0.0 28963
SRR7239363.coronaspades.NODE_1_length_26091_cluster_1_candidate_1_domains_24 LC364344.1 89.865 6147 543 42 19998 26091 20021 26140 0.0 7827
SRR11771957.coronaspades.NODE_1_length_29877_cluster_1_candidate_1_domains_46 MT482127.1 100.000 19372 0 0 10506 29877 3 19374 0.0 35774
SRR11771957.coronaspades.NODE_1_length_29877_cluster_1_candidate_1_domains_46 MT482127.1 100.000 10498 0 0 1 10498 19362 29859 0.0 19387
ERR4182199: flagged, ONT, would have been a fine assembly except that it has a split alignment..
ERR4182199.coronaspades.NODE_1_length_30660_cluster_1_candidate_1_domains_44 MT627426.1 99.986 21157 1 2 9224 30379 10 21165 0.0 39052
ERR4182199.coronaspades.NODE_1_length_30660_cluster_1_candidate_1_domains_44 MT627426.1 99.918 8554 4 3 444 8994 29847 21294 0.0 15754
Full results, and explanation of the table:
Assembly: accession name AssemblyLen: length of assembly Tech: sequencing technology BestRef : identifier of reference with lowest e-value hit BestRefLen : length of BestRef LongestHit: longest hit to BestRef QueryCov : percentage of assembly covered by Blast hits to BestRef RefCov : percentage of BestRef covered by Balst hits QueryDup : number of assembly bases which are covered by >= 2 hits RefDup : same but reference bases IDYmean : mean identity over all the hits (where 1 hit = 1 datapoint. NOT 1 basepair = 1 datapoint) DYstdev : corresponding stdev
Assembly | AssemblyLen | Tech | BestRef | BestRefLen | LongestHit | QueryCov | RefCov | QueryDup | RefDup | IDYmean | IDYstdev | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ERR1301465 | 31358 | ILLUMINA | MH687969.1 | 31352 | 31234 | 0.998246 | 0.998979 | 394 | 357 | 98.4641 | 4.33473 |
1 | SRR11140744 | 30087 | ILLUMINA | MT039887.1 | 29879 | 29867 | 0.999867 | 0.999565 | 5 | 222 | 100 | 0 |
2 | SRR2146117 | 26537 | ILLUMINA | AY597011.2 | 29926 | 26537 | 0.999962 | 0.899519 | 1508 | 1125 | 94.4134 | 2.47587 |
3 | SRR11954068 | 28303 | ILLUMINA | MT520382.1 | 29866 | 28303 | 0.999965 | 0.947633 | 0 | 0 | 100 | 0 |
4 | SRR5872115 | 25054 | ILLUMINA | KY967357.1 | 27242 | 25054 | 0.99996 | 0.919646 | 0 | 0 | 100 | 0 |
5 | SRR1191668 | 33011 | ILLUMINA | JX869059.2 | 30119 | 30098 | 0.911726 | 0.99927 | 0 | 0 | 99.987 | 0 |
6 | SRR11777734 | 28026 | ILLUMINA | MT079851.1 | 30018 | 28016 | 0.999608 | 0.937071 | 113 | 0 | 99.563 | 0.612354 |
7 | SRR1193924 | 31708 | ILLUMINA | JX869059.2 | 30119 | 30097 | 0.949161 | 0.999236 | 0 | 0 | 99.987 | 0 |
8 | SRR5871990 | 26501 | ILLUMINA | KY674943.1 | 29908 | 26501 | 0.999962 | 0.890665 | 183 | 45 | 100 | 0 |
9 | SRR11597202 | 25232 | ILLUMINA | MT499184.1 | 29893 | 25231 | 0.999921 | 0.84401 | 0 | 0 | 99.984 | 0 |
10 | SRR11593364 | 29998 | ILLUMINA | MT079852.1 | 29891 | 29891 | 0.999633 | 0.999967 | 161 | 258 | 99.7308 | 0.529869 |
11 | ERR963004 | 32325 | ILLUMINA | MK039552.1 | 30123 | 29926 | 0.929899 | 0.993427 | 0 | 134 | 99.623 | 0.514774 |
12 | SRR11577861 | 25220 | ILLUMINA | MT628081.1 | 29872 | 25220 | 0.99996 | 0.844235 | 0 | 0 | 99.996 | 0 |
13 | SRR10742607 | 25497 | ILLUMINA | MN566147.1 | 27631 | 25470 | 0.998863 | 0.921755 | 0 | 0 | 99.996 | 0 |
14 | SRR11777736 | 28045 | ILLUMINA | MT079851.1 | 30018 | 28020 | 0.999073 | 0.937204 | 113 | 0 | 99.5615 | 0.610233 |
15 | SRR10357373 | 32359 | ILLUMINA | JX869059.2 | 30119 | 30127 | 0.930869 | 0.999934 | 0 | 0 | 99.924 | 0 |
16 | SRR11597205 | 25229 | ILLUMINA | MT270112.1 | 29858 | 25229 | 0.99996 | 0.844933 | 0 | 0 | 100 | 0 |
17 | SRR11939545 | 26901 | ILLUMINA | MT293186.1 | 29865 | 26824 | 0.996989 | 0.898142 | 0 | 0 | 99.981 | 0 |
18 | SRR10357367 | 26054 | ILLUMINA | JX869059.2 | 30119 | 26030 | 0.999962 | 0.863907 | 6 | 34 | 98.531 | 1.96293 |
19 | SRR1190504 | 31944 | ILLUMINA | JX869059.2 | 30119 | 30099 | 0.942211 | 0.999303 | 0 | 0 | 99.987 | 0 |
20 | ERR4164784 | 51507 | OXFORD_NANOPORE | MT499184.1 | 29893 | 27778 | 0.997068 | 0.99709 | 0 | 21524 | 99.0312 | 0.769228 |
21 | ERR1301828 | 28501 | ILLUMINA | MH687944.1 | 28219 | 28227 | 0.998281 | 0.999787 | 30 | 272 | 97.806 | 3.06186 |
22 | SRR11578300 | 28002 | ILLUMINA | MT419814.1 | 29743 | 27975 | 0.999 | 0.940524 | 0 | 0 | 99.996 | 0 |
23 | SRR11826862 | 25150 | ILLUMINA | MT520419.1 | 29833 | 25150 | 0.99996 | 0.842993 | 0 | 0 | 99.992 | 0 |
24 | ERR2819915 | 29054 | ILLUMINA | AB354579.1 | 31038 | 28995 | 0.9979 | 0.934113 | 0 | 0 | 99.672 | 0 |
25 | ERR1301554 | 31380 | ILLUMINA | MH687969.1 | 31352 | 31232 | 0.999267 | 0.998916 | 440 | 461 | 98.3753 | 4.12254 |
26 | ERR4164855 | 31837 | OXFORD_NANOPORE | MT627426.1 | 29903 | 29520 | 0.999183 | 0.991941 | 4 | 2259 | 94.629 | 4.33393 |
27 | SRR11771957 | 29877 | ILLUMINA | MT482127.1 | 29901 | 19372 | 0.999699 | 0.998495 | 0 | 12 | 100 | 0 |
28 | SRR1030058 | 32413 | ILLUMINA | FJ882957.1 | 29720 | 29713 | 0.918119 | 0.999731 | 0 | 47 | 99.9965 | 0.00494975 |
29 | SRR11772243 | 30990 | ILLUMINA | MT450890.1 | 29865 | 22387 | 0.999839 | 0.998493 | 0 | 1165 | 100 | 0 |
30 | ERR1424897 | 26516 | ILLUMINA | KF686344.1 | 29695 | 26486 | 0.998831 | 0.899377 | 337 | 115 | 98.5174 | 0.889548 |
31 | ERR4164797 | 30903 | OXFORD_NANOPORE | MT627422.1 | 29871 | 29808 | 0.999612 | 0.998761 | 0 | 1057 | 97.4345 | 3.51361 |
32 | SRR6713714 | 28120 | ILLUMINA | KU982977.1 | 28003 | 27982 | 0.999858 | 0.999214 | 62 | 201 | 95.1403 | 8.35232 |
33 | ERR4182199 | 30660 | OXFORD_NANOPORE | MT627426.1 | 29903 | 21157 | 0.988193 | 0.993479 | 1 | 593 | 99.1502 | 1.12096 |
34 | SRR6713776 | 28113 | ILLUMINA | MK071627.1 | 28017 | 27979 | 0.999964 | 0.998608 | 63 | 201 | 95.145 | 8.35633 |
35 | SRR1195790 | 32562 | ILLUMINA | AY278741.1 | 29727 | 29702 | 0.912137 | 0.999125 | 0 | 0 | 99.997 | 0 |
36 | ERR4181727 | 29749 | ILLUMINA | MT428554.1 | 29903 | 15246 | 0.999966 | 0.991941 | 3 | 89 | 99.993 | 0 |
37 | SRR6713741 | 28070 | ILLUMINA | KJ645665.1 | 27998 | 27959 | 0.999964 | 0.998214 | 153 | 287 | 97.255 | 3.32014 |
38 | SRR1030193 | 31374 | ILLUMINA | FJ882957.1 | 29720 | 29720 | 0.947249 | 0.999966 | 0 | 0 | 100 | 0 |
39 | SRR1195366 | 32879 | ILLUMINA | AY278741.1 | 29727 | 29701 | 0.903312 | 0.999092 | 0 | 0 | 99.997 | 0 |
40 | SRR11494499 | 25216 | ILLUMINA | MT520215.1 | 29849 | 25216 | 0.99996 | 0.844752 | 0 | 0 | 99.992 | 0 |
41 | SRR11578190 | 25234 | ILLUMINA | MT470114.1 | 29903 | 25217 | 0.999287 | 0.84326 | 0 | 0 | 100 | 0 |
42 | ERR4182183 | 27534 | OXFORD_NANOPORE | MT628262.1 | 29900 | 27293 | 0.999891 | 0.912676 | 0 | 251 | 97.045 | 4.10688 |
43 | SRR11777735 | 28047 | ILLUMINA | MT079851.1 | 30018 | 28016 | 0.998859 | 0.937071 | 113 | 0 | 99.563 | 0.612354 |
44 | ERR3179284 | 29326 | ILLUMINA | MH687972.1 | 31274 | 29326 | 0.999966 | 0.93768 | 0 | 0 | 99.997 | 0 |
45 | SRR11578204 | 25216 | ILLUMINA | MT628262.1 | 29900 | 25216 | 0.99996 | 0.843311 | 0 | 0 | 99.992 | 0 |
46 | ERR4164796 | 30423 | OXFORD_NANOPORE | MT627426.1 | 29903 | 29820 | 0.994379 | 0.997191 | 1 | 436 | 98.022 | 1.71026 |
47 | ERR4182432 | 30345 | OXFORD_NANOPORE | MT520546.1 | 29866 | 29838 | 0.997034 | 0.999029 | 0 | 457 | 95.6105 | 6.11718 |
48 | SRR1193955 | 34150 | ILLUMINA | JX869059.2 | 30119 | 30099 | 0.881347 | 0.999303 | 0 | 0 | 99.987 | 0 |
49 | SRR11494528 | 25213 | ILLUMINA | MT628262.1 | 29900 | 25213 | 0.99996 | 0.843211 | 0 | 0 | 99.992 | 0 |
50 | ERR4182439 | 30298 | OXFORD_NANOPORE | MT535497.1 | 29850 | 29692 | 0.999967 | 0.99464 | 6 | 625 | 97.9965 | 2.42265 |
51 | SRR9430938 | 31509 | CAPILLARY | FJ425188.1 | 30995 | 30995 | 0.990162 | 0.999968 | 0 | 207 | 96.644 | 4.72772 |
52 | ERR4181768 | 26418 | ILLUMINA | MT325599.1 | 29882 | 26412 | 0.999735 | 0.88381 | 0 | 0 | 99.996 | 0 |
53 | SRR6713767 | 28129 | ILLUMINA | KF650373.1 | 28038 | 27988 | 0.999964 | 0.998181 | 62 | 207 | 95.074 | 8.37754 |
54 | SRR11494457 | 25301 | ILLUMINA | MT451603.1 | 29816 | 25206 | 0.998933 | 0.845351 | 2 | 71 | 99.998 | 0.00282843 |
55 | ERR4182443 | 30141 | OXFORD_NANOPORE | MT627426.1 | 29903 | 24131 | 0.999967 | 0.998796 | 1 | 291 | 99.9065 | 0.0148492 |
56 | SRR11859139 | 26806 | ILLUMINA | MT614525.1 | 29782 | 26752 | 0.999963 | 0.899201 | 7 | 32 | 99.9945 | 0.00777817 |
57 | SRR11550039 | 32184 | ILLUMINA | FJ429166.1 | 29747 | 29709 | 0.923067 | 0.998689 | 0 | 0 | 99.997 | 0 |
58 | SRR6713905 | 28123 | ILLUMINA | KR265814.1 | 27991 | 27986 | 0.999893 | 0.999786 | 62 | 201 | 94.8763 | 8.78454 |
59 | SRR10873977 | 26576 | ILLUMINA | FJ376622.1 | 26552 | 14581 | 0.602386 | 0.602215 | 31 | 61 | 85.3877 | 8.58582 |
60 | SRR11593359 | 30105 | ILLUMINA | MT358641.1 | 29903 | 29861 | 0.9999 | 0.998562 | 0 | 242 | 99.999 | 0.00173205 |
61 | SRR11494750 | 25220 | ILLUMINA | MT628262.1 | 29900 | 25220 | 0.99996 | 0.843445 | 0 | 0 | 99.984 | 0 |
62 | SRR11593358 | 25158 | ILLUMINA | MT499184.1 | 29893 | 25036 | 0.999921 | 0.837487 | 0 | 121 | 99.994 | 0.00848528 |
63 | SRR8369129 | 26534 | ILLUMINA | DQ415911.1 | 29902 | 26528 | 0.999736 | 0.898602 | 1616 | 1306 | 95.1598 | 1.75425 |
64 | SRR11140746 | 30137 | ILLUMINA | MT039887.1 | 29879 | 29875 | 0.99917 | 0.999833 | 3 | 241 | 100 | 0 |
65 | SRR8189298 | 25031 | ILLUMINA | MK204411.1 | 28482 | 23419 | 0.998562 | 0.885507 | 331 | 90 | 94.6004 | 3.77351 |
66 | SRR11857972 | 27110 | ILLUMINA | MT628262.1 | 29900 | 27110 | 0.999963 | 0.906656 | 0 | 0 | 100 | 0 |
67 | SRR1190467 | 32181 | ILLUMINA | JX869059.2 | 30119 | 30101 | 0.936795 | 0.999369 | 0 | 47 | 99.9915 | 0.0120208 |
68 | ERR4164805 | 26748 | OXFORD_NANOPORE | MT627422.1 | 29871 | 26582 | 0.999963 | 0.889425 | 0 | 178 | 99.344 | 0.634982 |
69 | SRR1190466 | 33133 | ILLUMINA | JX869059.2 | 30119 | 30098 | 0.908369 | 0.99927 | 0 | 0 | 99.987 | 0 |
70 | ERR3013343 | 28355 | ILLUMINA | LT898439.1 | 28030 | 28021 | 0.999612 | 0.999643 | 6 | 330 | 98.5505 | 1.67409 |
71 | SRR6713902 | 28120 | ILLUMINA | KR265814.1 | 27991 | 27986 | 0.999964 | 0.999786 | 63 | 201 | 94.4063 | 8.40116 |
72 | SRR11140748 | 30108 | ILLUMINA | MT039887.1 | 29879 | 29872 | 0.999934 | 0.999732 | 10 | 245 | 99.7333 | 0.46188 |
73 | ERR3179763 | 28445 | ILLUMINA | MH687966.1 | 28389 | 28378 | 0.997434 | 0.999613 | 304 | 303 | 97.2037 | 4.79498 |
74 | SRR7239363 | 26091 | ILLUMINA | MT025058.1 | 18944 | 18940 | 0.725882 | 0.999736 | 0 | 0 | 100 | 0 |
75 | SRR6425313 | 26182 | ILLUMINA | KY994645.1 | 30684 | 26181 | 0.999924 | 0.853181 | 0 | 0 | 99.939 | 0 |
76 | SRR5872097 | 26518 | ILLUMINA | KY674943.1 | 29908 | 26516 | 0.999887 | 0.891166 | 183 | 45 | 100 | 0 |
77 | ERR4182427 | 30352 | OXFORD_NANOPORE | MT374105.1 | 29900 | 29863 | 0.999934 | 0.998696 | 1 | 502 | 98.2703 | 1.49561 |
78 | SRR7623977 | 27342 | ILLUMINA | KY364365.1 | 25422 | 17094 | 0.998793 | 0.999253 | 0 | 1906 | 99.79 | 0.0749533 |
79 | SRR6713880 | 28097 | ILLUMINA | KJ645704.1 | 27989 | 27964 | 0.999964 | 0.999071 | 69 | 206 | 94.7037 | 7.9971 |
80 | ERR4181748 | 30021 | ILLUMINA | MT079845.1 | 29955 | 29937 | 0.9999 | 0.999967 | 273 | 333 | 97.1092 | 3.34857 |
81 | SRR11494725 | 25279 | ILLUMINA | MT520521.1 | 29872 | 25217 | 0.99996 | 0.844403 | 3 | 58 | 99.996 | 0.00565685 |
82 | SRR7623978 | 25432 | ILLUMINA | KY364365.1 | 25422 | 15970 | 0.999961 | 0.997246 | 3 | 82 | 99.784 | 0.066468 |
83 | SRR10873981 | 26508 | ILLUMINA | FJ376622.1 | 26552 | 14582 | 0.602799 | 0.602215 | 0 | 0 | 83.2568 | 7.64135 |
84 | ERR963027 | 28162 | ILLUMINA | KJ813439.1 | 30123 | 28042 | 0.998225 | 0.930883 | 0 | 71 | 98.5985 | 1.94666 |
85 | SRR11494532 | 25237 | ILLUMINA | MT627426.1 | 29903 | 25217 | 0.999168 | 0.84326 | 0 | 0 | 99.996 | 0 |
86 | SRR6713893 | 28113 | ILLUMINA | KJ645691.1 | 27998 | 27982 | 0.999964 | 0.999393 | 66 | 201 | 95.1093 | 8.32581 |
87 | SRR10915173 | 38489 | ILLUMINA | KJ473822.1 | 30247 | 30256 | 0.786069 | 0.999967 | 0 | 0 | 98.38 | 0 |
88 | SRR1192073 | 32442 | ILLUMINA | JX869059.2 | 30119 | 30098 | 0.927717 | 0.99927 | 0 | 0 | 99.987 | 0 |
89 | SRR11393704 | 30190 | ILLUMINA | MT079851.1 | 30018 | 29884 | 0.998112 | 0.9993 | 115 | 252 | 99.3942 | 0.750928 |
90 | SRR11621903 | 25821 | ILLUMINA | MT520215.1 | 29849 | 25824 | 0.999961 | 0.865088 | 0 | 0 | 99.969 | 0 |
91 | SRR6893872 | 25471 | ILLUMINA | KY364365.1 | 25422 | 16001 | 0.999961 | 0.998781 | 3 | 82 | 99.781 | 0.0707107 |
92 | SRR10829954 | 25269 | ILLUMINA | KY420075.1 | 27967 | 25212 | 0.997151 | 0.901455 | 0 | 0 | 99.917 | 0 |
93 | SRR11621817 | 29932 | ILLUMINA | MT627434.1 | 29882 | 15955 | 0.999365 | 0.99585 | 8 | 163 | 99.9934 | 0.00920869 |
94 | SRR11578301 | 27986 | ILLUMINA | MT628262.1 | 29900 | 27986 | 0.999964 | 0.935953 | 0 | 0 | 99.986 | 0 |
95 | SRR11826984 | 25450 | ILLUMINA | MT520215.1 | 29849 | 25431 | 0.999214 | 0.851955 | 0 | 0 | 99.996 | 0 |
96 | ERR1301491 | 28279 | ILLUMINA | MH687944.1 | 28219 | 28067 | 0.999965 | 0.994578 | 5 | 217 | 99.6527 | 0.601599 |
97 | ERR2738384 | 28445 | ILLUMINA | MH687966.1 | 28389 | 28378 | 0.997434 | 0.999613 | 304 | 303 | 97.2037 | 4.79498 |
98 | SRR11857956 | 26400 | ILLUMINA | MT558673.1 | 29817 | 26400 | 0.999962 | 0.885367 | 0 | 0 | 100 | 0 |
99 | SRR11622083 | 26308 | ILLUMINA | MT627426.1 | 29903 | 26287 | 0.999164 | 0.878975 | 0 | 0 | 99.992 | 0 |
100 | SRR11968904 | 28295 | BGISEQ | MT568634.1 | 29861 | 28295 | 0.999965 | 0.947524 | 0 | 0 | 100 | 0 |
101 | ERR4182437 | 30333 | OXFORD_NANOPORE | MT627422.1 | 29871 | 29809 | 0.999967 | 0.997891 | 14 | 546 | 97.0783 | 3.43648 |
102 | SRR11597213 | 30136 | ILLUMINA | MT499184.1 | 29893 | 29835 | 0.997379 | 0.998026 | 0 | 223 | 97.0965 | 4.10193 |
103 | SRR1033849 | 25183 | ION_TORRENT | EU418976.1 | 27636 | 25068 | 0.999563 | 0.904653 | 8 | 122 | 97.461 | 2.15809 |
104 | SRR6713743 | 28153 | ILLUMINA | KJ645704.1 | 27989 | 26015 | 0.998899 | 0.995069 | 40 | 313 | 98.7857 | 1.24156 |
105 | SRR11939551 | 29914 | ILLUMINA | MT628262.1 | 29900 | 29683 | 0.999933 | 0.992709 | 1 | 231 | 100 | 0 |
106 | SRR11597148 | 25229 | ILLUMINA | MT259230.1 | 29866 | 25229 | 0.99996 | 0.844706 | 0 | 0 | 99.996 | 0 |
107 | ERR4182440 | 26986 | OXFORD_NANOPORE | MT627426.1 | 29903 | 26881 | 0.999963 | 0.89884 | 4 | 125 | 99.9535 | 0.0657609 |
108 | SRR12003638 | 30104 | ILLUMINA | MT535488.1 | 29903 | 29854 | 0.999801 | 0.998294 | 14 | 259 | 99.763 | 0.465374 |
109 | SRR11916011 | 26805 | ILLUMINA | MT627426.1 | 29903 | 26716 | 0.999217 | 0.893389 | 6 | 74 | 99.329 | 0.912168 |
110 | ERR4164903 | 32040 | OXFORD_NANOPORE | MT520274.1 | 29749 | 1479 | 0.987797 | 0.0496823 | 0 | 30168 | 99.6514 | 1.10076 |
111 | SRR11494506 | 25214 | ILLUMINA | MT628262.1 | 29900 | 25210 | 0.999802 | 0.84311 | 0 | 0 | 99.996 | 0 |
112 | SRR11140749 | 31251 | OXFORD_NANOPORE | MT121215.1 | 29945 | 29911 | 0.995648 | 0.998831 | 8 | 1206 | 97.5272 | 1.59119 |
113 | SRR1191915 | 32532 | ILLUMINA | JX869059.2 | 30119 | 30094 | 0.925028 | 0.999137 | 0 | 0 | 99.987 | 0 |
The table above: https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v3.blastn-QC.txt Complete BLAST results for all catA-v3"s: https://serratus-public.s3.amazonaws.com/assemblies/analysis/catA-v3.megablast-nt-fmt6.gz Script used to generate all those results: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/stats/megablast_alignment_stats.ipynb
Also I had a look at one of those flagged by @rcedgar : SRR7716793 partial_loclip NC_001846 91.8:
I see a 91.8% IDY near 100% coverage alignment:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
SRR7716793.coronaspades.NODE_1_length_31483_cluster_1_candidate_1_domains_34 MF416379.1 91.849 31467 2429 97 18 31411 11 31414 0.0 43809
So, this one is probably fine?
That looks good to me; if there is one long alignment covering most of the query contig and most of a full-length genome, then it has to be a pretty good assembly. 92% id is unusually low, borderline novel species, though "species" is a pretty arbitrary concept with viruses.
I believe the majority of the issues were addressed in the last coronaSPAdes snapshot.
ERR4164903.coronaspades.NODE_3_length_32040_cluster_3_candidate_1_domains_43
I can't quite figure out what is going on; there is a ~1,500nt segment in the contig that matches Cov-2 with 100% id, this segment appears multiple times in the contig. If you search for
TGGGGCCGACGTTGTTTTGATCGCGCCCCACT
you will find >10 copies of this motif. Repeated traversal of loops in the assembly graph is my guess.@taltman what does your annotation pipeline generate for this one? Is it able to filter cases like this, or do we need a different QC check?
ERR4181727.coronaspades.NODE_1_length_29749_cluster_1_candidate_1_domains_42
, it has two long segments of Cov-2 in the wrong order.