ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
250 stars 32 forks source link

A portion of the master table assemblies will be re-run/re-annotated #214

Closed rchikhi closed 3 years ago

rchikhi commented 3 years ago

This is just to bring a matter to your attention since it affects assemblies inside the master table. Input is not necessarily needed.

We have found a problem with all assemblies: they were all run with single-end mode, regardless of whether reads were paired-end. This was due a bug in my script where I trusted the output of a read type detection software, which turned out to always spit 'single-end' on fastp output. So, as a result, assemblies were sometimes fragmented into 2 or more contigs. coronaSPAdes could have performed paired-end scaffolding to properly merge contigs, but didn't, as the script told it do disregard paired-end info. Assemblies that are already single-contig obviously aren't affected.

Got the greenlight from @ababaian to re-run some assemblies tonight. Turns out there aren't that many critically affected.

Among the 10,897 master table assemblies, 564 of the filltered ones (checkv or BGC) have >= 2 contigs OR identity lower than 97% than any complete genome OR were unaligned to any genome. I'll rerun those 564.

I'll wait for @taltman signal to rerun, as improvements to Darth are planned for these fragmented assemblies.

taltman commented 3 years ago

Thanks @rchikhi. If you have the data at your fingertips, could you please send me five examples from among the 564 with varying levels of fragmentation? I want to manually run DARTH on all of them and closely examine how it performs, to make sure that nothing is being missed due to there not being a single contig. Thanks!

rchikhi commented 3 years ago

Frank: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/ERR2756788.fa

65 contigs, 44 kbp total, 96% identity to complete genome neighbor: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/ERR4145311.fa

2 contigs, 2kbp total, unaligned to anything: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/SRR8617922.fa

7 contigs, 15 kbp total, unaligned to anything: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/SRR8389793.fa

3 contigs, 38 kbp total, 92.7% identity to closest neighbor: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/SRR5447152.fa

taltman commented 3 years ago

Implemented and tested extensions to DARTH to make it handle multi-contig assemblies. Handed off to @rchikhi. Please let me know ASAP if you see any issues.

rchikhi commented 3 years ago

The new master table is here: s3://serratus-rayan/sra_master_table.csv the old master table is still available (just in case).

The new assemblies & annotations are at the same place (overwriting the old ones):

The new table has 11125 accessions. This is 229 more than before (10,896). This isn't due to rescaffolding. Probably these new accessions (such as SRR9943945, SRR9128947) didn't complete assembly by the time I had downloaded data for the previous master table.

A couple interesting before/after cases:

Frank:

accession length nb_contigs rs_neighbor rs_pctid genome_neighbor genome_pctid frag_neighbor frag_pctid tech
before ERR2756788 29237 2 NC_028824.1 83.0 MN611518.1 86.3 KF530128.1 90.5 ILLUMINA
after ERR2756788 29164 1 NC_028824.1 87.9 MN611518.1 90.3 KF530128.1 93.3 ILLUMINA

Another accession with a scaffolding effect:

accession length nb_contigs rs_neighbor rs_pctid genome_neighbor genome_pctid frag_neighbor frag_pctid tech
before SRR7287114 65229 6 NC_002306.3 92.7 KY292377.1 92.4 KF530130.1 91.7 ILLUMINA
after SRR7287114 28414 1 NC_002306.3 93.2 MN165107.1 92.0 KF530130.1 90.7 ILLUMINA

I'll post some more stats later e.g. how many accessions have changed and how many have not. Eyeballing the results show that the majority of differences are more minor than those above.

rchikhi commented 3 years ago

The list of the 564 accessions affected by this rescaffolding is here btw.

rchikhi commented 3 years ago

As per a discussion on Slack, here is the list of accessions which have a discrepancy (likely due to a bug in my scripts) between the old master table and the new master table, which isn't due to rescaffolding: SRR9966507 SRR9613509 (spotted by @asl) SRR9831606 SRR9432058 SRR9432086 SRR9432154

rchikhi commented 3 years ago

Annotations for BGC-filtered assemblies were not done. This is fixed now. All annotations for multi-contig assemblies have been re-run, as per https://github.com/ababaian/serratus/issues/213#issuecomment-663006125

rchikhi commented 3 years ago

Oh and also, I promised to post how many accessions have changed in the before/after master table. Here it is:

44 accessions out of the 564 rescaffolded ones have a difference in number of contigs or in total length.

Here is the before rescaffolding / after rescaffolding table.

accession #contigs (before) #contigs (after) Total length (before) Total length (after) * = wasn't part of the rescaffolding
SRR8389831 5 5 16519 14735
SRR8389837 5 3 13404 13306
SRR8389849 4 1 15646 14804
SRR9966507 1 1 419 251 *
SRR11907539 13 13 9154 8944
SRR8942329 3 2 22771 22698
ERR3994145 5 4 5952 5903
ERR380874 4 2 4522 4424
ERR380879 3 2 9062 9013
SRR5040909 4 3 35080 37425
SRR9613509 1 1 248 8997 *
SRR5040921 18 1 7745 591
ERR963021 2 2 6730 9592
SRR7267813 4 4 1139 6716
SRR11550034 5 5 4167 6366
SRR8731051 2 3 7268 9230
SRR7287110 4 3 45276 33975
SRR7287114 6 1 65229 28414
SRR9831606 1 1 605 211 *
SRR9432058 2 1 2277 993 *
ERR2744260 3 2 17885 19907
ERR1301485 22 16 224017 139228
ERR2744261 2 2 13520 14180
ERR2744262 2 2 6242 7646
ERR2744266 5 3 60972 62036
ERR2744268 3 2 27289 27216
DRR220588 3 3 45280 28490
SRR9432086 1 1 959 957 *
DRR220591 6 9 104768 111953
DRR220592 7 9 48852 82460
DRR220596 9 14 49451 95561
SRR11550087 3 3 6070 8137
SRR8570622 2 2 924 7085
ERR3569452 3 3 9399 9923
ERR2756788 2 1 29237 29164
ERR3569488 3 2 22772 25395
SRR1985227 2 1 3358 3408
SRR11913987 2 1 867 794
ERR1301573 11 11 74621 70314
SRR9432154 1 2 1194 2209 *
SRR7866683 3 3 848 2291
SRR8389791 7 3 20114 20859
SRR8389793 7 5 15535 16597
SRR8389794 5 4 15247 15198
rchikhi commented 3 years ago

Further checks:

After parsing coronaspades.txt log file, I extracted which assemblies were re-run as paired-end. There were 153 (out of 564 to rescaffold): https://serratus-public.s3.amazonaws.com/assemblies/analysis/rescaffolded.paired_end.in.log.txt The other ones were either not re-run or were re-run as single-end. (keep in mind that the 564 accessions that were scheduled to be rescaffolded were not necessarily having paired-end reads..)

Along the way, I manually noticed that a few paired-end samples fell through the cracks and haven't been reassembled, even though they should have been. What most likely happened is that the Batch job failed (for any reason, e.g. SRA download failed, instance got canceled, etc). Examples of such paired-end samples that should have been rescaffolded but aren't: ERR1600615, SRR5819067, SRR8741028.

Among the 153 true paired-end and rescaffolded accessions, I checked whether the unfiltered gene_clusters.fa file had N's: none did.

However, some whole assemblies have N's: for example, ERR1301475 does, in its contigs.fa.mfc file: https://serratus-public.s3.amazonaws.com/assemblies/other/ERR1301475.coronaspades/ERR1301475.coronaspades.contigs.fa.mfc (make sure to decompress it using MFCompress); but not in the gene_clusters.fa file.

So, I'm inclined to say that given the small number of paired-end rescaffolded accessions, we just didn't end up with a gene_clusters.fa file containing N's.