A portion of the master table assemblies will be re-run/re-annotated

rchikhi commented 3 years ago

This is just to bring a matter to your attention since it affects assemblies inside the master table. Input is not necessarily needed.

We have found a problem with all assemblies: they were all run with single-end mode, regardless of whether reads were paired-end. This was due a bug in my script where I trusted the output of a read type detection software, which turned out to always spit 'single-end' on fastp output. So, as a result, assemblies were sometimes fragmented into 2 or more contigs. coronaSPAdes could have performed paired-end scaffolding to properly merge contigs, but didn't, as the script told it do disregard paired-end info. Assemblies that are already single-contig obviously aren't affected.

Got the greenlight from @ababaian to re-run some assemblies tonight. Turns out there aren't that many critically affected.

Among the 10,897 master table assemblies, 564 of the filltered ones (checkv or BGC) have >= 2 contigs OR identity lower than 97% than any complete genome OR were unaligned to any genome. I'll rerun those 564.

I'll wait for @taltman signal to rerun, as improvements to Darth are planned for these fragmented assemblies.

taltman commented 3 years ago

Thanks @rchikhi. If you have the data at your fingertips, could you please send me five examples from among the 564 with varying levels of fragmentation? I want to manually run DARTH on all of them and closely examine how it performs, to make sure that nothing is being missed due to there not being a single contig. Thanks!

rchikhi commented 3 years ago

Frank: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/ERR2756788.fa

65 contigs, 44 kbp total, 96% identity to complete genome neighbor: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/ERR4145311.fa

2 contigs, 2kbp total, unaligned to anything: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/SRR8617922.fa

7 contigs, 15 kbp total, unaligned to anything: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/SRR8389793.fa

3 contigs, 38 kbp total, 92.7% identity to closest neighbor: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/SRR5447152.fa

taltman commented 3 years ago

Implemented and tested extensions to DARTH to make it handle multi-contig assemblies. Handed off to @rchikhi. Please let me know ASAP if you see any issues.

rchikhi commented 3 years ago

The new master table is here: s3://serratus-rayan/sra_master_table.csv the old master table is still available (just in case).

The new assemblies & annotations are at the same place (overwriting the old ones):

The new table has 11125 accessions. This is 229 more than before (10,896). This isn't due to rescaffolding. Probably these new accessions (such as SRR9943945, SRR9128947) didn't complete assembly by the time I had downloaded data for the previous master table.

A couple interesting before/after cases:

Frank:

	accession	length	nb_contigs	rs_neighbor	rs_pctid	genome_neighbor	genome_pctid	frag_neighbor	frag_pctid	tech
before	ERR2756788	29237	2	NC_028824.1	83.0	MN611518.1	86.3	KF530128.1	90.5	ILLUMINA
after	ERR2756788	29164	1	NC_028824.1	87.9	MN611518.1	90.3	KF530128.1	93.3	ILLUMINA

Another accession with a scaffolding effect:

	accession	length	nb_contigs	rs_neighbor	rs_pctid	genome_neighbor	genome_pctid	frag_neighbor	frag_pctid	tech
before	SRR7287114	65229	6	NC_002306.3	92.7	KY292377.1	92.4	KF530130.1	91.7	ILLUMINA
after	SRR7287114	28414	1	NC_002306.3	93.2	MN165107.1	92.0	KF530130.1	90.7	ILLUMINA

I'll post some more stats later e.g. how many accessions have changed and how many have not. Eyeballing the results show that the majority of differences are more minor than those above.

rchikhi commented 3 years ago

The list of the 564 accessions affected by this rescaffolding is here btw.

rchikhi commented 3 years ago

As per a discussion on Slack, here is the list of accessions which have a discrepancy (likely due to a bug in my scripts) between the old master table and the new master table, which isn't due to rescaffolding: SRR9966507 SRR9613509 (spotted by @asl) SRR9831606 SRR9432058 SRR9432086 SRR9432154

rchikhi commented 3 years ago

Annotations for BGC-filtered assemblies were not done. This is fixed now. All annotations for multi-contig assemblies have been re-run, as per https://github.com/ababaian/serratus/issues/213#issuecomment-663006125

rchikhi commented 3 years ago

Oh and also, I promised to post how many accessions have changed in the before/after master table. Here it is:

44 accessions out of the 564 rescaffolded ones have a difference in number of contigs or in total length.

Here is the before rescaffolding / after rescaffolding table.

accession	#contigs (before)	#contigs (after)	Total length (before)	Total length (after)	* = wasn't part of the rescaffolding
SRR8389831	5	5	16519	14735
SRR8389837	5	3	13404	13306
SRR8389849	4	1	15646	14804
SRR9966507	1	1	419	251	*
SRR11907539	13	13	9154	8944
SRR8942329	3	2	22771	22698
ERR3994145	5	4	5952	5903
ERR380874	4	2	4522	4424
ERR380879	3	2	9062	9013
SRR5040909	4	3	35080	37425
SRR9613509	1	1	248	8997	*
SRR5040921	18	1	7745	591
ERR963021	2	2	6730	9592
SRR7267813	4	4	1139	6716
SRR11550034	5	5	4167	6366
SRR8731051	2	3	7268	9230
SRR7287110	4	3	45276	33975
SRR7287114	6	1	65229	28414
SRR9831606	1	1	605	211	*
SRR9432058	2	1	2277	993	*
ERR2744260	3	2	17885	19907
ERR1301485	22	16	224017	139228
ERR2744261	2	2	13520	14180
ERR2744262	2	2	6242	7646
ERR2744266	5	3	60972	62036
ERR2744268	3	2	27289	27216
DRR220588	3	3	45280	28490
SRR9432086	1	1	959	957	*
DRR220591	6	9	104768	111953
DRR220592	7	9	48852	82460
DRR220596	9	14	49451	95561
SRR11550087	3	3	6070	8137
SRR8570622	2	2	924	7085
ERR3569452	3	3	9399	9923
ERR2756788	2	1	29237	29164
ERR3569488	3	2	22772	25395
SRR1985227	2	1	3358	3408
SRR11913987	2	1	867	794
ERR1301573	11	11	74621	70314
SRR9432154	1	2	1194	2209	*
SRR7866683	3	3	848	2291
SRR8389791	7	3	20114	20859
SRR8389793	7	5	15535	16597
SRR8389794	5	4	15247	15198

rchikhi commented 3 years ago

Further checks:

After parsing coronaspades.txt log file, I extracted which assemblies were re-run as paired-end. There were 153 (out of 564 to rescaffold): https://serratus-public.s3.amazonaws.com/assemblies/analysis/rescaffolded.paired_end.in.log.txt The other ones were either not re-run or were re-run as single-end. (keep in mind that the 564 accessions that were scheduled to be rescaffolded were not necessarily having paired-end reads..)

Along the way, I manually noticed that a few paired-end samples fell through the cracks and haven't been reassembled, even though they should have been. What most likely happened is that the Batch job failed (for any reason, e.g. SRA download failed, instance got canceled, etc). Examples of such paired-end samples that should have been rescaffolded but aren't: ERR1600615, SRR5819067, SRR8741028.

Among the 153 true paired-end and rescaffolded accessions, I checked whether the unfiltered gene_clusters.fa file had N's: none did.

However, some whole assemblies have N's: for example, ERR1301475 does, in its contigs.fa.mfc file: https://serratus-public.s3.amazonaws.com/assemblies/other/ERR1301475.coronaspades/ERR1301475.coronaspades.contigs.fa.mfc (make sure to decompress it using MFCompress); but not in the gene_clusters.fa file.

So, I'm inclined to say that given the small number of paired-end rescaffolded accessions, we just didn't end up with a gene_clusters.fa file containing N's.

ababaian / serratus

A portion of the master table assemblies will be re-run/re-annotated #214