Closed rchikhi closed 3 years ago
Thanks @rchikhi. If you have the data at your fingertips, could you please send me five examples from among the 564 with varying levels of fragmentation? I want to manually run DARTH on all of them and closely examine how it performs, to make sure that nothing is being missed due to there not being a single contig. Thanks!
Frank: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/ERR2756788.fa
65 contigs, 44 kbp total, 96% identity to complete genome neighbor: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/ERR4145311.fa
2 contigs, 2kbp total, unaligned to anything: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/SRR8617922.fa
7 contigs, 15 kbp total, unaligned to anything: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/SRR8389793.fa
3 contigs, 38 kbp total, 92.7% identity to closest neighbor: https://serratus-rayan.s3.amazonaws.com/master_table_assemblies/SRR5447152.fa
Implemented and tested extensions to DARTH to make it handle multi-contig assemblies. Handed off to @rchikhi. Please let me know ASAP if you see any issues.
The new master table is here: s3://serratus-rayan/sra_master_table.csv
the old master table is still available (just in case).
The new assemblies & annotations are at the same place (overwriting the old ones):
The new table has 11125 accessions. This is 229 more than before (10,896). This isn't due to rescaffolding. Probably these new accessions (such as SRR9943945, SRR9128947) didn't complete assembly by the time I had downloaded data for the previous master table.
A couple interesting before/after cases:
Frank:
accession | length | nb_contigs | rs_neighbor | rs_pctid | genome_neighbor | genome_pctid | frag_neighbor | frag_pctid | tech | |
---|---|---|---|---|---|---|---|---|---|---|
before | ERR2756788 | 29237 | 2 | NC_028824.1 | 83.0 | MN611518.1 | 86.3 | KF530128.1 | 90.5 | ILLUMINA |
after | ERR2756788 | 29164 | 1 | NC_028824.1 | 87.9 | MN611518.1 | 90.3 | KF530128.1 | 93.3 | ILLUMINA |
Another accession with a scaffolding effect:
accession | length | nb_contigs | rs_neighbor | rs_pctid | genome_neighbor | genome_pctid | frag_neighbor | frag_pctid | tech | |
---|---|---|---|---|---|---|---|---|---|---|
before | SRR7287114 | 65229 | 6 | NC_002306.3 | 92.7 | KY292377.1 | 92.4 | KF530130.1 | 91.7 | ILLUMINA |
after | SRR7287114 | 28414 | 1 | NC_002306.3 | 93.2 | MN165107.1 | 92.0 | KF530130.1 | 90.7 | ILLUMINA |
I'll post some more stats later e.g. how many accessions have changed and how many have not. Eyeballing the results show that the majority of differences are more minor than those above.
The list of the 564 accessions affected by this rescaffolding is here btw.
As per a discussion on Slack, here is the list of accessions which have a discrepancy (likely due to a bug in my scripts) between the old master table and the new master table, which isn't due to rescaffolding: SRR9966507 SRR9613509 (spotted by @asl) SRR9831606 SRR9432058 SRR9432086 SRR9432154
Annotations for BGC-filtered assemblies were not done. This is fixed now. All annotations for multi-contig assemblies have been re-run, as per https://github.com/ababaian/serratus/issues/213#issuecomment-663006125
Oh and also, I promised to post how many accessions have changed in the before/after master table. Here it is:
44 accessions out of the 564 rescaffolded ones have a difference in number of contigs or in total length.
Here is the before rescaffolding / after rescaffolding table.
accession | #contigs (before) | #contigs (after) | Total length (before) | Total length (after) | * = wasn't part of the rescaffolding |
---|---|---|---|---|---|
SRR8389831 | 5 | 5 | 16519 | 14735 | |
SRR8389837 | 5 | 3 | 13404 | 13306 | |
SRR8389849 | 4 | 1 | 15646 | 14804 | |
SRR9966507 | 1 | 1 | 419 | 251 | * |
SRR11907539 | 13 | 13 | 9154 | 8944 | |
SRR8942329 | 3 | 2 | 22771 | 22698 | |
ERR3994145 | 5 | 4 | 5952 | 5903 | |
ERR380874 | 4 | 2 | 4522 | 4424 | |
ERR380879 | 3 | 2 | 9062 | 9013 | |
SRR5040909 | 4 | 3 | 35080 | 37425 | |
SRR9613509 | 1 | 1 | 248 | 8997 | * |
SRR5040921 | 18 | 1 | 7745 | 591 | |
ERR963021 | 2 | 2 | 6730 | 9592 | |
SRR7267813 | 4 | 4 | 1139 | 6716 | |
SRR11550034 | 5 | 5 | 4167 | 6366 | |
SRR8731051 | 2 | 3 | 7268 | 9230 | |
SRR7287110 | 4 | 3 | 45276 | 33975 | |
SRR7287114 | 6 | 1 | 65229 | 28414 | |
SRR9831606 | 1 | 1 | 605 | 211 | * |
SRR9432058 | 2 | 1 | 2277 | 993 | * |
ERR2744260 | 3 | 2 | 17885 | 19907 | |
ERR1301485 | 22 | 16 | 224017 | 139228 | |
ERR2744261 | 2 | 2 | 13520 | 14180 | |
ERR2744262 | 2 | 2 | 6242 | 7646 | |
ERR2744266 | 5 | 3 | 60972 | 62036 | |
ERR2744268 | 3 | 2 | 27289 | 27216 | |
DRR220588 | 3 | 3 | 45280 | 28490 | |
SRR9432086 | 1 | 1 | 959 | 957 | * |
DRR220591 | 6 | 9 | 104768 | 111953 | |
DRR220592 | 7 | 9 | 48852 | 82460 | |
DRR220596 | 9 | 14 | 49451 | 95561 | |
SRR11550087 | 3 | 3 | 6070 | 8137 | |
SRR8570622 | 2 | 2 | 924 | 7085 | |
ERR3569452 | 3 | 3 | 9399 | 9923 | |
ERR2756788 | 2 | 1 | 29237 | 29164 | |
ERR3569488 | 3 | 2 | 22772 | 25395 | |
SRR1985227 | 2 | 1 | 3358 | 3408 | |
SRR11913987 | 2 | 1 | 867 | 794 | |
ERR1301573 | 11 | 11 | 74621 | 70314 | |
SRR9432154 | 1 | 2 | 1194 | 2209 | * |
SRR7866683 | 3 | 3 | 848 | 2291 | |
SRR8389791 | 7 | 3 | 20114 | 20859 | |
SRR8389793 | 7 | 5 | 15535 | 16597 | |
SRR8389794 | 5 | 4 | 15247 | 15198 |
Further checks:
After parsing coronaspades.txt
log file, I extracted which assemblies were re-run as paired-end. There were 153 (out of 564 to rescaffold): https://serratus-public.s3.amazonaws.com/assemblies/analysis/rescaffolded.paired_end.in.log.txt
The other ones were either not re-run or were re-run as single-end. (keep in mind that the 564 accessions that were scheduled to be rescaffolded were not necessarily having paired-end reads..)
Along the way, I manually noticed that a few paired-end samples fell through the cracks and haven't been reassembled, even though they should have been. What most likely happened is that the Batch job failed (for any reason, e.g. SRA download failed, instance got canceled, etc). Examples of such paired-end samples that should have been rescaffolded but aren't: ERR1600615
, SRR5819067
, SRR8741028
.
Among the 153 true paired-end and rescaffolded accessions, I checked whether the unfiltered gene_clusters.fa
file had N's: none did.
However, some whole assemblies have N's: for example, ERR1301475 does, in its contigs.fa.mfc
file:
https://serratus-public.s3.amazonaws.com/assemblies/other/ERR1301475.coronaspades/ERR1301475.coronaspades.contigs.fa.mfc
(make sure to decompress it using MFCompress); but not in the gene_clusters.fa
file.
So, I'm inclined to say that given the small number of paired-end rescaffolded accessions, we just didn't end up with a gene_clusters.fa
file containing N's.
This is just to bring a matter to your attention since it affects assemblies inside the master table. Input is not necessarily needed.
We have found a problem with all assemblies: they were all run with single-end mode, regardless of whether reads were paired-end. This was due a bug in my script where I trusted the output of a read type detection software, which turned out to always spit 'single-end' on fastp output. So, as a result, assemblies were sometimes fragmented into 2 or more contigs. coronaSPAdes could have performed paired-end scaffolding to properly merge contigs, but didn't, as the script told it do disregard paired-end info. Assemblies that are already single-contig obviously aren't affected.
Got the greenlight from @ababaian to re-run some assemblies tonight. Turns out there aren't that many critically affected.
Among the 10,897 master table assemblies, 564 of the filltered ones (checkv or BGC) have >= 2 contigs OR identity lower than 97% than any complete genome OR were unaligned to any genome. I'll rerun those 564.
I'll wait for @taltman signal to rerun, as improvements to Darth are planned for these fragmented assemblies.