Open 000generic opened 1 year ago
If there are any comments or guidance on this, I can include them in the talk I'm recording this Friday for the Biodiversity Genomics conference in October. I'll be showing the coverage pattern, but for now I'll present it as unresolved. I assume it's some mistake or artefact on my side, but I have no idea what to make of it, really.
Thank you!
Well, frankly speaking, the number of contigs does not mean anything by itself. What is much more important is that the assembly length increased drastically: your 50x assembly is twice as large as the 30x one.
Since you're assembling something wild-caught, you need to consider multiple factors:
So, as coverage increases, you start to assemble all the low-covered stuff: you see more variants (inter- and intra-genome), and you see more contamination. More importantly, these artifacts start to assemble, so the assembler will not remove them as sequencing errors or other relatively low-covered material.
PS: You have not attached your spades.log files; the files you uploaded are identical to the params files.
Hi ASL - thanks for your reply!
Sorry for the mistake on uploading - attached are the log files.
I agree with what you highlight. However, it's worth noting that the expected genome size is reached at 30x: a 2015 genome publication for a species in the genus reports a genome of 300-350 Mb, and at 30x the assembled genome size is ~330-360 Mb. But then genome size roughly doubles going from 30x to 35x and goes beyond double as coverage increases further. This feels like the haplotypes suddenly and dramatically co-assembled across the 30x-to-35x transition, given the near doubling of expected genome size. However, BUSCO scores are cut in half with no real change in duplicates. So maybe heterozygosity suddenly takes hold at 35x, the haplotypes attempt to co-assemble but largely fail, and things get jumbled enough that many genes that were well formed at 30x are corrupted at 35x and unrecognized by BUSCO?
Overall, it's not specifically the number of contigs, but the interplay of coverage, contigs, and BUSCO that confuses me. BUSCO increases nicely (as does the number of contigs) as coverage increases, peaking at 30x, but then BUSCO is cut in half at 35x while the number of contigs increases dramatically (roughly doubles). A similar pattern appears in another species in the genus, but with the inflection point at 25x-30x rather than 30x-35x. I have more species in the genus, and several other clades with multiple species per genus, that I can explore similarly in the coming month.
Also of note: in BUSCO there is no increase in duplicates as coverage increases, before or after the inflection point, so contamination seems unlikely. Sequencing is of single individuals, so as above it could be heterozygosity of haplotypes in wild-caught species. It kind of feels that way, given that the BUSCO score is roughly halved and the assembly size roughly doubles at the inflection point. But it's a very sharp response within a series of coverages, and in two different species, and I would expect heterozygosity to produce a graded response with increasing coverage. Maybe there is a threshold in SPAdes, or something like that?
Attached are BUSCO scores with coverage to give a better sense of the pattern in one species.
Maybe the logs can offer ideas for parameters to test, or hints about what might be going on.
Thanks again for your helpful reply and taking the time to look at things!
As far as I can see from the log, the 30x case is definitely a coincidence.
Because of the elevated levels of heterozygosity (and low coverage), SPAdes is having a hard time estimating various coverage-based cutoffs. E.g.:
0:09:01.804 2832M / 18G INFO General (kmer_coverage_model.cpp : 181) Kmer coverage valley at: 0
0:09:01.807 2831M / 18G INFO General (kmer_coverage_model.cpp : 201) K-mer histogram maximum: 4
0:09:01.811 2831M / 18G INFO General (kmer_coverage_model.cpp : 237) Estimated median coverage: 5. Coverage mad: 4.4478
0:09:01.813 2831M / 18G INFO General (kmer_coverage_model.cpp : 259) Fitting coverage model
0:09:01.827 2832M / 18G INFO General (kmer_coverage_model.cpp : 295) ... iteration 2
0:09:01.852 2832M / 18G INFO General (kmer_coverage_model.cpp : 295) ... iteration 4
0:09:01.920 2832M / 18G INFO General (kmer_coverage_model.cpp : 295) ... iteration 8
0:09:02.053 2832M / 18G INFO General (kmer_coverage_model.cpp : 295) ... iteration 16
0:09:02.298 2832M / 18G INFO General (kmer_coverage_model.cpp : 295) ... iteration 32
0:09:02.803 2832M / 18G INFO General (kmer_coverage_model.cpp : 295) ... iteration 64
0:09:03.784 2832M / 18G INFO General (kmer_coverage_model.cpp : 295) ... iteration 128
0:09:05.261 2832M / 18G INFO General (kmer_coverage_model.cpp : 309) Fitted mean coverage: 7.57875. Fitted coverage std. dev: 3.52368
0:09:05.265 2832M / 18G INFO General (kmer_coverage_model.cpp : 334) Probability of erroneous kmer at valley: 0.992904
0:09:05.271 2832M / 18G WARN General (kmer_coverage_model.cpp : 366) Failed to determine erroneous kmer threshold. Threshold set to: 0
0:09:05.273 2832M / 18G INFO General (kmer_coverage_model.cpp : 375) Estimated genome size (ignoring repeats): 0
0:09:05.276 2831M / 18G INFO General (genomic_info_filler.cpp : 57) Failed to estimate mean coverage
0:09:05.279 2831M / 18G INFO General (genomic_info_filler.cpp : 70) EC coverage threshold value was calculated as 0
As a result, the coverage cutoff was set to zero. This happens in both the "30x" case (in the majority of the cases where we had to estimate the coverage) and the "35x" case (where the real coverage actually seems to be ~10x).
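This failure mode can be sketched with a toy k-mer coverage histogram. This is deliberately simplified (SPAdes fits a parametric mixture model, not this): it only shows the geometry of why, at low genomic coverage, the error peak and the genomic peak merge and no valley can be found, so the threshold falls back to 0.

```python
import math

def toy_histogram(genomic_cov, n_error=5_000_000, n_genomic=1_000_000, max_c=60):
    """Toy k-mer coverage histogram: fast-decaying error k-mers plus a
    Poisson-shaped genomic peak centered near `genomic_cov`."""
    hist = []
    for c in range(1, max_c + 1):
        errors = n_error * 0.5 ** c
        genomic = n_genomic * math.exp(-genomic_cov) * genomic_cov ** c / math.factorial(c)
        hist.append(errors + genomic)
    return hist

def find_valley(hist):
    """Coverage of the first local minimum after 1, i.e. the error/genomic
    valley; 0 if the histogram just decays and the peaks are not separable."""
    for i in range(1, len(hist) - 1):
        if hist[i] < hist[i - 1] and hist[i] <= hist[i + 1]:
            return i + 1  # index 0 corresponds to coverage 1
    return 0
```

With a well-separated genomic peak (say 30x) a valley exists and a sensible cutoff can be placed in it; with a ~5x genomic peak the combined histogram is monotonically decreasing and `find_valley` returns 0, mirroring the "Threshold set to: 0" lines above.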
However, in the 30x case the problem at one point slipped through the cracks and we got:
0:08:34.352 2342M / 20G INFO General (kmer_coverage_model.cpp : 237) Estimated median coverage: 118. Coverage mad: 2.9652
0:08:34.356 2342M / 20G INFO General (kmer_coverage_model.cpp : 259) Fitting coverage model
0:08:34.459 2343M / 20G INFO General (kmer_coverage_model.cpp : 295) ... iteration 2
0:08:34.667 2343M / 20G INFO General (kmer_coverage_model.cpp : 295) ... iteration 4
0:08:35.406 2343M / 20G INFO General (kmer_coverage_model.cpp : 295) ... iteration 8
0:08:36.308 2343M / 20G INFO General (kmer_coverage_model.cpp : 295) ... iteration 16
0:08:37.913 2343M / 20G INFO General (kmer_coverage_model.cpp : 309) Fitted mean coverage: 74.3188. Fitted coverage std. dev: 34.2389
0:08:37.918 2343M / 20G WARN General (kmer_coverage_model.cpp : 327) Valley value was estimated improperly, reset to 14
0:08:37.927 2343M / 20G INFO General (kmer_coverage_model.cpp : 334) Probability of erroneous kmer at valley: 1
0:08:37.929 2343M / 20G INFO General (kmer_coverage_model.cpp : 358) Preliminary threshold calculated as: 48
0:08:37.933 2343M / 20G INFO General (kmer_coverage_model.cpp : 362) Threshold adjusted to: 44
0:08:37.934 2343M / 20G INFO General (kmer_coverage_model.cpp : 375) Estimated genome size (ignoring repeats): 2858018
0:08:37.934 2342M / 20G INFO General (genomic_info_filler.cpp : 55) Mean coverage was calculated as 74.3188
0:08:37.934 2342M / 20G INFO General (genomic_info_filler.cpp : 70) EC coverage threshold value was calculated as 44
0:08:37.934 2342M / 20G INFO General (genomic_info_filler.cpp : 71) Trusted kmer low bound: 0
So, the k-mer coverage model went haywire: the genome coverage was estimated to be ~75x, putting the erroneous-connection coverage threshold at 44x (instead of the conservative 0 of the other cases!).
As a result, judging by this erroneously elevated threshold, it seems that lots of real genome parts were randomly removed here and there. This is from the 30x log:
0:17:55.818 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 372305 times
0:18:48.523 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 244139 times
0:19:12.931 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 46501 times
0:19:20.325 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 16245 times
0:19:23.379 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 8373 times
0:19:25.140 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 5124 times
0:19:26.361 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 3500 times
0:19:27.315 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 2600 times
0:19:28.072 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 1955 times
0:19:28.708 12G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 1602 times
...
And the corresponding 35x log:
0:18:37.385 14G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 25077 times
0:18:41.967 14G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 0 times
0:18:42.138 14G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 0 times
0:18:42.278 14G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 0 times
0:18:42.413 14G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 0 times
0:18:42.547 14G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 0 times
0:18:42.692 14G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 0 times
0:18:42.831 14G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 0 times
0:18:42.976 14G / 20G INFO Simplification (parallel_processing.hpp : 170) Low coverage edge remover triggered 0 times
...
Mind the differences!
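A back-of-the-envelope illustration of why a 44x cutoff is destructive while a 0x cutoff is merely permissive. This assumes edge coverages are roughly Poisson-distributed around the true ~10x; real assembly-graph coverages are messier, but the arithmetic is the same.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw from a Poisson distribution (Knuth's method; fine for small lambda)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while p > threshold:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(42)
# Simulated edge coverages for a genome whose true k-mer coverage is ~10x.
coverages = [poisson_sample(10, rng) for _ in range(100_000)]

def surviving_fraction(covs, cutoff):
    """Fraction of edges kept when everything below `cutoff` is removed."""
    return sum(c >= cutoff for c in covs) / len(covs)
```

At cutoff 0 everything survives; at cutoff 3 almost all real edges survive; at cutoff 44 essentially nothing at ~10x true coverage can survive, which is consistent with the huge "Low coverage edge remover" counts in the 30x log above.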
That said, I would say that the 30x assembly results should be taken with a huge grain of salt.
You're running your assemblies in normal isolate mode, which expects even coverage of the genome in question. Apparently, that is not your case. I would suggest giving metagenome / single-cell mode a try.
Description of bug
I am using SPAdes for genome assembly of Illumina PE data, and while every job I've run has completed cleanly, I'm finding there is an inflection point in genome coverage after which the number of contigs dramatically increases while the number of complete BUSCO genes dramatically decreases. I'm new to SPAdes and this may be a common finding, but it surprised me, and I didn't see anything about it in the current or archived issues I checked. I've found the same result across a series of genome assemblies for two different species in the same genus (Lottia) so far.
My input data are Illumina PE reads (ILLUMINA (HiSeq X Ten) run: 83.3M spots, 25G bases) generated this spring. My pipeline going into SPAdes includes dropping all reads with more than 5% Ns. The expected genome size for either species is 300-350 Mb. I filter the reads to produce 5x, 10x, 20x, 25x, 30x, 35x, 40x, and 50x datasets, plus the full dataset. I then run SPAdes to assemble each and compare statistics such as the number of contigs, N50, and BUSCO scores for Eukaryota, Metazoa, and Mollusca. What I find is that
for 5x up to 25x or 30x read coverage of an estimated 300 Mb genome, the number of contigs decreases (from the 400,000s down to the 180,000s) and the number of complete BUSCOs increases (from <10% up to almost 70%)
HOWEVER
one step up in coverage, from 25x to 30x (Lottia scutum) or from 30x to 35x (Lottia digitalis), and the number of contigs explodes to ~2 million, increasing up to 5 million with further increases in reads. At the same time, BUSCO scores drop precipitously to around 35% and can go lower with more reads (down to the upper 20s at worst). You can see examples for Lottia digitalis attached.
Genome assembly stats - Sheet1.pdf
The N50s (and auNs) are more consistent and closer to what I expected over the coverage series, but the assemblies are all fairly fragmented, so I'm not sure how useful those stats are.
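For reference, the coverage arithmetic behind the subsets, using the run totals quoted above (the genome-size midpoint is my assumption, and the real subsampling was done on reads, not computed this way):

```python
# Numbers quoted in the report above.
total_bases = 25e9        # ~25 Gb of Illumina PE data
total_spots = 83.3e6      # read pairs ("spots")
genome_size = 325e6       # assumed midpoint of the expected 300-350 Mb

total_coverage = total_bases / genome_size   # ~77x available in the full dataset

def spots_for_coverage(target_cov):
    """Read pairs to sample for a target coverage, by simple proportion."""
    return total_spots * target_cov / total_coverage
```

So the full dataset holds roughly 77x of a 325 Mb genome, and a 30x subset corresponds to roughly 32-33M read pairs.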
We sequenced wild-caught animals, so heterozygosity might be high. I guess this could be causing things to fragment rather than assembling haplotypes in parallel in SPAdes (contigs increase and BUSCO single-copy completes decrease while BUSCO duplicates do not increase; whereas if haplotypes co-assembled, I would expect BUSCO duplicates to be high). And even if this is an effect of heterozygosity, I don't see why it should happen so dramatically, at what seems to be a kind of inflection point.
So I'm not sure what to make of it: whether it's informatics artifacts or the biology of the genomes at play, or whether I've simply done something wrong in running SPAdes or in generating the read subsets. That said, the full read datasets, without any filtering for N-rich reads or for coverage, fit the general pattern, suggesting it's not my pre-SPAdes processing.
Any suggestions would be greatly appreciated! If there is a previous issue on this already, I apologize that I missed it.
Thank you :) Eric
spades.log
30x-spades.log 35x-spades.log
params.txt
30x-params.txt 35x-params.txt
SPAdes version
SPAdes 3.15.5
Operating System
Debian Buster v10.X. CPU Count: 64 : "GenuineIntel Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz (2 chips x 16 cores : 32 hyperthread cores)"
Python Version
Python 3.11.4
Method of SPAdes installation
Conda
No errors reported in spades.log