ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/
Other
763 stars 139 forks source link

Genome coverage inflection point for assembly quality (contig number and BUSCO) #1173

Open 000generic opened 1 year ago

000generic commented 1 year ago

Description of bug

I am using Spades for genome assembly of Illumina PE data and while it runs great to completion on each job I've done, I'm finding there is an inflection point in genome coverage after which the number of contigs dramatically increases while the number of complete BUSCO genes dramatically decreases. I'm new to using Spades and this may be a common finding but it surprised me - and I didn't see anything in the issues current or archived that I checked out. I've found the same similar result across a series of genome assemblies for two different species in the same genus (Lottia) so far.

My input data are Illumina PE reads (ILLUMINA (HiSeq X Ten) run: 83.3M spots, 25G bases) generated this spring. My pipeline going into Spades includes dropping all reads with greater than 5% Ns. The expected genome size for either of the species is 300-350 Mb. I filter the reads to produce 5x, 10x, 20x, 25x, 30x, 35x, 40x, 50x, and all read data sets. I then run Spades to assemble each and compare statistics like number of contigs, n50, and BUSCO scores for Eukaryota, Metazoa, and Mollusca. What I find is that

For 5x to 25x or 30x read coverage of an estimage 300 mB genome, the number of contigs decreases (400,000s down to 180,000s) and the number of BUSCO complete increases (from <10% up to almost 70%)

HOWEVER

one step up in coverage from 25x to 30x (Lottia scutum) or from 30x to 35x (Lottia digitalis) and the number of contigs explodes to ~2 million and increase up to 5 million with increasing numbers of reads. At the same time BUSCO scores drop precipitously to around 35% and can go lower with further increases in reads (down to upper 20s at worst). You can see examples for Lottia digitalis attached.

Genome assembly stats - Sheet1.pdf

The n50s (and AuNs) are more consistent/more like what I expected over the coverage series but then the assemblies are all fairly fragmented so not sure how useful they are as stats.

We sequenced wild-caught animals - so might have high heterozygosity - I guess this could be causing things to fragment but not parallel assemble haplotypes in Spades (contigs increase and BUSCO single complete decreases but BUSCO duplicates do not increase at the same time - vs if haplotypes co-assembled I would expect BUSCO duplicates to be high) - and even if this is a effect of heterozygosity, I don't see why it should happen so dramatically as what seems to be a kind of inflection point.

So not sure what to make of it - if its the informatics x artifacts or biology of the genomes at play - or if I've simply done something wrong in running spades or in generating the read subsets - although the full read data sets without any filtering for N-rich reads or filtering for coverage fit the generally pattern - suggesting its not the pre-Spades processing that I did.

Any suggestions would be greatly appreciated! If there is a previous issue on this already, I apologize that I missed it.

Thank you :) Eric

spades.log

30x-spades.log 35x-spades.log

params.txt

30x-params.txt 35x-params.txt

SPAdes version

Spades 3.15.5

Operating System

Debian Buster v10.X. CPU Count: 64 : "GenuineIntel Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz (2 chips x 16 cores : 32 hyperthread cores)"

Python Version

Python 3.11.4

Method of SPAdes installation

Conda

No errors reported in spades.log

000generic commented 1 year ago

If there are any comments or guidance on this, I can include them in the talk I'm recording this Friday for the Biodiversity Genomics conference in October. I'll being showing the coverage pattern but for now will leave it as unresolved as to what is generating the pattern - I'm assuming some mistake or artefact on my side but have no idea what to make of it really.

Thank you!

asl commented 1 year ago

Well, # of contigs frankly speaking does not mean anything. What is much more important is that the length of assembly increased drastically. You 50x assembly is twice as large as 30x.

Since you're assembling something wild-caught, then you need to consider multiple factors:

So, as coverage coverage increases you're starting to assemble all low-covered stuff: you're seeing more variants (inter- and intra- genome), you're seeing more contamination. And, more importantly, these artifacts start to assemble (so assembler will not remove them as sequencing errors or other relatively ow-covered stuff).

PS: You have not attached your spades.log files, they are identical to params

000generic commented 1 year ago

Hi ASL - thanks for your reply!

Sorry for the mistake on uploading - attached are the log files.

I agree with what you highlight - however its worth noting, for example, that the expected genome size is reached at 30x (a 2015 genome publication for a species in the genus has a genome of 300-350 Mb - and at 30x in the assemblies genome size is at ~330-360 Mb) - but then genome size loosely doubles going from 30x to 35x and goes beyond double as coverage increases. This feels like maybe the haplotypes suddenly / dramatically co-assembled across the transition from 30x to 35x in coverage, given the near doubling of expected genome size. However BUSCO scores cut in half with no real change in duplicates. So maybe the heterozygosity is suddenly taking hold at 35x causing the haplotypes to attempt co-assembly but largely failing with a jumbling of things so that many previously well formed genes at 30x are corrupted at 35x and unrecognized by BUSCO....?

Overall, its not specifically number of contigs - but the interplay of coverage - contigs - and busco that confuses me. BUSCO increases nicely (as does number of contigs) as coverage increases and peaks at 30x and but then BUSCO cuts in half at 35x while the number of contigs increases dramatically (loosely doubles) at 35x. A similar pattern is seen in another species in the genus but inflection point is at 25x to 30x rather than 30x to 35x. I have more species in the genus and several other clades with multiple species per genus I can similarly explore in the coming month.

Also of note in BUSCO there is no increase in duplicates as coverage increases before and after the inflection point - so seems unlikely to be contamination - sequencing is of single individuals so like above - it could be heterozygosity of haplotypes in wild-caught species - it kinda feels that way in a sense, given how BUSCO score is around cut in half and assembly size roughly doubles at the inflection point - but a its a very precise response within a series of coverages - and in two different species - and I would think heterozygosity would be more of a graded response with increasing coverage - but maybe there is a threshold in Spades - or something like this?

Attached are BUSCO scores with coverage to give a better sense of the pattern in one species.

Maybe the logs can offer ideas for parameters to test / offer hints of what might be going on.

Thanks again for your helpful reply and taking the time to look at things!

Lottia digitalis coverage vs busco

spades-30x.log spades-35x.log

asl commented 1 year ago

As far as I can see from the log 30x case is definitely a coincidence.

Because of the elevated levels of heterozygosity (and low coverage) SPAdes is having hard times to estimate various coverage-based cutoffs. E.g.:

  0:09:01.804  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 181)   Kmer coverage valley at: 0
  0:09:01.807  2831M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 201)   K-mer histogram maximum: 4
  0:09:01.811  2831M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 237)   Estimated median coverage: 5. Coverage mad: 4.4478
  0:09:01.813  2831M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 259)   Fitting coverage model
  0:09:01.827  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 2
  0:09:01.852  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 4
  0:09:01.920  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 8
  0:09:02.053  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 16
  0:09:02.298  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 32
  0:09:02.803  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 64
  0:09:03.784  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 128
  0:09:05.261  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 309)   Fitted mean coverage: 7.57875. Fitted coverage std. dev: 3.52368
  0:09:05.265  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 334)   Probability of erroneous kmer at valley: 0.992904
  0:09:05.271  2832M / 18G   WARN    General                 (kmer_coverage_model.cpp   : 366)   Failed to determine erroneous kmer threshold. Threshold set to: 0
  0:09:05.273  2832M / 18G   INFO    General                 (kmer_coverage_model.cpp   : 375)   Estimated genome size (ignoring repeats): 0
  0:09:05.276  2831M / 18G   INFO    General                 (genomic_info_filler.cpp   :  57)   Failed to estimate mean coverage
  0:09:05.279  2831M / 18G   INFO    General                 (genomic_info_filler.cpp   :  70)   EC coverage threshold value was calculated as 0

As a result, the coverage cutoff was set to zero. This happens for both "30x" case (in majority of cases when we had to estimate the coverage) and "35x" case (the real coverage seems to be ~10x actually here).

However, in 30x case suddenly the problem went through the cracks and we had:

  0:08:34.352  2342M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 237)   Estimated median coverage: 118. Coverage mad: 2.9652
  0:08:34.356  2342M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 259)   Fitting coverage model
  0:08:34.459  2343M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 2
  0:08:34.667  2343M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 4
  0:08:35.406  2343M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 8
  0:08:36.308  2343M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 295)   ... iteration 16
  0:08:37.913  2343M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 309)   Fitted mean coverage: 74.3188. Fitted coverage std. dev: 34.2389
  0:08:37.918  2343M / 20G   WARN    General                 (kmer_coverage_model.cpp   : 327)   Valley value was estimated improperly, reset to 14
  0:08:37.927  2343M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 334)   Probability of erroneous kmer at valley: 1
  0:08:37.929  2343M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 358)   Preliminary threshold calculated as: 48
  0:08:37.933  2343M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 362)   Threshold adjusted to: 44
  0:08:37.934  2343M / 20G   INFO    General                 (kmer_coverage_model.cpp   : 375)   Estimated genome size (ignoring repeats): 2858018
  0:08:37.934  2342M / 20G   INFO    General                 (genomic_info_filler.cpp   :  55)   Mean coverage was calculated as 74.3188
  0:08:37.934  2342M / 20G   INFO    General                 (genomic_info_filler.cpp   :  70)   EC coverage threshold value was calculated as 44
  0:08:37.934  2342M / 20G   INFO    General                 (genomic_info_filler.cpp   :  71)   Trusted kmer low bound: 0

So, the k-mer coverage model went crazy and the genome coverage was estimated to be 75x putting the erroneous connections coverage threshold to be 44x (instead of conservative case of 0! in other case).

As a result, it seems that lots of real genome parts was randomly removed here and there judging by this erroneous elevated threshold. This is from 30x log:

  0:17:55.818    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 372305 times
  0:18:48.523    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 244139 times
  0:19:12.931    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 46501 times
  0:19:20.325    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 16245 times
  0:19:23.379    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 8373 times
  0:19:25.140    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 5124 times
  0:19:26.361    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 3500 times
  0:19:27.315    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 2600 times
  0:19:28.072    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 1955 times
  0:19:28.708    12G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 1602 times
...

And the corresponding 35x log:

  0:18:37.385    14G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 25077 times
  0:18:41.967    14G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 0 times
  0:18:42.138    14G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 0 times
  0:18:42.278    14G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 0 times
  0:18:42.413    14G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 0 times
  0:18:42.547    14G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 0 times
  0:18:42.692    14G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 0 times
  0:18:42.831    14G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 0 times
  0:18:42.976    14G / 20G   INFO   Simplification           (parallel_processing.hpp   : 170)   Low coverage edge remover triggered 0 times
...

Mind the differences!

That said, I would say that 30x assembly results should be taken with huge pile of salt.

You're doing your assemblies in normal isolate mode that expects even coverage of the genome in question. Apparently, this is not your case. I would suggest you to give metagenome / single-cell mode a try.