ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/

Explanation of how --cov-cutoff works #18

Closed tseemann closed 7 years ago

tseemann commented 7 years ago

The cov-cutoff parameter remains a mystery to the Spades user community. It used to be auto and now it is off.

Would it be possible to add an explanation of it to the documentation?

A common result is contigs with coverage < 1.0.

wangyugui commented 7 years ago

'a positive float number' is explained in the help output.

what is the meaning of 0--1.0 and over 1.0?

asl commented 7 years ago

what is the meaning of 0--1.0 and over 1.0?

See http://cab.spbu.ru/files/release3.10.1/manual.html#sec3.5, which explains what the reported coverage is.

wangyugui commented 7 years ago

Is cov-cutoff used to filter the contig output? Is it not used to filter the FASTQ input by k-mer coverage?

tseemann commented 7 years ago

Here is what the manual section 3.5 says:

Contigs/scaffolds names in SPAdes output FASTA files have the following format: 

>NODE_3_length_237403_cov_243.207_ID_45

Here 3 is the number of the contig/scaffold, 237403 is the sequence length in nucleotides and 243.207 is the k-mer coverage for the last (largest) k value used. 

Note that the k-mer coverage is always lower than the read (per-base) coverage.

Is the only way to get k-mer coverage < 1 to have a contig which is shorter than k_max?

(which can happen in a section of a de Bruijn graph when breaking it into contigs)
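For readers who want to check the coverage values in their own output, the header format quoted from the manual can be parsed with a few lines of Python. This is only a sketch: the function name and the regular expression are illustrative and not part of SPAdes, and the pattern assumes headers follow the NODE_..._length_..._cov_... layout shown above.

```python
import re

# Pattern for the header layout quoted from the manual:
# NODE_<number>_length_<length>_cov_<coverage>, optionally followed by _ID_<id>.
HEADER_RE = re.compile(
    r"^>?NODE_(?P<number>\d+)_length_(?P<length>\d+)_cov_(?P<cov>\d+(?:\.\d+)?)"
)

def parse_spades_header(header):
    """Return (number, length, kmer_coverage) parsed from a contig header."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a SPAdes-style header: {header!r}")
    return int(match["number"]), int(match["length"]), float(match["cov"])

number, length, cov = parse_spades_header(">NODE_3_length_237403_cov_243.207_ID_45")
print(number, length, cov)
```

Contigs with k-mer coverage below 1.0 can then be listed by comparing the third element of the returned tuple against 1.0.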

wangyugui commented 7 years ago

Is the --cov-cutoff of SPAdes applied after assembly? People may want a low k-mer coverage filter before assembly, to speed up the assembly.

kmer-mask is the tool that I wanted, but it has some problems: a) meryl is slower than Jellyfish and uses too much memory (with many threads); b) some bugs need to be fixed for big FASTQ/FASTA files. (I have a dirty patch (uint32 -> uint64), but it does not seem to be active.)

snurk commented 7 years ago

Dear @tseemann, SPAdes iteratively increases the value of K and additionally tries to glue together potentially broken regions using paired-read mapping and by searching for small overlaps. Both of these procedures add k-mers that have coverage 0, since they are not present in the reads. Also, if Ns are introduced into scaffolds, the total length of the scaffold may increase while its average k-mer coverage decreases. I hope this explains the appearance of average k-mer coverage < 1.0 in the results. On the other hand, as far as I know, SPAdes should not produce contigs shorter than k_max.
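A toy calculation illustrates the effect. It assumes, purely for illustration, that the reported value is the plain mean over the k-mers of a contig; the numbers are hypothetical and the function is not SPAdes code.

```python
def average_kmer_coverage(kmer_coverages):
    """Mean coverage over all k-mers of a contig (toy model, an assumption)."""
    return sum(kmer_coverages) / len(kmer_coverages)

# Hypothetical contig: 10 k-mers that each occur twice in the reads.
observed = [2] * 10
# Hypothetical gap closure adding 30 k-mers that are absent from the reads.
with_added = observed + [0] * 30

print(average_kmer_coverage(observed))    # 2.0
print(average_kmer_coverage(with_added))  # 0.5
```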

snurk commented 7 years ago

Dear @wangyugui

Is the --cov-cutoff of SPAdes applied after assembly?

Yes and no. It happens after the assembly graph is constructed (and after most graph simplification procedures have finished). But the low-coverage edges are actually removed from the graph, which leads to compression of the remaining unambiguous paths, and the removed edges do not interfere with subsequent repeat resolution and scaffolding. The value auto is compatible only with the uniform coverage model (no --meta or --mda flags). In this case the threshold is set automatically from a probabilistic model trained on the k-mer frequency histogram, and the value is chosen independently for every iteration. If the value is provided manually, it is interpreted as an "average nucleotide coverage" and is multiplied by (RL - K)/RL to get a threshold on average k-mer coverage for the assembly iteration with k-mer size K.
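The scaling of a manual value can be written out as a short sketch. It assumes RL denotes the read length, which the comment does not state explicitly; the function name and the example numbers (cutoff 5.0, 100 bp reads, K of 21, 33 and 55) are hypothetical and not taken from SPAdes.

```python
def kmer_coverage_threshold(cov_cutoff, read_length, k):
    """Scale a manual cutoff by (RL - K) / RL, as described in the comment."""
    if not 0 < k < read_length:
        raise ValueError("expected 0 < k < read_length")
    return cov_cutoff * (read_length - k) / read_length

# Hypothetical run: cutoff 5.0, 100 bp reads, three assembly iterations.
for k in (21, 33, 55):
    print(k, kmer_coverage_threshold(5.0, 100, k))
```

Under these assumptions the same manual cutoff gives a lower k-mer coverage threshold at larger K.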

Dear @tseemann, I hope this answers your initial question, and I would be glad to provide any clarifications.

People may want a low k-mer coverage filter before assembly, to speed up the assembly.

We are considering adding this option in the future, but currently you would have to set up your own pre-processing pipeline.
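As a rough idea of what such a pre-processing step could look like, here is a toy in-memory sketch. It is not SPAdes or kmer-mask code: the function names and the choice to drop whole reads are assumptions, reverse complements are ignored, and real data would need a dedicated k-mer counter rather than a Python Counter.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def filter_low_coverage_reads(reads, k, min_count):
    """Keep reads in which every k-mer occurs at least min_count times overall."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return [
        read for read in reads
        # Reads shorter than k have no k-mers and are dropped.
        if len(read) >= k and all(counts[km] >= min_count for km in kmers(read, k))
    ]

# Hypothetical reads: the third one contains k-mers seen only once.
reads = ["ACGTACGT", "ACGTACGT", "ACGTTCGT"]
print(filter_low_coverage_reads(reads, k=4, min_count=2))
```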

tseemann commented 7 years ago

@snurk thank you very much for responding in such detail to our questions. I'll pass this page on to the bacterial genomics community. And thank you for continuing to develop SPAdes.

asl commented 7 years ago

@tseemann We will try to explain the k-mer coverage model SPAdes uses, if time permits. Though it's already used inside kmergenie :)

jacarrico commented 7 years ago

Yes, thanks a lot for the answers and for developing SPAdes! This is fundamental for the kind of work we have been doing, which includes certification of pipelines using SPAdes.

snurk commented 7 years ago

@tseemann, @jacarrico you are welcome!