Closed tseemann closed 7 years ago
'a positive float number' is explained in the help output.
What is the meaning of 0--1.0 and over 1.0?
See http://cab.spbu.ru/files/release3.10.1/manual.html#sec3.5, which explains what the reported coverage is.
Is cov-cutoff used to filter contig output? Is it not used to filter FASTQ input by k-mer coverage?
Here is what the manual section 3.5 says:
Contigs/scaffolds names in SPAdes output FASTA files have the following format:
>NODE_3_length_237403_cov_243.207_ID_45
Here 3 is the number of the contig/scaffold, 237403 is the sequence length in nucleotides and 243.207 is the k-mer coverage for the last (largest) k value used.
Note that the k-mer coverage is always lower than the read (per-base) coverage.
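To make the header format quoted above concrete, here is a minimal Python sketch that pulls the contig number, length and k-mer coverage out of such a name. The function name `parse_spades_header` is made up for illustration, and treating the trailing `_ID_` field as optional is an assumption, not something the manual excerpt states.

```python
import re

# Pattern for the header format quoted from the manual. The trailing _ID_
# field is treated as optional here, which is an assumption.
HEADER_RE = re.compile(
    r"^>?NODE_(?P<number>\d+)"
    r"_length_(?P<length>\d+)"
    r"_cov_(?P<cov>\d+(?:\.\d+)?)"
    r"(?:_ID_(?P<id>\d+))?"
)


def parse_spades_header(header):
    """Return contig number, length (nt) and k-mer coverage from a FASTA header."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"not a SPAdes-style contig name: {header!r}")
    return {
        "number": int(match.group("number")),
        "length": int(match.group("length")),
        "kmer_cov": float(match.group("cov")),
    }


info = parse_spades_header(">NODE_3_length_237403_cov_243.207_ID_45")
print(info["number"], info["length"], info["kmer_cov"])
```

The `kmer_cov` value is the k-mer coverage for the largest k used, as the manual says, so it should not be read as per-base read coverage.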
Is the only way to get k-mer coverage < 1 to have a contig that is shorter than k_max?
(This can happen in a section of a de Bruijn graph when breaking into contigs.)
Is the --cov-cutoff of SPAdes applied after assembly? People may want a low k-mer coverage filter before assembly, to speed up the assembly.
kmer-mask is the tool that I wanted, but there are some problems: a) meryl is slower than Jellyfish and uses too much memory (with many threads); b) some bugs need to be fixed for big FASTQ/FASTA files. (I have a dirty patch (uint32 -> uint64), but it does not seem to be active.)
Dear @tseemann
SPAdes iteratively increases the value of K and additionally tries to glue together potentially broken regions using paired-read mapping and by searching for small overlaps.
Both of these procedures add k-mers that have coverage 0, since they are not present in the reads.
Also, if Ns are introduced into scaffolds, the total length of the scaffold may increase while its average k-mer coverage decreases.
I hope this explains the appearance of average k-mer coverage < 1.0 in the results.
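A toy calculation may help show how the average can drop below 1.0. The numbers below are invented for illustration, and the plain arithmetic mean is an assumption about how the per-contig average is formed, not a description of SPAdes internals.

```python
def average_kmer_coverage(kmer_coverages):
    """Plain arithmetic mean of per-k-mer coverage values along a contig."""
    if not kmer_coverages:
        raise ValueError("need at least one k-mer coverage value")
    return sum(kmer_coverages) / len(kmer_coverages)


# Hypothetical contig: 10 k-mers actually seen in reads (coverage 2 each)
# plus 30 k-mers added by gap closing or overlap joining (coverage 0 each).
observed = [2.0] * 10
added_without_read_support = [0.0] * 30

print(average_kmer_coverage(observed))                               # 2.0
print(average_kmer_coverage(observed + added_without_read_support))  # 0.5
```

Every k-mer that is present in the reads has coverage of at least 1, so only the zero-coverage k-mers can pull the average under 1.0.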
On the other hand, as far as I know SPAdes should not produce contigs shorter than k_max.
Dear @wangyugui
Is the --cov-cutoff of SPAdes applied after assembly?
Yes and no. It happens after the assembly graph is constructed (and after most graph simplification procedures have finished). But the low-coverage edges are actually removed from the graph, so the remaining unambiguous paths are compressed and the removed edges do not interfere with subsequent repeat resolution and scaffolding.
The value auto is compatible only with the uniform coverage model (no --meta or --mda flags).
In this case the threshold is set automatically from a probabilistic model trained on the k-mer frequency histogram, and the value is chosen independently for every iteration.
If the value is provided manually, it is interpreted as an "average nucleotide coverage" and is multiplied by (RL - K)/RL to get a threshold on average k-mer coverage for the assembly iteration with k-mer size K.
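The scaling described above can be written out as a short Python sketch. It assumes RL means the read length, which the reply does not spell out; the function name and the example values (cutoff 5.0, 150 bp reads, the listed k values) are illustrative only.

```python
def kmer_cov_threshold(nucleotide_cutoff, read_length, k):
    """Scale a manual cutoff by (RL - K) / RL, as described in the reply.

    read_length is assumed to be what the reply calls RL.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    if not 0 < k < read_length:
        raise ValueError("k must be between 0 and read_length")
    return nucleotide_cutoff * (read_length - k) / read_length


# Hypothetical run: manual cutoff of 5.0 with 150 bp reads.
for k in (21, 33, 55, 77):
    print(k, round(kmer_cov_threshold(5.0, 150, k), 3))
```

With these numbers the effective k-mer threshold shrinks from 4.3 at K=21 to about 2.433 at K=77, so the same manual value becomes a lower k-mer coverage threshold as K grows.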
Dear @tseemann, I hope this answers your initial question, and I would be glad to provide any clarifications.
People may want a low k-mer coverage filter before assembly, to speed up the assembly.
We are considering adding this option in the future, but currently you would have to set up your own pre-processing pipeline.
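As a rough idea of what such a pre-processing step could look like, here is a toy in-memory Python sketch that drops reads whose median k-mer count is below a threshold. It is not how SPAdes, kmer-mask, meryl or Jellyfish work; all names and the median rule are assumptions for illustration, and a real data set would need a dedicated k-mer counter rather than a Python `Counter`.

```python
from collections import Counter
from statistics import median

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(seq, k):
    """Yield canonical k-mers of an upper-case read, skipping any that contain N."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            yield canonical(kmer)


def filter_reads(reads, k, min_median_count):
    """Keep reads whose median k-mer count over the whole input reaches the threshold."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    kept = []
    for read in reads:
        read_counts = [counts[kmer] for kmer in kmers(read, k)]
        if read_counts and median(read_counts) >= min_median_count:
            kept.append(read)
    return kept


# Three copies of a well-supported read and one read seen only once.
reads = ["ACGTACGTAC"] * 3 + ["TTTTGGGGCC"]
print(filter_reads(reads, k=5, min_median_count=2))
```

Using the median rather than the minimum is a design choice in this sketch: a read with a single sequencing error still has mostly well-supported k-mers and is kept.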
@snurk thank you very much for responding with such detail to our questions. I'll pass this page on to the bacterial genomics community. And thank you for continuing to develop SPAdes.
@tseemann We will try to explain the k-mer coverage model SPAdes uses, if time permits. Though it's already used inside kmergenie :)
Yes, thanks a lot for the answers and for developing SPAdes! This is fundamental for the kind of work we have been doing, which includes certification of pipelines using SPAdes.
@tseemann, @jacarrico you are welcome!
The cov-cutoff parameter remains a mystery to the SPAdes user community. It used to be auto and now it is off. Would it be possible to add an explanation of it to the documentation?
A common result is getting contigs with coverage < 1.0.