ablab / spades

SPAdes Genome Assembler
http://ablab.github.io/spades/

Contig coverage less than 1? #32

Closed biocyberman closed 6 years ago

biocyberman commented 6 years ago

I ran SPAdes 3.10 as follows: SPAdes-3.10.0-Linux/bin/spades.py -t 10 -k21,31,51,71 -o spadeswref --trusted-contigs ./virusgt1.fasta -s Hepacivirus.s.fastq --iontorrent

The longest contig I got has this header: >NODE_1_length_9615_cov_0.109807 There is no 'N' gap in the contig. My question is: how can a contig have coverage less than 1, here about 0.1? Does it mean SPAdes takes sequences from the trusted contigs in virusgt1.fasta?

In a similar run, the longest contig I got was around 3 kb, while the trusted contig is about 10 kb. I therefore thought SPAdes does not incorporate bases from the trusted contigs. Am I wrong?

Essentially, I want to try to use SPAdes as a reference-assisted de novo assembler.

asl commented 6 years ago

Hello

My question is: how can a contig have coverage less than 1, here about 0.1? Does it mean SPAdes takes sequences from the trusted contigs in virusgt1.fasta?

SPAdes reports k-mer coverage for the last (largest) k value used, here k=71. So coverage below 1 indicates that the contig was assembled using either shorter k-mers or nucleotides from the trusted contigs.
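To spot such contigs quickly, the coverage value can be read straight from the FASTA headers. The following is a minimal sketch, assuming headers follow the NODE_<id>_length_<len>_cov_<cov> pattern seen in this thread; the names parse_header, low_coverage_contigs, and ContigInfo are made up for illustration and are not part of SPAdes.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Matches headers such as ">NODE_1_length_9615_cov_0.109807".
# Assumption: the header layout is the one shown in this thread.
HEADER_RE = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")


class ContigInfo(NamedTuple):
    node: int
    length: int
    kmer_cov: float  # k-mer coverage for the largest k, not per-base read coverage


def parse_header(header: str) -> Optional[ContigInfo]:
    """Return the fields of a SPAdes-style header, or None if it does not match."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        return None
    return ContigInfo(int(match.group(1)), int(match.group(2)), float(match.group(3)))


def low_coverage_contigs(fasta_lines: Iterable[str],
                         threshold: float = 1.0) -> Iterator[ContigInfo]:
    """Yield contigs whose reported k-mer coverage is below the threshold."""
    for line in fasta_lines:
        if line.startswith(">"):
            info = parse_header(line)
            if info is not None and info.kmer_cov < threshold:
                yield info


if __name__ == "__main__":
    example = [">NODE_1_length_9615_cov_0.109807", "ACGT"]
    for info in low_coverage_contigs(example):
        print(info)
```

A contig flagged this way has almost no support at the largest k, so its sequence should be checked against the reads before it is trusted.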

Essentially I want to try and use spades as a reference-assisted de novo assembler

You should not. Per the SPAdes manual (http://cab.spbu.ru/files/release3.11.0/manual.html#sec3.2):

Additional contigs

In case you have contigs of the same genome generated by other assembler(s) and you wish to merge them into SPAdes assembly, you can specify additional contigs using --trusted-contigs or --untrusted-contigs. First option is used when high quality contigs are available. These contigs will be used for graph construction, gap closure and repeat resolution. Second option is used for less reliable contigs that may have more errors or contigs of unknown quality. These contigs will be used only for gap closure and repeat resolution. The number of additional contigs is unlimited.

Note, that SPAdes does not perform assembly using genomes of closely-related species. Only contigs of the same genome should be specified.
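As a rough illustration of the two options in the manual excerpt, here is a small sketch that only builds a spades.py argument list and does not run anything. The function name build_spades_command and the file name other_assembler_contigs.fasta are hypothetical; the flags are the ones that appear in this thread and in the quoted manual text.

```python
from typing import List, Optional, Sequence


def build_spades_command(reads: str,
                         out_dir: str,
                         kmers: Sequence[int],
                         threads: int = 10,
                         trusted: Optional[str] = None,
                         untrusted: Optional[str] = None,
                         iontorrent: bool = False,
                         spades: str = "spades.py") -> List[str]:
    """Assemble an argument list for spades.py (hypothetical helper)."""
    cmd = [spades, "-t", str(threads),
           "-k", ",".join(str(k) for k in kmers),
           "-o", out_dir]
    if trusted:
        # High-quality contigs of the SAME genome: used for graph
        # construction, gap closure and repeat resolution.
        cmd += ["--trusted-contigs", trusted]
    if untrusted:
        # Less reliable contigs of the SAME genome: used only for
        # gap closure and repeat resolution.
        cmd += ["--untrusted-contigs", untrusted]
    cmd += ["-s", reads]
    if iontorrent:
        cmd.append("--iontorrent")
    return cmd


if __name__ == "__main__":
    # Hypothetical input: contigs of the same sample from another assembler.
    print(" ".join(build_spades_command(
        reads="Hepacivirus.s.fastq",
        out_dir="spadeswref",
        kmers=[21, 31, 51, 71],
        untrusted="other_assembler_contigs.fasta",
        iontorrent=True,
    )))
```

In both cases the extra contigs are expected to come from the same genome as the reads, not from a related reference.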

biocyberman commented 6 years ago

@asl Thanks for your clarifications.

So, the coverage less than 1 indicates that the contig was assembled using either shorter k-mers or nucls from the trusted contigs.

Please correct me if I am wrong:

The trusted contig I used is a full HCV genome downloaded from NCBI, and the sample I am working with is also an HCV sample. That means the trusted contig is roughly 97% or more identical to my data. Is this a valid use of a trusted contig? I guess not, because there are still gaps that SPAdes tries to fill with nucleotides from the trusted contig, which is cheating in this case.

However, if the BAM file generated by alignment shows that there are no gaps, can I use SPAdes as a tool to generate a consensus sequence? I would think this is valid because, with enough coverage, the 'trusted-contigs' will be outweighed in B mismatch resolution.

asl commented 6 years ago

The trusted contig I used is a full HCV genome downloaded from NCBI, and the sample I am working with is also an HCV sample. That means the trusted contig is roughly 97% or more identical to my data. Is this a valid use of a trusted contig?

No

However, if the BAM file generated by alignment shows that there are no gaps, can I use SPAdes as a tool to generate a consensus sequence?

No

biocyberman commented 6 years ago

How deadly it ends, but thanks :)