Closed lsterck closed 6 years ago
This k-mer coverage profile is atypical, unless it's for a transcriptome, metagenome, or some other non-uniform sequencing depth protocol. "Minimum k-mer coverage" is the first local minimum of the k-mer coverage histogram. Starting from there, it sets the k-mer coverage threshold to the square root of the median k-mer coverage above the previous threshold, and iterates.
Hi Shaun,
Thanks for the reply, and apologies for thinking you might have overlooked this :-). However, your answer also worries me a little. Perhaps I'd better bring this up on BioStar?
Hi Shaun,
Sorry, but I don't really get this, I'm afraid :-/ . What exactly do you mean by "first local minimum"? If I look at the coverage.hist file of this run, I can find lower local minima than the one reported. Also: why is this an atypical profile? I'm doing a (large and highly repetitive) plant genome. From some trial runs I noticed that this value changes with the k I'm using: up to k=64 I get 2, and from k=77 on I get these high values (>330). I have now selected what I think (based on some metrics) is an appropriate k for my full run, but your answer makes me worry that I might have chosen the wrong one?
I'm happy chatting here on GitHub.
In the k-mer coverage profile, 380 should be the first local minimum. If not, something has gone awry. You could gist your coverage.hist file if you like, and I'll take a look at it. What ploidy is your genome?
NtCard is a useful tool for calculating the k-mer coverage profiles for many values of k very quickly. You can feed its output to GenomeScope (with a little reformatting) to estimate parameters of your genome (genome size, heterozygosity, repetitiveness), and inform your selection of k.
OK, ;-)
So how do you define a "local minimum" then? Sweet, I'll do that: https://gist.github.com/lsterck/199ca284c225df75b45a08bef3aee414 It should be a normal diploid organism (even mainly the same haplotype).
Yes, benv has also pointed me to those tools. I will have a look at them.
I had to look at the code to refresh my memory. I wrote this code nine years ago! https://github.com/bcgsc/abyss/blob/master/Common/Histogram.h#L193 The local minimum uses some smoothing: the value of the local minimum must be smaller than the next four values. Except for this smoothing, it looks as though your local minimum occurs at c=2, and your mode coverage occurs at c=3. Anecdotally, your mode should be above c=10; I'd love to make a rigorous analysis of that observation one day. I'd suggest decreasing your value of k to increase your mode k-mer coverage.
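Putting the two descriptions in this thread together (a first local minimum found with a four-value smoothing window, then repeatedly setting the threshold to the square root of the median coverage above it), the procedure could be sketched roughly as below. This is hypothetical Python for illustration, not the actual Histogram.h code, and the convergence test is my own assumption:

```python
import math

def first_local_minimum(hist, window=4):
    """First coverage value whose count is smaller than the counts at the
    next `window` coverage values (the smoothing described above)."""
    cov = sorted(hist)
    for i, c in enumerate(cov[:-window]):
        if all(hist[c] < hist[cov[i + j]] for j in range(1, window + 1)):
            return c
    return None

def median_coverage_above(hist, threshold):
    """Weighted median k-mer coverage, counting only coverages >= threshold."""
    items = [(c, n) for c, n in sorted(hist.items()) if c >= threshold]
    total = sum(n for _, n in items)
    running = 0
    for c, n in items:
        running += n
        if 2 * running >= total:
            return c

def coverage_threshold(hist):
    """Start at the first local minimum, then repeatedly set the threshold
    to sqrt(median coverage above it) until it stabilizes (my assumption
    about the stopping rule)."""
    t = first_local_minimum(hist)
    while True:
        t_next = math.sqrt(median_coverage_above(hist, t))
        if abs(t_next - t) < 1:
            return t_next
        t = t_next
```

On a typical whole-genome histogram (big error bump at low coverage, genomic bump at higher coverage), `first_local_minimum` lands in the valley between the two bumps.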
Hi Shaun,
Sorry to put you through extra work (and revisiting old code) :-). Yes, I figured that you would apply some smoothing, and I totally agree that's the right thing to do! However, I must admit you kind of lost me in your reasoning on the c value. Given the smoothing, I do suspect the first minimum is indeed around 380 or so (i.e. what's reported in the ABySS run).
In the meanwhile I went to work with ntCard and the GenomeScope website. For every coverage.hist file (different k) I provide it, the model fails to converge, so I get no sensible output from GenomeScope. The only time I did get some results was with a coverage file from an older run (years old, though not as old as your code :-) ), which used a different subset of the data than what I'm using now; here is the result: genomescope old run. That run used more data than what I'm currently using. I now have a coverage file for my full run (k=85), but that also fails to give results in GenomeScope. I'm now looking into whether some specific data file(s) mess up the coverage profiles (I'm especially thinking of the merged single-end files?). I chose k=85 based on an abyss-fac comparison of several runs (different k) up to the unitig stage:
n | n:200 | L50 | LG50 | NG50 | min | N80 | N50 | N20 | E-size | max | sum | name |
---|---|---|---|---|---|---|---|---|---|---|---|---|
145.1e6 | 20.57e6 | 3839013 | 20.57e6 | 200 | 200 | 303 | 677 | 1920 | 1254 | 20853 | 11.03e9 | k56/ppinTk56-unitigs.fa |
114.2e6 | 22.39e6 | 4015681 | 20.8e6 | 211 | 200 | 323 | 774 | 2078 | 1332 | 21489 | 12.85e9 | k64/ppinTk64-unitigs.fa |
91.92e6 | 24.08e6 | 4339969 | 16.67e6 | 266 | 200 | 333 | 821 | 2064 | 1317 | 22204 | 14.21e9 | k72/ppinTk72-unitigs.fa |
80.73e6 | 26.02e6 | 4630287 | 13.29e6 | 349 | 200 | 341 | 867 | 2124 | 1349 | 26448 | 15.77e9 | k80/ppinTk80-unitigs.fa |
71.93e6 | 26.81e6 | 4903128 | 12.75e6 | 378 | 200 | 344 | 860 | 2040 | 1301 | 30356 | 16.24e9 | k85/ppinTk85-unitigs.fa |
67.31e6 | 27.25e6 | 5086067 | 12.69e6 | 389 | 200 | 345 | 849 | 1975 | 1264 | 30356 | 16.44e9 | k88/ppinTk88-unitigs.fa |
56.8e6 | 28.29e6 | 5622730 | 13.11e6 | 398 | 200 | 344 | 803 | 1780 | 1153 | 27313 | 16.67e9 | k96/ppinTk96-unitigs.fa |
Here's an example of the last GenomeScope plot that I did for a 10 GB plant: http://qb.cshl.edu/genomescope/analysis.php?code=Tt3dme47ZFnwHPqVkqqT
What you want to see is pretty good separation between the good (genomic) bump and the sequencing-error bump.
Which tool did you use to merge the paired-end reads? For the purpose of plotting k-mer coverage profiles, I'd suggest using the raw reads before merging overlapping reads.
You can specify a comma-separated list of values of k to ntCard. For example:
ntcard -t16 -c1000 -k 24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144 -p ntcard *.fq.gz
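The "little reformatting" for GenomeScope mentioned earlier is mainly dropping ntCard's summary rows so that only two-column "coverage count" rows remain. Assuming the `-p ntcard` run above writes files named like `ntcard_k64.hist` that begin with `F0`/`F1` summary lines (check your own files, since the layout may differ between ntCard versions), something like:

```shell
# Strip ntCard's F0/F1 summary rows, keeping only the numeric
# "coverage  count" rows that GenomeScope expects.
# (Assumed ntCard .hist layout and file naming; verify against your files.)
for f in ntcard_k*.hist; do
    [ -e "$f" ] || continue   # no-op when no files match the glob
    grep -v '^F' "$f" > "${f%.hist}.genomescope.hist"
done
```

The resulting `*.genomescope.hist` files can then be uploaded to the GenomeScope website one at a time.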
I'd suggest plotting these curves, and see if any of them give you better separation. Small values of k can be useful for GenomeScope, even if they're not good for assembly.
Do you expect your individual to be very heterozygous? GenomeScope estimated het=0.115%, so it doesn't appear so. What's your estimated genome size, what's the ploidy, and are there any recent known whole genome duplications or hybridizations?
I see, that one indeed seems "normal". I used BBMerge. Do you then mean "cleaned/trimmed" but not merged, or not even cleaned?
Ah, OK, so I can set a different k for GenomeScope than for assembly? What does ABySS use to do this coverage evaluation? Is it the one I set for assembly (the -k option)?
Yes and no :-) . The species itself is most likely quite heterozygous, but the DNA should (for the biggest fraction) come from the same haplotype, so much less heterozygosity is expected in my sequence data. The estimated genome size is roughly around 25 Gb, diploid (2N), and no, not really.
> Ah, OK, so I can set a different k for GenomeScope than for assembly?
Yes, for the purpose of estimating genome size, heterozygosity, and repeat content of your genome.
> What does ABySS use to do this coverage evaluation? Is it the one I set for assembly (the -k option)?
Yes, ABySS uses the k-mer profile of the assembly k.
> Yes and no :-) . The species itself is most likely quite heterozygous, but the DNA should (for the biggest fraction) come from the same haplotype, so much less heterozygosity is expected in my sequence data. The estimated genome size is roughly around 25 Gb, diploid (2N), and no, not really.
Oh, neat. I was looking for an explanation for why the error hump is so fat. It could be the other rejected haplotype bleeding through. I'm curious; what technique was used to enrich for one haplotype?
What's your estimated sequencing depth? That is, number of reads * read length / estimated genome size.
> Oh, neat. I was looking for an explanation for why the error hump is so fat. It could be the other rejected haplotype bleeding through. I'm curious; what technique was used to enrich for one haplotype?
It might indeed be, to some extent. Ah, that's wet-lab stuff: from what I understand, the DNA was extracted from callus tissue from a single gametophyte (or something like that :-) ). So it's not really "enriched"; there is only one haplotype present.
> What's your estimated sequencing depth? That is, number of reads * read length / estimated genome size.
To be honest, I did not exactly count that :-/ . I have roughly 10 Tbase of data, so I estimate around 30-40x.
Did you mean 1 Tbase of data? 30–40 fold coverage is a bit on the low side for assembly of a diploid organism. You may be able to get away with lower coverage since you have haploid tissue. I usually recommend 50x to 100x for assembly of a diploid genome. 30–40 fold coverage is enough to get an assembly, but sequencing coverage gaps may limit contiguity.
> Do you then mean "cleaned/trimmed" but not merged, or not even cleaned?
Trimming for adapters is fine, though I didn't notice that it made much difference to GenomeScope. I usually don't trim for quality values, unless I need to in order to reduce the memory usage of ABySS, which may be relevant for a 25 Gbp genome. Have you seen that ABySS 2.0 reduces memory usage by about 10-fold over ABySS 1.x?
> Did you mean 1 Tbase of data? 30–40 fold coverage is a bit on the low side for assembly of a diploid organism. You may be able to get away with lower coverage since you have haploid tissue. I usually recommend 50x to 100x for assembly of a diploid genome. 30–40 fold coverage is enough to get an assembly, but sequencing coverage gaps may limit contiguity.
Oops, yes indeed, 1 Tbase ;-) . Yes, I always suspected it was on the low end, unfortunately, but I also put my hope in the haploid-tissue "advantage". I looked up the counts on my raw reads, which add up to ~2 Tbase of sequence (but uncleaned).
2 Tbase of raw coverage of a 25 Gbp genome is 80 fold raw coverage, which is just fine, particularly for haploid tissue.
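To spell out the depth arithmetic from above (total sequenced bases divided by estimated genome size):

```python
def estimated_depth(total_bases, genome_size):
    """Sequencing depth = total sequenced bases / genome size
    (equivalently, number of reads * read length / genome size)."""
    return total_bases / genome_size

# ~2 Tbase of raw reads over a ~25 Gbp genome:
print(estimated_depth(2e12, 25e9))  # -> 80.0
```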
thanks for reassuring me (and getting my hopes up :-) )
To get back to the beginning of this thread: do you think there is now something off in the run I'm doing now (using k=85)? Should I be concerned about this "high" minimum k-mer coverage?
Also: if I can use a different k for GenomeScope than for assembly, what can I then learn from that output (k-wise, that is)?
Will I be better off if I provide the data "unmerged", simply as single-end files?
> To get back to the beginning of this thread: do you think there is now something off in the run I'm doing now (using k=85)?
Yes, I do think something is off with the data. Figuring out exactly what that is may not be easy, or even possible, and learning the root issue may or may not help you get a better assembly.
> Should I be concerned about this "high" minimum k-mer coverage?
Yes, it may not produce a very contiguous assembly. I'd also be concerned that GenomeScope fails to converge.
> Also: if I can use a different k for GenomeScope than for assembly, what can I then learn from that output (k-wise, that is)?
A different value of k for GenomeScope may be better for estimating parameters of the genome, like genome size and heterozygosity. Smaller values of k may be better for GenomeScope, while larger values may be better for ABySS.
I suggest looking at the ntCard profiles and seeing whether any value of k gives a typical-looking curve.
Is it OK to assume that, as long as GenomeScope converges for a certain k, it's OK to use that k for ABySS as well (given that it also gives a more or less typical-looking curve)?
Okay in that you'll likely get an assembly, yes, but it may not be the optimal value of k. I'd really like to be able to better interpret the output of GenomeScope to predict which values of k are likely to produce the best assembly.
I think I might have traced the "problem" back to erroneous k-mers derived from the MP libraries. Does that sound familiar? If I use the same k in a GenomeScope estimation, I can see the "error" bump increasing as I gradually add the MP data, so I figured it must be that. Why did I add MP data at the unitig stage, you might ask? I added it as SE reads for the unitig and contig stages; I was under the impression that's a common/logical thing to do. However, it turns out I might have to do serious additional cleaning on the MP data (even for use in SE mode?).
Are they Nextera mate pair libraries? If so, you definitely need to run nxTrim first. Even just to use them for scaffolding, but definitely if you plan to use them for sequence in the de Bruijn graph. If they're not Nextera mate pair libraries, I do not recommend using them for de Bruijn graph sequence, due to the chimeric circular reads. See https://github.com/sjackman/abyss-drosophila-melanogaster#readme
Thanks. Yes, I stumbled on the nxTrim pre-processing approach myself. I'm now digging up the metadata for the libraries to see the kit specifications. If they are Nextera, I'll certainly give nxTrim a go.
An interesting read! Is that a homozygous dataset or not? I currently have a similar read-data setup (100 bp), and I was kind of surprised to see your optimal k value being 48. Perhaps I have to enlarge my optimisation range then (I started at around 60, as I was under the impression the lowest k was 2/3 of the read length :/ ).
I believe that data set was diploid.
There is no hard lower limit for k. In my experience, with typical data sets, the optimal value of k is often around 2/3 the read length, but it varies widely depending on the data set. Your best bet is to try a few different assemblies and pick the best k based on the largest NG50 (if you don't have a close reference to estimate assembly correctness).
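Since NG50 is the metric suggested here, a minimal sketch of how it differs from N50 may help: NG50 is the smallest contig length L such that contigs of length >= L cover at least half the *estimated genome size*, whereas N50 uses half the total assembly size. This is an illustrative implementation, not abyss-fac's actual code:

```python
def ng50(contig_lengths, genome_size):
    """NG50: smallest contig length L such that contigs of length >= L
    sum to at least half the estimated genome size.
    Returns None if the assembly covers less than half the genome."""
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= genome_size:
            return length
    return None

# Toy example: genome size 100; the two largest contigs (40 + 30 = 70)
# already cover half the genome, so NG50 = 30.
print(ng50([10, 40, 30, 20], 100))  # -> 30
```

For N50, you would replace `genome_size` with `sum(contig_lengths)`; when the assembly size is close to the genome size, the two metrics agree.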
Hi,
I was going through the runtime output of ABySS and stumbled on this phrase: "Minimum k-mer coverage is 380". Here is the context:
If I interpret this correctly, it tries to find the coverage threshold to use, and the table seems to indicate it starts from the max coverage and then goes down? What exactly does this mean, and is it possible it should read "Maximum k-mer coverage"?
thanks. L.