bcgsc / abyss

:microscope: Assemble large genomes using short reads
http://www.bcgsc.ca/platform/bioinfo/software/abyss

lapsus in abyss runtime output? #181

Closed lsterck closed 6 years ago

lsterck commented 6 years ago

Hi,

I was going through the runtime output of ABySS and stumbled on this phrase: "Minimum k-mer coverage is 380". Here is the context:

    12: Hash load: 481038772 / 1073741824 = 0.448 using 20.7 GB
    Loaded 24053974559 k-mer. At least 962 GB of RAM is required.
    Minimum k-mer coverage is 380
    0: Coverage: 380        Reconstruction: 7293863
    0: Coverage: 26.3       Reconstruction: 160202733
    0: Coverage: 6.63       Reconstruction: 3326829548
    0: Coverage: 2.83       Reconstruction: 14105558387
    0: Coverage: 2.24       Reconstruction: 16732778217
    0: Coverage: 2  Reconstruction: 16732778217
    Using a coverage threshold of 2...

If I interpret this correctly, it tries to find the coverage threshold to use, and the table seems to indicate it starts from the maximum coverage and then goes down? What exactly does this mean, and is it possible it should read "Maximum k-mer coverage"?

Thanks. L.

sjackman commented 6 years ago

This k-mer coverage profile is atypical, unless it's for a transcriptome, a metagenome, or some other protocol with non-uniform sequencing depth. "Minimum k-mer coverage" is the first local minimum of the k-mer coverage histogram. Starting from there, it sets the k-mer coverage threshold to the square root of the median k-mer coverage above the previous threshold, and iterates.
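
In rough Python terms, that iteration might look like this (a paraphrase of the description above, not the actual ABySS code; the weighted-median helper, the floor of 2, and the stopping rule are assumptions for illustration):

    from math import sqrt

    # Sketch only: `hist` maps k-mer coverage -> number of distinct k-mers with that
    # coverage, and `first_minimum` is the "minimum k-mer coverage" reported in the log.
    def weighted_median_coverage(hist, threshold):
        """Median coverage of the k-mers whose coverage is at least `threshold`."""
        items = sorted((c, n) for c, n in hist.items() if c >= threshold)
        total = sum(n for _, n in items)
        running = 0
        for coverage, count in items:
            running += count
            if running * 2 >= total:
                return coverage
        return None  # no k-mers left above the threshold

    def coverage_threshold(hist, first_minimum):
        threshold = float(first_minimum)
        for _ in range(100):  # in practice the threshold settles after a handful of rounds
            median = weighted_median_coverage(hist, threshold)
            if median is None:
                break
            # Square root of the median coverage above the previous threshold;
            # the floor of 2 matches the log above but is an assumption here.
            new_threshold = max(2.0, sqrt(median))
            if abs(new_threshold - threshold) < 0.01:
                break
            threshold = new_threshold
        return threshold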

lsterck commented 6 years ago

Hi Shaun,

Thanks for the reply, and apologies for thinking you might have overlooked this :-). However, your answer also worries me a little, but perhaps I'd better bring this up on BioStar?

lsterck commented 6 years ago

Hi Shaun,

Sorry, but I don't really get this, I'm afraid :-/ . What exactly do you mean by "first local minimum"? If I look at the coverage.hist file of this run, I can find lower local minima than the one reported. Also: why is this an atypical profile? I'm doing a (large and highly repetitive) plant genome. From some trial runs I noticed that this value changes with the k-mer size I'm using: up until k=64 I get 2, and from k=77 on I get these high values (>330). I've now selected a k-mer size for my full run that I think is appropriate based on some metrics, but your answer makes me worry that I might have chosen the wrong one.

sjackman commented 6 years ago

I'm happy chatting here on GitHub.

In the k-mer coverage profile, 380 should be the first local minimum. If not, something has gone awry. You could gist your coverage.hist file if you like, and I'll take a look at it. What ploidy is your genome?

ntCard is a useful tool for calculating k-mer coverage profiles for many values of k very quickly. You can feed its output to GenomeScope (with a little reformatting) to estimate parameters of your genome (genome size, heterozygosity, repetitiveness) and inform your selection of k.
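
The "little reformatting" mostly amounts to stripping ntCard's summary rows so that only the plain two-column coverage histogram that GenomeScope expects remains. A minimal sketch, assuming the ntCard .hist layout puts non-numeric summary rows (such as F1/F0) before the numeric ones:

    # Hypothetical helper (file names and the ntCard layout are assumptions): keep only
    # the numeric "coverage <tab> count" rows of an ntCard .hist file, skipping summary
    # rows such as "F1" and "F0", so the result can be uploaded to GenomeScope.
    import sys

    def reformat(ntcard_hist, genomescope_hist):
        with open(ntcard_hist) as src, open(genomescope_hist, "w") as dst:
            for line in src:
                fields = line.split()
                if len(fields) == 2 and fields[0].isdigit():
                    dst.write(f"{fields[0]}\t{fields[1]}\n")

    if __name__ == "__main__":
        reformat(sys.argv[1], sys.argv[2])

For example: python reformat_hist.py ntcard_k85.hist k85_genomescope.hist (both file names are hypothetical).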

lsterck commented 6 years ago

OK, ;-)

So how do you define a 'local minimum' then? Sweet, I'll do that: https://gist.github.com/lsterck/199ca284c225df75b45a08bef3aee414 . It should be a normal diploid organism (mostly even the same haplotype).

Yes, benv has also pointed me to those tools. I will have a look at them.

sjackman commented 6 years ago

I had to look at the code to refresh my memory. I wrote this code nine years ago! https://github.com/bcgsc/abyss/blob/master/Common/Histogram.h#L193 The local-minimum detection uses some smoothing: the value at the local minimum must be smaller than the next four values. Except for this smoothing, it looks as though your local minimum occurs at c=2, and your mode coverage occurs at c=3. In my experience (anecdotal evidence only), your mode should be above c=10; I'd love to make a rigorous analysis of that observation one day. I'd suggest decreasing your value of k to increase your mode k-mer coverage.
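
In rough Python terms (a rendering of the rule just described, not a line-for-line translation of Histogram.h), the check is something like:

    def first_local_minimum(counts, smoothing=4):
        """counts[c] = number of distinct k-mers seen with coverage c (index 0 unused).
        A coverage value is accepted as the first local minimum only if its count is
        smaller than the counts at the next `smoothing` coverage values."""
        for c in range(1, len(counts) - smoothing):
            if all(counts[c] < counts[c + j] for j in range(1, smoothing + 1)):
                return c
        return None  # no local minimum found

That smoothing is presumably why the run reports 380 even though the raw histogram dips earlier.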

lsterck commented 6 years ago

Hi Shaun,

Sorry to put you through extra work (and revisiting old code) :-) . Yes, I figured you would apply some smoothing, and I totally agree that's the right thing to do! However, I must admit you kind of lost me in your reasoning on the c value. Given the smoothing, I do suspect the first minimum is indeed around 380 or so (i.e. what's reported in the ABySS run).

In the meanwhile I went to work with ntCard and the GenomeScope website. For every coverage.hist file (different k-mers) I provide, it fails to converge and I thus get no sensible output from GenomeScope. The only time I did get some results was with a coverage file from an older run (years old, though not as old as your code :-) ) which used a different subset of the data than what I'm using now; here is the result: genomescope old run. That was more data than what I'm currently using. I now have a coverage file of my full run (k-mer 85), but that also fails to give results on GenomeScope.

I'm now looking into whether some specific data file(s) mess up the coverage profiles (I'm especially thinking of the merged single-end files?). I chose k-mer 85 based on comparing the abyss-fac output of several runs (different k-mers) up to the unitig stage:

    n        n:200    L50      LG50     NG50  min  N80  N50  N20   E-size  max    sum      name
    145.1e6  20.57e6  3839013  20.57e6  200   200  303  677  1920  1254    20853  11.03e9  k56/ppinTk56-unitigs.fa
    114.2e6  22.39e6  4015681  20.8e6   211   200  323  774  2078  1332    21489  12.85e9  k64/ppinTk64-unitigs.fa
    91.92e6  24.08e6  4339969  16.67e6  266   200  333  821  2064  1317    22204  14.21e9  k72/ppinTk72-unitigs.fa
    80.73e6  26.02e6  4630287  13.29e6  349   200  341  867  2124  1349    26448  15.77e9  k80/ppinTk80-unitigs.fa
    71.93e6  26.81e6  4903128  12.75e6  378   200  344  860  2040  1301    30356  16.24e9  k85/ppinTk85-unitigs.fa
    67.31e6  27.25e6  5086067  12.69e6  389   200  345  849  1975  1264    30356  16.44e9  k88/ppinTk88-unitigs.fa
    56.8e6   28.29e6  5622730  13.11e6  398   200  344  803  1780  1153    27313  16.67e9  k96/ppinTk96-unitigs.fa

sjackman commented 6 years ago

Here's an example of the most recent GenomeScope plot that I did, for a 10 Gbp plant genome: http://qb.cshl.edu/genomescope/analysis.php?code=Tt3dme47ZFnwHPqVkqqT

What you want to see is pretty good separation between the genomic (good) bump and the sequencing-error bump.

Which tool did you use to merge the paired-end reads? For the purpose of plotting k-mer coverage profiles, I'd suggest using the raw reads before merging overlapping reads.

You can specify a comma-separated list of values of k to ntCard. For example:

    ntcard -t16 -c1000 -k 24,32,40,48,56,64,72,80,88,96,104,112,120,128,136,144 -p ntcard *.fq.gz

I'd suggest plotting these curves and seeing whether any of them gives you better separation. Small values of k can be useful for GenomeScope, even if they're not good for assembly.
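
If it helps, here is a rough Python sketch for overlaying those profiles (it assumes ntCard writes one ntcard_k<k>.hist file per value of k for the -p ntcard prefix above, and reuses the numeric-row filtering from the earlier reformatting sketch):

    # A rough sketch for eyeballing the separation between the error peak and the
    # genomic peak: overlay the per-k histograms written by the ntcard command above.
    import glob
    import matplotlib.pyplot as plt

    for path in sorted(glob.glob("ntcard_k*.hist")):
        coverage, count = [], []
        with open(path) as hist:
            for line in hist:
                fields = line.split()
                if len(fields) == 2 and fields[0].isdigit():
                    coverage.append(int(fields[0]))
                    count.append(int(fields[1]))
        plt.plot(coverage, count, label=path)

    plt.xlim(1, 300)          # region of interest; errors dominate the very low end
    plt.yscale("log")
    plt.xlabel("k-mer coverage")
    plt.ylabel("number of distinct k-mers")
    plt.legend()
    plt.savefig("kmer_profiles.png")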

Do you expect your individual to be very heterozygous? GenomeScope estimated het=0.115%, so it doesn't appear so. What's your estimated genome size, what's the ploidy, and are there any recent known whole genome duplications or hybridizations?

lsterck commented 6 years ago

I see, that one indeed seems 'normal'. I used BBMerge. Do you then mean 'cleaned/trimmed' but not merged, or not even cleaned?

Ah, OK, so I can use a different k-mer for GenomeScope than for assembly? What does ABySS use for this coverage evaluation? Is it the one I set for assembly (the -k option)?

Yes and no :-) . The species itself is most likely quite heterozygous, but the DNA should (for the biggest fraction) come from the same haplotype, so much less heterozygosity is expected in my sequencing data. The estimated genome size is roughly around 25 Gb, diploid (2N), and no, not really.

sjackman commented 6 years ago

Ah, OK, so I can use a different k-mer for GenomeScope than for assembly?

Yes, for the purpose of estimating genome size, heterozygosity, and repeat content of your genome.

What does ABySS use for this coverage evaluation? Is it the one I set for assembly (the -k option)?

Yes, ABySS uses the k-mer profile of the assembly k.

Yes and no :-) . The species itself is most likely quite heterozygous, but the DNA should (for the biggest fraction) come from the same haplotype, so much less heterozygosity is expected in my sequencing data. The estimated genome size is roughly around 25 Gb, diploid (2N), and no, not really.

Oh, neat. I was looking for an explanation for why the error hump is so fat. It could be the other rejected haplotype bleeding through. I'm curious; what technique was used to enrich for one haplotype?

What's your estimated sequencing depth? That is, number of reads * read length / estimated genome size.

lsterck commented 6 years ago

Oh, neat. I was looking for an explanation for why the error hump is so fat. It could be the other rejected haplotype bleeding through. I'm curious; what technique was used to enrich for one haplotype?

Might indeed be, to some extent. Ah, that's wet-lab stuff: from what I understand, the DNA was extracted from callus tissue from a single gametophyte (or something like that :-) ). So it's not really 'enriched'; there is only one haplotype present.

What's your estimated sequencing depth? That is, number of reads * read length / estimated genome size.

To be honest, I did not exactly count that :-/ . I have roughly 10 Tbase of data, so I estimate around 30-40x.

sjackman commented 6 years ago

Did you mean 1 Tbase of data? 30–40 fold coverage is a bit on the low side for assembly of a diploid organism. You may be able to get away with lower coverage since you have haploid tissue. I usually recommend 50x to 100x for assembly of a diploid genome. 30–40 fold coverage is enough to get an assembly, but sequencing coverage gaps may limit contiguity.

sjackman commented 6 years ago

Do you then mean 'cleaned/trimmed' but not merged, or not even cleaned?

Trimming for adapters is fine, though I didn't notice it making much difference to GenomeScope. I usually don't trim for quality values, unless I need to reduce the memory usage of ABySS, which may be relevant for a 25 Gbp genome. Have you seen that ABySS 2.0 reduces memory usage by about 10-fold over ABySS 1.x?

lsterck commented 6 years ago

Did you mean 1 Tbase of data? 30–40 fold coverage is a bit on the low side for assembly of a diploid organism. You may be able to get away with lower coverage since you have haploid tissue. I usually recommend 50x to 100x for assembly of a diploid genome. 30–40 fold coverage is enough to get an assembly, but sequencing coverage gaps may limit contiguity.

Oops, yes indeed, 1 Tbase ;-) . Yes, I always suspected it was on the low end, unfortunately, but I also put my hope in the haploid-tissue 'advantage'. I looked up the counts on my raw reads, which add up to ~2 Tbase of sequence (but uncleaned).

sjackman commented 6 years ago

2 Tbase of raw coverage of a 25 Gbp genome is 80 fold raw coverage, which is just fine, particularly for haploid tissue.

lsterck commented 6 years ago

Thanks for reassuring me (and getting my hopes up :-) ).

To get back to the beginning of this thread: do you think there is something off in the run I'm doing now (using a k-mer of 85)? Should I be concerned about this 'high' minimum k-mer coverage?

Also: if I can use a different k-mer for GenomeScope than for assembly, what can I then learn from that output (k-mer-wise, that is)?

Will I be better off if I provide the data 'unmerged', simply as single-end files?

sjackman commented 6 years ago

To get back to the beginning of this thread: do you think there is something off in the run I'm doing now (using a k-mer of 85)?

Yes, I do think something is off with the data. Figuring out exactly what that is may not be easy, or even possible, and learning what the root issue is may or may not help you get a better assembly.

Should I be concerned about this 'high' minimum k-mer coverage?

Yes, it may not produce a very contiguous assembly. I'd be concerned if GenomeScope fails to converge.

Also: if I can use a different k-mer for GenomeScope than for assembly, what can I then learn from that output (k-mer-wise, that is)?

A different value of k for GenomeScope may be better for estimating parameters of the genome, like genome size and heterozygosity. Smaller values of k may be better for GenomeScope, whereas larger values may be better for ABySS.

I suggest looking at the ntCard profiles and seeing whether any value of k gives a typical-looking curve.

lsterck commented 6 years ago

Is it OK to assume that, as long as GenomeScope converges for a certain k-mer, it's OK to use that for ABySS as well (given that it also gives a more or less typical-looking curve)?

sjackman commented 6 years ago

Okay in that you'll likely get an assembly, yes. It may not be the optimal value of k. I'd really like to be able to better interpret the output of GenomeScope to predict which values of k are likely to produce the best assembly.

lsterck commented 6 years ago

I think I might have traced the 'problem' back to erroneous k-mers derived from the MP (mate-pair) libraries. Does that sound familiar? If I use the same k-mer in a GenomeScope estimation, I can see the 'error' bump increasing as I gradually add the MP data, so I figured it must be that. Why did I add these at the unitig stage, you might ask? I added them as single-end (se) data for the unitig and contig stages; I was under the impression that's a common/logical thing to do. However, it turns out I might have to do serious additional cleaning on the MP data (even for use in se mode?).

sjackman commented 6 years ago

Are they Nextera mate-pair libraries? If so, you definitely need to run nxTrim first, even just to use them for scaffolding, and definitely if you plan to use them as sequence in the de Bruijn graph. If they're not Nextera mate-pair libraries, I do not recommend using them for de Bruijn graph sequence, due to the chimeric circular reads. See https://github.com/sjackman/abyss-drosophila-melanogaster#readme

lsterck commented 6 years ago

Thanks. Yes, I stumbled on the nxTrim pre-processing approach myself. I'm now digging up the metadata for the libraries to see the kit specifications. If they are Nextera, I'll certainly give nxTrim a go.

An interesting read! Is that a homozygous dataset or not? I currently have a similar read-data setup (100 bp) and I was kind of surprised to see your optimal k value being 48. Perhaps I have to enlarge my optimisation range then (I started at ±60, as I was under the impression the lowest sensible k was about 2/3 of the read length :/ ).

sjackman commented 6 years ago

I believe that data set was diploid.

There is no hard lower limit for k. In my experience, with typical data sets, the optimal value of k is often around 2/3 the read length, but it varies widely depending on the data set. Your best bet is to try a few different assemblies and pick the best k based on the largest NG50 (if you don't have a close reference to estimate assembly correctness).
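
If you end up comparing many runs, that selection can be scripted. Here is a small sketch (it assumes abyss-fac output saved as a tab-separated file with the NG50 and name columns shown in the table earlier; NG50 is only reported when abyss-fac is given a genome size, and the helper itself is hypothetical):

    # Hypothetical helper: read abyss-fac output for several assemblies and report the
    # one with the largest NG50. Column names follow the table posted earlier in this
    # thread; adjust if your abyss-fac output differs.
    import csv
    import sys

    def best_by_ng50(fac_path):
        with open(fac_path) as handle:
            rows = list(csv.DictReader(handle, delimiter="\t"))
        return max(rows, key=lambda row: float(row["NG50"]))

    if __name__ == "__main__":
        best = best_by_ng50(sys.argv[1])
        print(f"largest NG50: {best['NG50']}  ({best['name']})")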