bcgsc / ntCard

Estimating k-mer coverage histogram of genomics data
http://www.bcgsc.ca/platform/bioinfo/software/ntcard
MIT License
76 stars 9 forks source link

GenomeScope does not accept hist files from ntCard #8

Closed mmokrejs closed 6 years ago

mmokrejs commented 6 years ago

Hi, @sjackman advised me to run ntCard on my raw data in https://github.com/bcgsc/abyss/issues/178#issuecomment-364214206 . Are you sure the output files can be parsed by GenomeScope http://qb.cshl.edu/genomescope/ ? This is a fish genome, probably 1.7Gbp.

It complains with: File was uploaded but it has 1 column(s). The file must have 2 columns separated by a single space, which is the default in Jellyfish. I admit I used current 'git master' version.

The command was: ntcard -t "$threads" -k32,64,96,112,128,144,156 -c $num .... I used num=512 for the PE+MP files and num=1000 for the PE-only data. Probably is a good idea to omit MP-datasets as the library diversity may be low, right?

tt_16D1C3L12PE-onlyntCard_k32.hist.txt tt_16D1C3L12PE-onlyntCard_k64.hist.txt tt_16D1C3L12PE-onlyntCard_k96.hist.txt tt_16D1C3L12PE-onlyntCard_k112.hist.txt tt_16D1C3L12PE-onlyntCard_k128.hist.txt tt_16D1C3L12PE-onlyntCard_k144.hist.txt tt_16D1C3L12PE-onlyntCard_k156.hist.txt tt_16D1C3L12__PE_MP__ntCard_k32.hist.txt tt_16D1C3L12__PE_MP__ntCard_k64.hist.txt tt_16D1C3L12__PE_MP__ntCard_k96.hist.txt tt_16D1C3L12__PE_MP__ntCard_k112.hist.txt tt_16D1C3L12__PE_MP__ntCard_k128.hist.txt tt_16D1C3L12__PE_MP__ntCard_k144.hist.txt tt_16D1C3L12__PE_MP__ntCard_k156.hist.txt

mmokrejs commented 6 years ago

BTW, here is how it performed on one cluster node with 24 physical cores and 128GB RAM. It is nice it was able to read the input data at 600MB/s in the beginning.

ntcard__24threads__lustrefs_usage ntcard__24threads__memory_usage ntcard__24threads__cpu_usage ntcard__24threads__system_load

resources_used.cpupercent=659
resources_used.cput=02:01:19
resources_used.mem=3674224kb
resources_used.ncpus=24
resources_used.walltime=00:54:20
sjackman commented 6 years ago

You have to adjust the file format slightly. Change…

F1  1076739332427
F0  72176413083
f1  61726639195
f2  3305377279
f3  775645151

to

1 61726639195
2 3305377279
3 775645151

Delete lines F1 and F0, delete the characters f, and change tabs to spaces.

mohamadi commented 6 years ago

Thanks @sjackman! @mmokrejs please let me know if you need help on ntCard.

mmokrejs commented 6 years ago

Thank you. You meant tabs to spaces I guess.

$ cat ntCard_hist2histo.sh
#!  /bin/sh
tail -n +3 $1 | sed -e "s/^f//" | sed -e "s/\t/ /"
$

I should learn now how to interpret the figures. I should probably stitch together an animated GIF file. I see a haploid and diploid peak. What should I be after regarding assembly k-mer size? The minimum errror frequency where there is the minumum of errors and of full model?

genomescope_URLs.txt

mmokrejs commented 6 years ago

Could there be a download option of the results? Firefox's "Save page" failed and stored some skeleton of the page, and wget --mirror -l2 $url also. It is boring to fetch the images manually, though doable.

sjackman commented 6 years ago

I don't develop GenomeScope. I suggest contacting its authors.

sjackman commented 6 years ago

Good looking curves. That's a fair amount of heterozygous and repeat sequence. An awful lot of repeat sequence.

sjackman commented 6 years ago

I see that you have a k=156 plot. How long are your reads? Based on the GenomeScope plot, you seem to have enough coverage for assembly at k=156.

mmokrejs commented 6 years ago

They are 2x250 Illumina, but maybe the size should be somewhat correct for mate-pair reads because they need to be split via tthe linker. I have no idea why is this really imporant for calculations for the GenomeScope plot.

So what is normal/common amount of heterozygosity? I though this fish is rather at the lower values. BTW, I saw in an example http://qb.cshl.edu/genomescope/analysis.php?code=example3 much more. Would you be please more verbose what the figures really mean to you, what should I look for in them, etc.? The peaks at k=156 seem ugly and are not separated from each other. Does it matter? You only care if the "k-coverage" is above a certain value? Do you see a difference Iif I used just the paired-end read data or when I added mate-pair data? What is the "proper" approach?

sjackman commented 6 years ago

Here's some recent measurements heterozygosity. I agree that your fish seems on the low side for fish. https://support.10xgenomics.com/de-novo-assembly/software/overview/latest/performance

I'm still learning to interpret k-mer coverage histograms. It's an open problem in my opinion, picking good values of k for assembly from k-mer coverage histograms. I look for good separation between the first error hump and the next hump, and the local minimum is somewhere around or less than 5. I'm still not sure what leads to larger values for the minimum. Possibly PCR errors or other library construction artifacts.

An assembly with mode k-mer coverage less than 10 probably won't work well. Between 10 and 20 is a bit of a grey area, but may be okay. Above 20 is usually fine. I mostly use this criterion to set an upper limit of values of k that I think will work for assembly. You can start for example with the largest value of k that gives you ≥ 10 mode k-mer coverage.

I compare the sizes of the heterozygous and homozygous humps to give an idea of how much genome you're expecting to assemble, between a haploid and diploid genome. For example, if you're haploid genome size is 1 GB, but your heterozygous genome hump is 10x that of your homozygous genome hump, you expect to assembly something closer to 2 GB of sequence.

mohamadi commented 6 years ago

@mmokrejs Hi Martin, if you need help on ntCard output to be used for GenomeScope please let me know. I can add an option to generate the output histograms for GenomeScope.

Best, Hamid