Closed mmokrejs closed 6 years ago
BTW, here is how it performed on one cluster node with 24 physical cores and 128GB RAM. It is nice it was able to read the input data at 600MB/s in the beginning.
resources_used.cpupercent=659
resources_used.cput=02:01:19
resources_used.mem=3674224kb
resources_used.ncpus=24
resources_used.walltime=00:54:20
You have to adjust the file format slightly. Change…
F1 1076739332427
F0 72176413083
f1 61726639195
f2 3305377279
f3 775645151
to
1 61726639195
2 3305377279
3 775645151
Delete lines F1
and F0
, delete the characters f
, and change tabs to spaces.
Thanks @sjackman! @mmokrejs please let me know if you need help on ntCard.
Thank you. You meant tabs to spaces I guess.
$ cat ntCard_hist2histo.sh
#! /bin/sh
tail -n +3 $1 | sed -e "s/^f//" | sed -e "s/\t/ /"
$
I should learn now how to interpret the figures. I should probably stitch together an animated GIF file. I see a haploid and diploid peak. What should I be after regarding assembly k-mer size? The minimum errror frequency where there is the minumum of errors
and of full model
?
Could there be a download option of the results? Firefox's "Save page" failed and stored some skeleton of the page, and wget --mirror -l2 $url
also. It is boring to fetch the images manually, though doable.
I don't develop GenomeScope. I suggest contacting its authors.
Good looking curves. That's a fair amount of heterozygous and repeat sequence. An awful lot of repeat sequence.
I see that you have a k=156 plot. How long are your reads? Based on the GenomeScope plot, you seem to have enough coverage for assembly at k=156.
They are 2x250 Illumina, but maybe the size should be somewhat correct for mate-pair reads because they need to be split via tthe linker. I have no idea why is this really imporant for calculations for the GenomeScope plot.
So what is normal/common amount of heterozygosity? I though this fish is rather at the lower values. BTW, I saw in an example http://qb.cshl.edu/genomescope/analysis.php?code=example3 much more. Would you be please more verbose what the figures really mean to you, what should I look for in them, etc.? The peaks at k=156 seem ugly and are not separated from each other. Does it matter? You only care if the "k-coverage" is above a certain value? Do you see a difference Iif I used just the paired-end read data or when I added mate-pair data? What is the "proper" approach?
Here's some recent measurements heterozygosity. I agree that your fish seems on the low side for fish. https://support.10xgenomics.com/de-novo-assembly/software/overview/latest/performance
I'm still learning to interpret k-mer coverage histograms. It's an open problem in my opinion, picking good values of k for assembly from k-mer coverage histograms. I look for good separation between the first error hump and the next hump, and the local minimum is somewhere around or less than 5. I'm still not sure what leads to larger values for the minimum. Possibly PCR errors or other library construction artifacts.
An assembly with mode k-mer coverage less than 10 probably won't work well. Between 10 and 20 is a bit of a grey area, but may be okay. Above 20 is usually fine. I mostly use this criterion to set an upper limit of values of k that I think will work for assembly. You can start for example with the largest value of k that gives you ≥ 10 mode k-mer coverage.
I compare the sizes of the heterozygous and homozygous humps to give an idea of how much genome you're expecting to assemble, between a haploid and diploid genome. For example, if you're haploid genome size is 1 GB, but your heterozygous genome hump is 10x that of your homozygous genome hump, you expect to assembly something closer to 2 GB of sequence.
@mmokrejs Hi Martin, if you need help on ntCard output to be used for GenomeScope please let me know. I can add an option to generate the output histograms for GenomeScope.
Best, Hamid
Hi, @sjackman advised me to run ntCard on my raw data in https://github.com/bcgsc/abyss/issues/178#issuecomment-364214206 . Are you sure the output files can be parsed by GenomeScope http://qb.cshl.edu/genomescope/ ? This is a fish genome, probably 1.7Gbp.
It complains with:
File was uploaded but it has 1 column(s). The file must have 2 columns separated by a single space, which is the default in Jellyfish
. I admit I used current 'git master' version.The command was:
ntcard -t "$threads" -k32,64,96,112,128,144,156 -c $num ...
. I used num=512 for the PE+MP files and num=1000 for the PE-only data. Probably is a good idea to omit MP-datasets as the library diversity may be low, right?tt_16D1C3L12PE-onlyntCard_k32.hist.txt tt_16D1C3L12PE-onlyntCard_k64.hist.txt tt_16D1C3L12PE-onlyntCard_k96.hist.txt tt_16D1C3L12PE-onlyntCard_k112.hist.txt tt_16D1C3L12PE-onlyntCard_k128.hist.txt tt_16D1C3L12PE-onlyntCard_k144.hist.txt tt_16D1C3L12PE-onlyntCard_k156.hist.txt tt_16D1C3L12__PE_MP__ntCard_k32.hist.txt tt_16D1C3L12__PE_MP__ntCard_k64.hist.txt tt_16D1C3L12__PE_MP__ntCard_k96.hist.txt tt_16D1C3L12__PE_MP__ntCard_k112.hist.txt tt_16D1C3L12__PE_MP__ntCard_k128.hist.txt tt_16D1C3L12__PE_MP__ntCard_k144.hist.txt tt_16D1C3L12__PE_MP__ntCard_k156.hist.txt