TGAC / KAT

The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.
http://www.earlham.ac.uk/kat-tools
GNU General Public License v3.0

Large, plant genome & multiple input files #116

Closed AllisonStander closed 5 years ago

AllisonStander commented 5 years ago

Good day,

I would like to use KAT for k-mer analysis on my sequenced genome. The genome has an estimated size of 2.5 Gbp, high heterozygosity, and repetitive regions.

My data is from one Illumina paired-end library and two mate-pair libraries. The data for the paired-end library is split across multiple (7) fastq.gz files. All the data together amounts to 210 GB.

I have access to 36 cores and 500 GB of RAM.

My question is: would KAT be able to handle the size of my data set and the multiple input files?

Thank you for your time.

Kind regards, Allison

bjclavijo commented 5 years ago

Hi Allison, to assess your assembly you should only use the paired-end library, which is normally much less biased than LMPs. It is not clear to me whether the 210 GB is the disk size of your compressed files, but I guess it is. If the PE library amounts to more than 250 Gbp on its own (i.e. more than 50x coverage per haplotype), I would leave out some, or all, of the R2 files to decrease memory and CPU usage. Usually R1 is of higher quality and thus produces cleaner spectra, and you will gain very little extra insight from running kat spectra-cn with anything more than 50x per haplotype of your genome.
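
As a rough sketch of what that looks like in practice (I'm assuming kat comp, which generates the spectra-cn comparison; the file names and output prefix below are just placeholders, and as far as I remember quoting a glob groups several read files into a single input set):

    kat comp -t 36 -o pe_r1_vs_asm 'pe_R1_lane*.fastq.gz' assembly.fasta

This counts k-mers from all the quoted R1 files as one read set and compares them against the assembly, producing the matrix behind the spectra-cn plot.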

Best,

bj

AllisonStander commented 5 years ago

I checked, and the PE data is 187 GB compressed.

Thank you so much for the advice and quick response!

Kind regards,

Allison

AllisonStander commented 5 years ago

I had a successful run with all of my paired-end data at k=27.

To see whether I can get better results, I changed k to 31 and added -d to save any jellyfish hashes produced to disk.

This run then ran out of memory:

../deps/seqan-library-2.0.0/include/seqan/basic/basic_exception.h:368 FAILED!  (Uncaught exception of type std::runtime_error: Hash full)
/var/spool/slurm/job12280/slurm_script: line 18:  7132 Aborted (core dumped) 

I am unsure if this is due to the increase in k, the added -d option, or both.

What is the reason for wanting to save the jellyfish hashes, i.e. where would I use them?

Kind regards, Allison

maplesond commented 5 years ago

Hi Allison, yes, increasing the k value will increase memory usage. I don't believe dumping hashes has any impact on memory, but it will slow KAT down a bit. In general there isn't much reason to do this; at present KAT can process reads from a FASTQ faster than it can read a hash.

gonzalogacc commented 5 years ago

Hi Allison, try setting -H and -g (e.g. -H10000000000 -g). This will initialise the hash to a big enough size and prevent KAT from growing it in the middle of the count, which gives you a better chance of completing the count in one go.
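
For instance, a k=31 run might look something like this (a sketch only, using the flags mentioned above; the file names and output prefix are placeholders):

    kat comp -t 36 -m 31 -H 10000000000 -g -o pe_vs_asm_k31 'pe_R1_lane*.fastq.gz' assembly.fasta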

If this fails, try using jellyfish 2 directly to count the k-mers and then use the jellyfish hash as input for KAT. Page 3 of the manual explains why.
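
A rough sketch of that two-step route (again, the file names, hash size and thread count are placeholders to adjust to your setup; as far as I know jellyfish 2 does not read gzipped files directly, hence the zcat process substitution):

    jellyfish count -C -m 31 -s 10G -t 36 -o pe_k31.jf <(zcat pe_R1_lane*.fastq.gz)
    kat comp -t 36 -m 31 -o pe_vs_asm_k31 pe_k31.jf assembly.fasta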

Best Gonza.-

bjclavijo commented 5 years ago

Closing this as it seems to have been solved.

Best,

bj