Kmer-File-Format / kff-tools

GNU Affero General Public License v3.0

state of kff-tools on human-scale data #10

Closed rchikhi closed 1 year ago

rchikhi commented 3 years ago

Test setup: human reads from https://www.ncbi.nlm.nih.gov/sra/?term=SRR034956 (158G all.fastq.gz). I ran KMC to produce a KFF file:

\time kmc -fq -okff -k31 -ci2  all.fastq.gz kmc_k31 .
   Total no. of k-mers                :  94768162610
   Total no. of reads                 :   1415483596
   Total no. of super-k-mers          :   9005121051

33G kmc_k31.kff
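
For anyone reproducing these measurements, here is a minimal sketch of how wall-clock time and peak memory of a single command could be collected, similar in spirit to the `\time` prefix used above. The helper name `measure` is hypothetical and not part of KMC or kff-tools. It assumes a Unix system (the `resource` module is not available on Windows), and `ru_maxrss` is reported in kilobytes on Linux but in bytes on macOS.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd (a list of arguments) and return (wall_seconds, peak_rss, returncode).

    peak_rss is the largest resident set size over all child processes this
    Python process has waited for so far, so measure one command per run
    if a clean per-command value is needed.
    """
    start = time.perf_counter()
    proc = subprocess.run(cmd, check=False)
    wall = time.perf_counter() - start
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return wall, peak, proc.returncode


# Example with a trivial child process; replace the list with the
# command line you actually want to benchmark.
wall, peak, rc = measure([sys.executable, "-c", "pass"])
print(f"exit={rc} wall={wall:.2f}s peak_rss={peak}")
```

The command is passed as an argument list rather than a shell string, so no flags are assumed for any particular tool.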

| status | tool | time | memory |
|---|---|---|---|
| ☑️ | kmc | 41m | 11 GB |
| ☑️ | kff-tools outstr | 2h25 | < 1 MB |
| ☑️ | kff-tools instr | 40m | < 1 MB |
| ☑️ | kff-tools translate | 5m | < 1 MB |
| ☑️ | kff-tools data-rm | 5m | < 1 MB |
| ☑️ | kff-tools validate | 3m | < 1 MB |
| ☑️ | kff-tools disjoin | 9m | < 1 MB |
| ☑️ | kff-tools split | 48m | < 1 MB |
| ☑️ | kff-tools merge | 1h | < 1 MB |
| ☑️ | kff-tools sort | 28m | 1.4 GB |
| ☑️ | kff-tools shuffle | 47m | 1.4 GB |
| | kff-tools bucket | | |
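
To make the timings above easier to compare, the following sketch converts them to minutes and expresses each one relative to the KMC counting step. The values are copied from the table. The function name `to_minutes` and the choice of KMC as the baseline are illustrative assumptions, not something the tools report.

```python
import re

# Wall times copied from the table above (tool -> reported time).
TIMES = {
    "kmc": "41m",
    "kff-tools outstr": "2h25",
    "kff-tools instr": "40m",
    "kff-tools translate": "5m",
    "kff-tools data-rm": "5m",
    "kff-tools validate": "3m",
    "kff-tools disjoin": "9m",
    "kff-tools split": "48m",
    "kff-tools merge": "1h",
    "kff-tools sort": "28m",
    "kff-tools shuffle": "47m",
}


def to_minutes(text):
    """Convert strings such as '41m', '1h' or '2h25' to whole minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m?)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = match.groups()
    return 60 * int(hours or 0) + int(minutes or 0)


baseline = to_minutes(TIMES["kmc"])

# Print tools from fastest to slowest, with the ratio to the KMC run.
for tool, raw in sorted(TIMES.items(), key=lambda item: to_minutes(item[1])):
    minutes = to_minutes(raw)
    print(f"{tool:<20} {minutes:>4} min  {minutes / baseline:.2f}x kmc")
```

The ratios are only a rough normalization, since the tools do not all read or write the same amount of data.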

Remarks:

yoann-dufresne commented 3 years ago

Bucket tool currently under optimization (bucket branch)

yoann-dufresne commented 3 years ago

bucket is now working on the human genome. Compact is still too slow.