pb-cdunn opened 8 years ago
Given a set of gzipped fastas in a fofn,
$ ls -ltrah 00-orig/
total 32G
-rwxrwxr-x 1 cdunn Domain Users 278 Jun 11 10:22 load_orig_db.xsub.sh
-rw-rw-r-- 1 cdunn Domain Users 626M Jun 11 10:47 orig.db
-rw-rw-r-- 1 cdunn Domain Users 31G Jun 11 10:47 .orig.bps
-rw-rw-r-- 1 cdunn Domain Users 0 Jun 11 10:47 load_orig_db_done.exit
-rw-rw-r-- 1 cdunn Domain Users 0 Jun 11 10:47 load_orig_db_done
-rw-rw-r-- 1 cdunn Domain Users 373M Jun 11 10:47 .orig.idx
The program basically works, and it converts 100x reads of a 3Gnt genome into DAZZ_DB in only 25min, serially. Good!
real 24m50.938s
user 32m12.372s
sys 3m45.579s
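The import step above boils down to decompressing each gzipped fasta listed in the fofn and feeding the concatenated stream into the DB loader. A minimal sketch of that streaming step in Python (the pipe target `fasta2DB -iorig orig` in the usage comment is illustrative; check your DAZZ_DB version for the exact stdin flag):

```python
import gzip
import shutil
import sys
from pathlib import Path

def stream_fofn(fofn_path, out):
    """Decompress each gzipped FASTA listed in the fofn and
    concatenate the plain-text records onto `out` (e.g. a pipe)."""
    for line in Path(fofn_path).read_text().splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines in the fofn
        with gzip.open(name, "rb") as f:
            shutil.copyfileobj(f, out)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python stream_fofn.py input.fofn | fasta2DB -iorig orig
    stream_fofn(sys.argv[1], sys.stdout.buffer)
```

This avoids writing the intermediate uncompressed fasta (14G in the example below) to lustre at all.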
There is still a problem: the DB ASCII description, orig.db, is 626MB. That is caused by the Issue tracked here:
I am thinking about writing from BAM to .dexta instead of .fasta.gz. It offers better compression (a guaranteed 4x versus a typical 3x) with lower runtime. For future comparison, here are some lustre-to-lustre timings:
$ time cp /lustre/hpcprod/cdunn/work/crow/job_output/tasks/falcon_ns.tasks.task_hgap_run-0/run-bam2fasta/fasta_job_000/chunk_000.fasta.gz foo.fasta.gz
real 0m44.915s
user 0m0.014s
sys 0m14.440s
# That is just a lustre-to-lustre copy, for comparison.
$ time dexta -vk foo.fasta
Processing 'foo' ...
Done
real 0m52.560s
user 0m33.727s
sys 0m18.747s
$ time gzip -k --fast foo.fasta
real 7m30.484s
user 5m32.635s
sys 1m57.427s
# It's a bit faster writing the output to /scratch.
$ time gzip -k -c --fast foo.fasta > /scratch/foo.fasta.gz
real 5m49.356s
user 5m11.932s
sys 0m37.160s
$ time gzip -k foo.fasta
# With the default compression level, the result is ~10% smaller than bam2fasta's gzip output, but much slower:
real 52m59.413s
user 51m58.728s
sys 1m0.041s
$ ls -Gglh
-rw-rw-r-- 1 3.4G Jun 11 12:16 foo.dexta
-rw-rw-r-- 1 14G Jun 11 11:48 foo.fasta
-rw-rw-r-- 1 4.7G Jun 11 12:19 foo.fasta.gz
That is 3.99x vs. 2.88x compression, and dexta is roughly 10x faster. We need to stop using gzip.
Here are stats for uncompressing:
$ time gunzip -c chunk_000.fasta.gz > foo.fasta
real 4m22.164s
user 2m24.364s
sys 1m51.633s
(I will update with undexta after we fix a bug.)
This Issue is to track the development of a new program to import gzipped fastas into a new DAZZ_DB.