PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

New fasta2fofn #383

Open pb-cdunn opened 8 years ago

pb-cdunn commented 8 years ago

This Issue is to track the development of a new program to import gzipped fastas into a new DAZZ_DB.

pb-cdunn commented 8 years ago

Given a set of gzipped fastas in a fofn,

$ ls -ltrah 00-orig/
total 32G
-rwxrwxr-x 1 cdunn Domain Users  278 Jun 11 10:22 load_orig_db.xsub.sh
-rw-rw-r-- 1 cdunn Domain Users 626M Jun 11 10:47 orig.db
-rw-rw-r-- 1 cdunn Domain Users  31G Jun 11 10:47 .orig.bps
-rw-rw-r-- 1 cdunn Domain Users    0 Jun 11 10:47 load_orig_db_done.exit
-rw-rw-r-- 1 cdunn Domain Users    0 Jun 11 10:47 load_orig_db_done
-rw-rw-r-- 1 cdunn Domain Users 373M Jun 11 10:47 .orig.idx
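The core of the importer could be sketched in Python as a fofn reader feeding a streaming gzipped-FASTA parser. This is only an illustration of the idea, not the actual program's code; the function names are made up here:

```python
import gzip

def read_fofn(fofn_path):
    """Yield one input path per non-blank line of a file-of-file-names (fofn)."""
    with open(fofn_path) as f:
        for line in f:
            path = line.strip()
            if path:
                yield path

def stream_fasta_records(paths):
    """Stream (header, sequence) pairs from gzipped FASTA files, one record at a time."""
    for path in paths:
        with gzip.open(path, "rt") as f:
            header, chunks = None, []
            for line in f:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line[1:], []
                else:
                    chunks.append(line)
            if header is not None:
                yield header, "".join(chunks)
```

Streaming record-by-record keeps memory flat regardless of input size, which matters at 31G of packed bases.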

The program basically works: it converts 100x reads of a 3 Gnt genome into a DAZZ_DB in only 25 min, serially. Good!

real    24m50.938s
user    32m12.372s
sys     3m45.579s
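For scale, a back-of-the-envelope on the figures above: 100x of a 3 Gnt genome is ~300 Gnt through the importer in ~25 min, i.e. roughly 200 Mnt/s:

```python
genome_nt = 3e9                   # 3 Gnt genome, from the run above
coverage = 100                    # 100x reads
wall_seconds = 24 * 60 + 50.938   # "real 24m50.938s"

# Serial import throughput, in nucleotides per second
rate_nt_per_s = genome_nt * coverage / wall_seconds
print(f"{rate_nt_per_s / 1e6:.0f} Mnt/s")  # roughly 200 Mnt/s
```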

There is still a problem: the ASCII DB description, orig.db, is 626 MB. That is caused by the Issue tracked here:

pb-cdunn commented 8 years ago

I am thinking about writing from BAM to .dexta instead of .gz. It offers better compression (a guaranteed 4x versus typically 3x) with lower runtime. For future comparison, here are some timings, starting with a plain lustre-to-lustre copy as a baseline:

$ time cp /lustre/hpcprod/cdunn/work/crow/job_output/tasks/falcon_ns.tasks.task_hgap_run-0/run-bam2fasta/fasta_job_000/chunk_000.fasta.gz foo.fasta.gz

real    0m44.915s
user    0m0.014s
sys     0m14.440s
# That is just a lustre-to-lustre copy, for comparison.

$ time dexta -vk foo.fasta
Processing 'foo' ...
Done

real    0m52.560s
user    0m33.727s
sys     0m18.747s

$ time gzip -k --fast foo.fasta

real    7m30.484s
user    5m32.635s
sys     1m57.427s

# It's a bit faster writing into local /scratch.
$ time gzip -k -c --fast foo.fasta > /scratch/foo.fasta.gz

real    5m49.356s
user    5m11.932s
sys     0m37.160s

$ time gzip -k foo.fasta

# With the default compression level, the result is 10% smaller than bam2fasta's gzipped output, but much slower
real    52m59.413s
user    51m58.728s
sys     1m0.041s

$ ls -Gglh
-rw-rw-r-- 1 3.4G Jun 11 12:16 foo.dexta
-rw-rw-r-- 1  14G Jun 11 11:48 foo.fasta
-rw-rw-r-- 1 4.7G Jun 11 12:19 foo.fasta.gz

3.99x vs. 2.88x compression. 10x faster. We need to stop using gzip.
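The "guaranteed 4x" follows from dexta's encoding: each of A/C/G/T fits in 2 bits, so exactly 4 bases pack into one byte regardless of sequence content, whereas gzip's ratio depends on the data. A minimal sketch of 2-bit packing (illustrative only, not dexta's actual on-disk format):

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_2bit(seq):
    """Pack an ACGT string at 2 bits per base: exactly 4 bases per byte."""
    out = bytearray()
    byte, nbases = 0, 0
    for base in seq:
        byte = (byte << 2) | BASE_CODE[base]
        nbases += 1
        if nbases == 4:
            out.append(byte)
            byte, nbases = 0, 0
    if nbases:  # left-align any trailing partial byte
        out.append(byte << (2 * (4 - nbases)))
    return bytes(out)
```

Because every base costs exactly 2 bits, the ratio against 8-bit ASCII is a fixed 4x independent of content; gzip's ratio (about 3x on these reads) varies with the data.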

Here are stats for uncompressing:

$ time gunzip -c chunk_000.fasta.gz > foo.fasta

real    4m22.164s
user    2m24.364s
sys     1m51.633s

(I will update with undexta after we fix a bug.)