PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

daligner: Stub file (.db) of raw_reads is junk when increasing -s to 1000 for pa_DBsplit_option #327

Open aclum opened 8 years ago

aclum commented 8 years ago

Hi, we are testing larger -s values for pa_DBsplit_option to reduce the number of inodes. A test with -s 400 completes without any problems. However, when we increase to -s 1000 there are no errors while creating the database, but when daligner goes to run we get the following error: daligner: Stub file (.db) of raw_reads is junk
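
For context, the only knob we are changing between runs is the block size passed through pa_DBsplit_option. Roughly, the relevant fragment of the fc_run.cfg looks like this (other settings omitted; the -x500 cutoff matches what appears in the log below, the rest is illustrative rather than our exact file):

[General]
input_fofn = input.fofn
input_type = raw
# block size is in megabases; -s 400 works, -s 1000 triggers the daligner error
pa_DBsplit_option = -x500 -s1000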

Here is the log from making the database. No obvious errors as far as I can tell:

aclum@gpint209:/global/projectb/scratch/aclum/falcon/AWSBW/s1000/sge_log$ more prepare_rdb.sh-task_build_rdb-task_build_rdb.o20863516
cd /global/projectb/scratch/aclum/falcon/AWSBW/s1000/0-rawreads
+ cd /global/projectb/scratch/aclum/falcon/AWSBW/s1000/0-rawreads
trap 'touch /global/projectb/scratch/aclum/falcon/AWSBW/s1000/0-rawreads/rdb_build_done.exit' EXIT
+ trap 'touch /global/projectb/scratch/aclum/falcon/AWSBW/s1000/0-rawreads/rdb_build_done.exit' EXIT
ls -il prepare_rdb.sub.sh
+ ls -il prepare_rdb.sub.sh
162551083 -rwxrwxr-x 1 aclum aclum 349 Apr 12 10:25 prepare_rdb.sub.sh
hostname
+ hostname
mc0214
ls -il prepare_rdb.sub.sh
+ ls -il prepare_rdb.sub.sh
162551083 -rwxrwxr-x 1 aclum aclum 349 Apr 12 10:25 prepare_rdb.sub.sh
time /bin/bash ./prepare_rdb.sub.sh
+ /bin/bash ./prepare_rdb.sub.sh
fasta2DB -v raw_reads -f/global/projectb/scratch/aclum/falcon/AWSBW/s1000/0-rawreads/input.fofn
+ fasta2DB -v raw_reads -f/global/projectb/scratch/aclum/falcon/AWSBW/s1000/0-rawreads/input.fofn
Adding 'pbio-1082.9872-nocontrol' ...
Adding 'pbio-1084.9888-nocontrol' ...
Adding 'pbio-1084.9889-nocontrol' ...
Adding 'pbio-1084.9890-nocontrol' ...
Adding 'pbio-1104.10055-nocontrol' ...
Adding 'pbio-1107.10076-nocontrol' ...
Adding 'pbio-1107.10077-nocontrol' ...
Adding 'pbio-1107.10078-nocontrol' ...
Adding 'pbio-1107.10079-nocontrol' ...
Adding 'pbio-1107.10080-nocontrol' ...
Adding 'pbio-1107.10081-nocontrol' ...
DBsplit -x500 -s1000 raw_reads
+ DBsplit -x500 -s1000 raw_reads
LB=$(cat raw_reads.db | awk '$1 == "blocks" {print $3}')
cat raw_reads.db | awk '$1 == "blocks" {print $3}')
cat raw_reads.db | awk '$1 == "blocks" {print $3}'
++ cat raw_reads.db
++ awk '$1 == "blocks" {print $3}'
+ LB=8
HPCdaligner -v -dal4 -t8 -e.70 -l1000 -s1000 -H1000 raw_reads 1-$LB >| /global/projectb/scratch/aclum/falcon/AWSBW/s1000/0-rawreads/run_jobs.sh
+ HPCdaligner -v -dal4 -t8 -e.70 -l1000 -s1000 -H1000 raw_reads 1-8

real    2m0.948s
user    1m36.668s
sys     0m22.550s
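
For reference, a quick way to check the database right after DBsplit, before daligner is ever launched, would be something like this (assuming the DAZZ_DB binaries are on the PATH):

cd /global/projectb/scratch/aclum/falcon/AWSBW/s1000/0-rawreads
DBstats raw_reads          # prints read/length statistics if the stub parses cleanly
DBshow raw_reads 1 | head  # spot-check that a read can be pulled back out of the blocks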
pb-cdunn commented 8 years ago

Interesting. Could you cat raw_reads.db? (It's an ASCII file.) Then try DBstats raw_reads and post that too.

How much RAM do you have on your machine? (You need > 160GB, I think.)

aclum commented 8 years ago
0-rawreads$ cat raw_reads.db
files =        11
     301768 pbio-1082.9872-nocontrol m160121_000223_00123_c100912372550000001823199404301605_s1_p0
     624094 pbio-1084.9888-nocontrol m160123_022914_00123_c100902972550000001823200604301645_s1_p0
     950618 pbio-1084.9889-nocontrol m160123_064729_00123_c100902972550000001823200604301646_s1_p0
    1289306 pbio-1084.9890-nocontrol m160123_110636_00123_c100902972550000001823200604301647_s1_p0
    1710388 pbio-1104.10055-nocontrol m160225_020933_00123_c100930082550000001823209005251654_s1_p0
    2106027 pbio-1107.10076-nocontrol m160226_074755_00123_c100929792550000001823209005251671_s1_p0
    2524588 pbio-1107.10077-nocontrol m160226_120708_00123_c100929792550000001823209005251672_s1_p0
    2935732 pbio-1107.10078-nocontrol m160226_162622_00123_c100929792550000001823209005251673_s1_p0
    3349957 pbio-1107.10079-nocontrol m160226_204534_00123_c100929792550000001823209005251674_s1_p0
    3760263 pbio-1107.10080-nocontrol m160227_010858_00123_c100929792550000001823209005251675_s1_p0
    4082605 pbio-1107.10081-nocontrol m160227_052813_00123_c100929792550000001823209005251676_s1_p0
blocks =         8
size = 1000000000 cutoff =       500 all = 0
         0         0
    558042    159986
   1054378    310087
   1563525    456980
   2141886    614052
   2792390    774164
   3426911    936139
   4006876   1095519
   4082605   1121874
0-rawreads$ DBstats raw_reads
DBstats: Stub file (.db) of raw_reads is junk
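
One thing that stands out in that dump: -s is in megabases, so

-s 400  -> size =  400 * 1,000,000 =   400,000,000  (9 digits)
-s 1000 -> size = 1000 * 1,000,000 = 1,000,000,000  (10 digits)

i.e. -s 1000 is the first setting where the size field in the stub needs 10 digits. If the stub is read back with a fixed-width format for that field, that would explain the "junk" error, but that is only a guess until someone looks at the DAZZ_DB code.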

The machine I was testing on has 120G of memory. Any idea what the maximum -s is that we could use on a 120G machine? That is the majority of our hardware.

pb-cdunn commented 8 years ago

Well, Gene estimates 16GB for -s200, but you'll need extra memory for high kmer counts (the -M parameter). Anyway, your problem is a bug in DBsplit. I think this is worth mentioning to Gene Myers (thegenemyers on GitHub).
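
To put rough numbers on that (scaling the estimate linearly, which may not be exact):

-s 200  -> ~16 GB per daligner job
-s 1000 -> ~16 GB * (1000 / 200) = ~80 GB, plus whatever -M allows for the kmer table

so a 120 GB node is tight for -s 1000 but not obviously too small.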

If he doesn't respond, I'll look into this myself next week.

aclum commented 8 years ago

OK, I've filed a bug with Gene: https://github.com/thegenemyers/DAZZ_DB/issues/16

For the record, I can go up to -s 975 with 120G of memory.
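
Worth noting that 975 * 1,000,000 = 975,000,000 still fits in 9 digits, so this is also consistent with the size-field guess above, though memory may simply be the real limit here.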
