arthurmelobio closed this issue 8 years ago
For smaller-memory machines, one can try a smaller `-s` in `DBsplit` to reduce the memory footprint; however, it will generate more intermediate files. This is part of the daligner design and I have not found a way to work around it. Also, instead of using `-t` to filter repetitive k-mers, one can use `-M` in `pa_HPCdaligner_option` to cap the memory usage. Since you have already set that, did you check in the logs in `sge_log/` whether daligner uses much more than 20G? It seems that you have good read length and high coverage; in that case, you might try making the alignment parameters more stringent (by changing `-h`, `-k`, `-l` and `-w` in the daligner parameters) and use less computation.
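If your configuration is editable, the knobs mentioned above live in the FALCON config file. A minimal sketch of the relevant section; the option values here are illustrative placeholders, not recommendations from this thread:

```ini
; Hypothetical fc_run.cfg fragment -- values are placeholders.
; A smaller -s in DBsplit lowers per-job memory but creates more blocks;
; -M caps daligner memory, as an alternative to -t's frequency cutoff.
pa_DBsplit_option = -x500 -s100
pa_HPCdaligner_option = -v -dal128 -M20 -e0.70 -l1000 -s500 -h256
```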
> For smaller memory machines, one can try smaller "-s" in DBsplit to reduce memory footprint, however, it will generate more intermediate files.

To keep the number of post-daligner files roughly constant, increase `-dal` (the number of daligner block-comparisons per call) by as much as you reduce `-s` (the number of MB per block). We merge the first level of `.las` files immediately after running daligner, so the number of files before the merge step would remain roughly constant. Run with lower concurrency to reduce the number of simultaneous `.las` files. And if you run only one daligner job per machine, then you can drop `-t` and use `-M0` to use all available memory and nothing more.

Later, wait for an update to **FALCON**, and you'll be able to run the jobs on local disk (not currently implemented, though it was available in an earlier version), so the cost of the temporary `.las` files will be small.

But it's wise to go into a single job directory and fiddle with the daligner settings as pb-jchin describes. You can run a single job yourself, "manually", and see how different settings affect the number of alignments in the resulting `.las` files. It's good to get some experience with smaller genomes first.
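The rule of thumb above is just arithmetic: if `-s` is halved, the number of blocks roughly doubles, so doubling `-dal` keeps each daligner call covering about the same data volume. A toy sketch with made-up numbers:

```shell
# Toy numbers, not from this thread: scale -dal up by the same factor
# that -s goes down, so the post-merge .las count stays roughly flat.
old_s=400; old_dal=64
new_s=200                                  # halve the block size
new_dal=$(( old_dal * old_s / new_s ))     # 64 * 400 / 200 = 128
echo "$new_dal"
```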
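Running a single job "manually" amounts to pulling one daligner line out of the generated `run_jobs.sh` and running it by hand with adjusted settings. The file below is a fabricated stand-in for illustration; the real `run_jobs.sh` is generated by `HPCdaligner` in `0-rawreads/`:

```shell
# Illustrative stand-in for a generated run_jobs.sh; the real file
# contains one daligner line per block comparison.
printf '%s\n' \
  'daligner -v -t18 -M20 raw_reads.1 raw_reads.1' \
  'daligner -v -t18 -M20 raw_reads.2 raw_reads.1' > run_jobs.sh.example
# Pick the first daligner command to experiment with, e.g. different -h/-k/-l:
grep -m1 '^daligner' run_jobs.sh.example
```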
Hi Jason Chin and Christopher Dunn, thanks to both of you for the quick feedback. I appreciate it.
In `sge_log` I have:
```
cd /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads
trap 'touch /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done.exit' EXIT
ls -il prepare_rdb.sub.sh
231260396382642738 -rwxr-xr-x 1 iplant G-802086 495 Apr 25 21:28 prepare_rdb.sub.sh
hostname
c426-601.stampede.tacc.utexas.edu
ls -il prepare_rdb.sub.sh
231260396382642738 -rwxr-xr-x 1 iplant G-802086 495 Apr 25 21:28 prepare_rdb.sub.sh
time /bin/bash ./prepare_rdb.sub.sh
fasta2DB -v raw_reads -f/scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/input.fofn
DBsplit -x2500 -s200 raw_reads
LB=$(cat raw_reads.db | awk '$1 == "blocks" {print $3}')
LB=582
HPCdaligner -M20 -dal128 -t18 -e0.75 -l100 -s500 -h1250 -H18600 raw_reads 1-$LB >| /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/run_jobs.sh
real 48m5.262s
user 7m20.127s
sys 13m22.532s
touch /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done
touch /scratch/0004/iplant/arthurmelo/job-3650947532579925530-242ac1112-0001-007-f7b10a12-d404-4e61-a9d7-fa229636d122_0001/0-rawreads/rdb_build_done.exit
```
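The `LB=582` line in the log comes from the awk one-liner over the database header; a self-contained sketch of the same extraction, using a fabricated stand-in for the header line that `DBsplit` writes into `raw_reads.db`:

```shell
# Fabricated stand-in for the relevant header line of raw_reads.db.
printf 'blocks = 582\n' > raw_reads.db.example
# Same extraction as the logged command: print field 3 of the "blocks" line.
LB=$(awk '$1 == "blocks" {print $3}' raw_reads.db.example)
echo "$LB"
```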
It seems to me that daligner is capped at 20G by the `-M20` flag.
Unfortunately, using Falcon on iPlant/CyVerse I can't control the `-dal`, `-s` and `-w` flags. They are automatically set, like `-dal128 -s500` in both `pa_HPCdaligner_option` and `ovlp_HPCdaligner_option`, and `-s200` in both `pa_DBsplit_option` and `ovlp_DBsplit_option`.
Regarding the daligner parameters you suggest changing (`-h`, `-k`, `-l` and `-w`) in order to decrease memory usage, what do you recommend? For example: we are using `-h1250 -k18 -l100`. To decrease memory usage, I could probably increase to `-l2500` and `-k20`, keeping `-h1250`. Do you agree that these modifications could decrease memory consumption?
Thanks a lot. Arthur
You can't set `pa_HPCdaligner_option` yourself, or the others? If iPlant/CyVerse is setting these for you, then talk to the owners of that system. We cannot help with that.
I'll let the expert give advice on `-h`, `-k`, `-l` and `-w`, but it can be difficult remotely. Plus, Jason is very busy.
Hi Christopher, thanks for your reply.
In `pa_HPCdaligner_option` and `ovlp_HPCdaligner_option` I can set some flags, like `-t`, `-e`, `-l` and `-h`; others, like `-M`, `-dal`, `-s` and `-w`, I can't. In `pa_DBsplit_option` and `ovlp_DBsplit_option` I can control the `-x` flag, while `-s` I can't.
I will certainly contact the CyVerse team and the person who installed and hosts Falcon 0.4.2 on CyVerse. However, my question here is to understand how the daligner flags `-t`, `-e`, `-l` and `-h` affect the memory consumption of the analysis, in order to set them appropriately for assembling a huge dataset (116 SMRT cells, ~120X coverage).
Thank you again.
`-e` is the error rate, or rather one minus the error rate. It needs to match your data.
`-t` does the same job as `-M`, but in a different way. `-M` stores as many k-mer hits as fit in memory, while `-t` ignores any k-mer with a high frequency. Both are useful for filtering homopolymer runs, and somewhat useful for ignoring repeats. But mainly, they just reduce memory and runtime by ignoring k-mer hits that are probably spurious anyway.
But I wouldn't use `-t` to control total memory. You need to work with your system owners.
Hi Falcon team,
I'm trying to assemble a diploid plant genome with an estimated size of approximately 800 Mb. We have 116 SMRT cells, 12,190,085 raw reads (from Secondary Analysis Results), and on average 120X coverage. In addition, the 30X longest reads have, on average, 18,600 bp. The parameters for our project were based on pb-jchin's suggestions for assembling a 1 Gb plant genome (Falcon GitHub discussion #308) and two rounds of discussion with Roberto Lleras. We are working with Falcon 0.4.2 installed on iPlant/CyVerse. The job configuration file was set like:
We have a couple of problems… CyVerse reports that the analysis completed, but the "2-asm-falcon" folder is empty. Unfortunately, the analysis on CyVerse does not expose the intermediate folders like "0-rawreads" and "1-preads_ovl", which does not allow me to check files such as "raw_reads.db" and "prepare_rdb.sh.log".
However, looking through the error file (*.err), which is huge (27.8 Mb), I found near the beginning some warnings and one error:
Looking in more detail at the error "PypeTaskBase failed: PypeThreadTaskBase", I found various discussions on GitHub, like thread #173, where it seems the -t16 daligner parameter is too small for big genomes. However, in Falcon thread #308, pb-jchin suggests -t18 is enough to assemble a 1 Gb plant genome. The author of thread #173 also said that when he decreased "length_cutoff" (default is 12000) the issue was resolved. We are using 18600 because this is our 30X longest read length, and I don't know if this is the source of the error. It seems to me daligner never gets as far as producing the .las files and creating the preads for the overlap assembly.
In addition, at the end of the error file (*.err) we have another error message:
According to this last error (RefreshTargets() indicates an unrecoverable error. Shutting down...), I found in discussion #300 that the flag "local_match_count_threshold" could be the cause of the problem, but we don't have this flag in our configuration file.
There are some interesting points to me:
The *.out file produces the following message:
Again, it seems to me that CyVerse does not assign enough memory to assemble this whole dataset (116 SMRT cells, ~120X). But I don't know if I'm correct in this assumption, and I also don't know whether the two error messages could both be caused by the lack of memory, or whether they are related to each other. So any suggestions/thoughts will be appreciated.
Thanks a lot for any help!