Closed StefanoLonardi closed 7 years ago
Hi Stefano,
We tested it and it does work.
You should do something like
cp raw_reads.[0-9]*.las <directory-to-run-hinge-in>
cp raw_reads.db <directory-to-run-hinge-in>
cp .raw_reads* <directory-to-run-hinge-in>
and then you can run a script like the one attached
LAmerge raw_reads.las raw_reads.*.las
DASqv -c100 raw_reads raw_reads.las
mkdir -p log
Reads_filter --db raw_reads --las raw_reads.las -x raw_reads --config ../utils/nominal.ini
hinging --db raw_reads --las raw_reads.las -x raw_reads --config ../utils/nominal.ini -o raw_reads
python pruning_and_clipping.py raw_reads.edges.hinges raw_reads.hinge.list demo
python get_draft_path.py $PWD raw_reads raw_readsdemo.G2.graphml
draft_assembly --db raw_reads --las raw_reads.las --prefix raw_reads --config ../utils/nominal.ini --out raw_reads.draft
python correct_head.py raw_reads.draft.fasta raw_reads.draft.pb.fasta draft_map.txt
fasta2DB draft raw_reads.draft.pb.fasta
HPC.daligner raw_reads draft | bash -v
rm draft.*.raw_reads.*.las
LAmerge draft.raw_reads.las draft.raw_reads.*.las
consensus draft raw_reads draft.raw_reads.las raw_reads.consensus.fasta ../utils/nominal.ini
python get_consensus_gfa.py $PWD raw_reads raw_readsdemo.G2.graphml raw_reads.consensus.fasta
Thanks, Govinda.
Fantastic ... I will give it a try. Thanks a lot.
I have been able to merge all the FALCON las files, but now "Reads_filter" has been running for five days. I understand that my las file is big (1.5 TB), but how long should I let it run? Here is the content of my working directory.
lrwxrwxrwx 1 stelo stelo 57 Oct 15 15:16 .raw_reads.bps -> /scratch12/stelo/falcon49x_runA/0-rawreads/.raw_reads.bps
-rw-rw-r-- 1 stelo stelo 0 Oct 16 07:45 raw_reads.coverage.txt
lrwxrwxrwx 1 stelo stelo 55 Oct 15 15:16 raw_reads.db -> /scratch12/stelo/falcon49x_runA/0-rawreads/raw_reads.db
-rw-rw-r-- 1 stelo stelo 0 Oct 16 07:45 raw_reads.filtered.fasta
-rw-rw-r-- 1 stelo stelo 0 Oct 16 07:45 raw_reads.hinges.txt
-rw-rw-r-- 1 stelo stelo 0 Oct 16 07:45 raw_reads.homologous.txt
lrwxrwxrwx 1 stelo stelo 57 Oct 15 15:16 .raw_reads.idx -> /scratch12/stelo/falcon49x_runA/0-rawreads/.raw_reads.idx
-rw-rw-r-- 1 stelo stelo 1553265615538 Oct 15 01:05 raw_reads.las
-rw-rw-r-- 1 stelo stelo 0 Oct 16 07:45 raw_reads.mas
-rw-rw-r-- 1 stelo stelo 28323280 Oct 15 21:20 .raw_reads.qual.anno
-rw-rw-r-- 1 stelo stelo 308007744 Oct 15 21:20 .raw_reads.qual.data
-rw-rw-r-- 1 stelo stelo 0 Oct 16 07:45 raw_reads.repeat.txt
I am not sure whether it has been made more efficient since I tried, but I was unable to get past the reads_filter step even with 512 GB of RAM for a 343 GB las file. @govinda-kamath told me at the time that:
Both Reads_filter and hinging load the entire las file into memory.
Are you using >1.5 TB RAM?
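Since the whole las file has to fit in memory, a quick pre-flight comparison of file size against total RAM can save days of waiting. A minimal sketch (the fits_in_ram helper is made up, and it assumes Linux's /proc/meminfo):

```shell
# Made-up helper: succeed only if the las file is smaller than total RAM.
fits_in_ram() {
  las_bytes=$(stat -c%s "$1")
  mem_bytes=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) * 1024 ))
  echo "las: $(( las_bytes / 1024 / 1024 / 1024 )) GiB, RAM: $(( mem_bytes / 1024 / 1024 / 1024 )) GiB"
  [ "$las_bytes" -lt "$mem_bytes" ]
}
# usage: fits_in_ram raw_reads.las || echo "expect Reads_filter/hinging to be killed"
```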
Yes, unfortunately the current reads_filter and hinging modules (now "hinge filter" and "hinge layout") still load the entire las file into memory.
We are working on changing our memory handling, and we will update you once we have news on that front.
Thanks. I have only 512GB of RAM. I will stop the process.
Any news on the ability of HINGE to handle very large LAS files? I would still like to try it
Yes, you can. Here is a script you can base yours on: we split the large las file into smaller files and read them in one at a time.
LAmerge yeast.las $(find . -regex '.*/yeast.[0-9]+\.las' -exec basename {} \;)
rm $(find . -regex '.*/yeast.[0-9]+\.las' -exec basename {} \;)
LAsplit -v yeast.# 30 < yeast.las
DASqv -c100 yeast yeast.las
mkdir log
hinge filter --db yeast --las yeast --mlas -x yeast --config nominal.ini
hinge maximal --db yeast --las yeast --mlas -x yeast --config nominal.ini
hinge layout --db yeast --las yeast --mlas -x yeast --config nominal.ini -o yeast
hinge clip yeast.edges.hinges yeast.hinge.list demo
hinge draft-path $PWD yeast yeastdemo.G3.graphml
hinge draft --db yeast --las yeast.las --prefix yeast --config nominal.ini --out yeast.draft
hinge correct-head yeast.draft.fasta yeast.draft.pb.fasta draft_map.txt
fasta2DB draft yeast.draft.pb.fasta
HPC.daligner yeast draft | bash -v
rm draft.*.yeast.*.las
LAmerge draft.yeast.las draft.yeast.*.las
hinge consensus draft yeast draft.yeast.las yeast.consensus.fasta nominal.ini
hinge gfa $PWD yeast yeast.consensus.fasta
I have a similar problem, with a joined .las file being 2 TB large (a combination of PacBio and Nanopore data). hinge filter ran out of 512 GB of RAM in about 10 minutes :)
While merging all the daligner-produced individual .las files (I'm following the ecoli and ecoli-nanopore demo scripts), I have also kept the separate .las files (35 x 60 GB); using those, I then tried --mlas:
hinge filter --db hinge --las hinge --mlas -x hinge_tmp --config nominal.ini
This appeared to read and process each file separately - but memory was not freed after each file, so after about 5 files Reads_filter ran out of RAM and was killed.
Am I missing some additional arguments?
I can try re-running daligner with higher -h and -l (as suggested in one of the other issues), but if there is a simple fix to running hinge filter, then I'd rather use the fix instead :)
Right now I've installed Hinge from the master branch, and I'm willing to test any memory-freeing patches/fixes on my dataset.
Update: I guess the problem might come from re-using the hinge.xx.las files, instead of merging-then-splitting-again, as in your example?
It looks like our interface has trouble with a recent daligner change (and our code to free memory is not behaving as we expect). We're working on this and will get back with a fix ASAP. Thanks for pointing this out to us.
@govinda-kamath , thank you for a quick response!
Some additional information:
- daligner: exactly the commit from 2016 that the submodule was pointing to. However, for a re-run (next point) I'm now using the latest daligner (because of the per-thread file handling improvement).
- DBdust was also run, and the daligner -l and -h values were increased twofold (to 2000 and 70, respectively). The final merged file is now 817 GB; trying to split it and run hinge filter again...
Hi, @spock, in the latest master branch we addressed the problem of memory footprint. You should be able to run it with the 35 * 60 GB las files.
Great, thank you! I'll make a test run within the next few days. Hmm, I'll actually start right now :)
Thanks 👍 , keep us posted :)
@fxia22 , thank you for this quick fix!
It does work now: processing is already beyond the previous "out of RAM" point, and after processing 7 files RES is ~216 GB, and it only grows a little with each new file. The process hasn't finished yet, but I'm now optimistic about it :)
P.S. I'm slightly concerned about the pattern of RAM allocation/freeing, but I haven't looked at the code long enough to know whether this is supposed to be so. I was monitoring in top (only the %MEM column, so I don't know whether VIRT and RES were changing independently) at the moment of switching to the next file, and I haven't seen any memory release - it seems that the currently-allocated memory is instead marked as free internally and then re-used for the next file. If this is the expected behavior, then all is fine :)
Thanks for your feedback @spock .
Yes, this is the expected behavior: it frees the memory and re-allocates when switching from file to file. We did some benchmarking using 7 .las blocks, and the memory usage looks like this. The dip is where the free and re-allocation happen.
I see - I guess I failed to notice the release/re-allocation pattern because of top's default refresh rate of 1+ seconds...
P.S. Nice RAM use plot! How did you make it?
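For anyone wanting to record such a memory profile themselves, a minimal sketch that samples a process's resident set size from /proc once per second into a CSV (the mem_log helper is hypothetical, not part of HINGE; assumes Linux):

```shell
# Hypothetical helper: append "timestamp,rss_kB" lines while the process lives.
mem_log() {
  pid=$1; out=$2
  while kill -0 "$pid" 2>/dev/null; do
    awk -v t="$(date +%s)" '/^VmRSS/ {print t "," $2}' "/proc/$pid/status" 2>/dev/null >> "$out"
    sleep 1
  done
}
# usage: start "hinge filter ..." in the background, then: mem_log $! filter_rss.csv
```

The resulting CSV can then be plotted with gnuplot or matplotlib.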
Hello, I have been running FALCON for a while on a large set of PacBio reads, and I was wondering whether I could reuse the all-pairs daligner step that FALCON carries out and feed those results into HINGE. Is this possible? If so, is it just a matter of renaming files? Please advise. Thanks.
Stefano
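Going by the cp commands and the symlinked directory listing earlier in this thread, reusing FALCON's output is mostly a matter of placing the Dazzler DB and the block las files where HINGE expects them. A shell sketch (the link_falcon_output helper is made up; it assumes FALCON's standard 0-rawreads layout):

```shell
# Made-up helper: expose FALCON's daligner output in a fresh HINGE run directory.
link_falcon_output() {
  src=$(cd "$1" && pwd)   # FALCON's 0-rawreads directory (made absolute)
  dst=$2                  # directory to run HINGE in
  mkdir -p "$dst"
  ln -sf "$src/raw_reads.db" "$dst/"
  ln -sf "$src"/.raw_reads.* "$dst/"       # hidden .idx/.bps index files
  cp "$src"/raw_reads.[0-9]*.las "$dst/"   # block las files, to be LAmerge'd
}
# usage: link_falcon_output /path/to/falcon_run/0-rawreads hinge_run
```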