HingeAssembler / HINGE

Software accompanying "HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution"
http://genome.cshlp.org/content/27/5/747.full.pdf+html?sid=39918b0d-7a7d-4a12-b720-9238834902fd

Question: can I reuse the FALCON alignments in HINGE? #77

Closed StefanoLonardi closed 7 years ago

StefanoLonardi commented 8 years ago

Hello, I have been running FALCON for a while on a large set of PacBio reads, and I was wondering whether I could reuse the all-pairs daligner step that FALCON carries out and feed those results into HINGE. Is this possible? If so, is it just a matter of renaming files? Please advise. Thanks.

Stefano

govinda-kamath commented 7 years ago

Hi Stefano,

We tested it and it does work.

You should do something like

cp raw_reads.[0-9]*.las <directory-to-run-hinge-in>

cp raw_reads.db <directory-to-run-hinge-in>

cp .raw_reads* <directory-to-run-hinge-in>
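Symlinking instead of copying also works (the directory listing later in this thread does exactly that). A sketch, with <falcon-run> as an illustrative placeholder for the FALCON run directory:

ln -s <falcon-run>/0-rawreads/raw_reads.[0-9]*.las .
ln -s <falcon-run>/0-rawreads/raw_reads.db .
ln -s <falcon-run>/0-rawreads/.raw_reads.* .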

and then you can run a script like the following

# Merge the per-block alignments from FALCON's daligner step and compute per-read quality values
LAmerge raw_reads.las raw_reads.*.las
DASqv -c100 raw_reads raw_reads.las

mkdir -p log

# Filter the reads, then build the hinge-annotated overlap graph
Reads_filter --db raw_reads --las raw_reads.las -x raw_reads --config ../utils/nominal.ini
hinging --db raw_reads --las raw_reads.las -x raw_reads --config ../utils/nominal.ini -o raw_reads

# Prune and clip the graph; "demo" is an identifier for this run, appended to the output prefix
python pruning_and_clipping.py raw_reads.edges.hinges raw_reads.hinge.list demo

# Extract the draft paths from the pruned graph and build the draft assembly
python get_draft_path.py $PWD raw_reads raw_readsdemo.G2.graphml
draft_assembly --db raw_reads --las raw_reads.las --prefix raw_reads --config ../utils/nominal.ini --out raw_reads.draft

# Give the draft PacBio-style headers and load it into a DAZZ_DB database
python correct_head.py raw_reads.draft.fasta raw_reads.draft.pb.fasta draft_map.txt
fasta2DB draft raw_reads.draft.pb.fasta

# Map the raw reads back onto the draft
HPC.daligner raw_reads draft | bash -v

# Drop the intermediate block-pair alignments, then merge the mapping results
rm draft.*.raw_reads.*.las
LAmerge draft.raw_reads.las draft.raw_reads.*.las

# Polish the draft into a consensus, then write the consensus graph as GFA
consensus draft raw_reads draft.raw_reads.las raw_reads.consensus.fasta ../utils/nominal.ini

python get_consensus_gfa.py $PWD raw_reads raw_readsdemo.G2.graphml raw_reads.consensus.fasta

Thanks, Govinda.

StefanoLonardi commented 7 years ago

Fantastic ... I will give it a try. Thanks a lot.

StefanoLonardi commented 7 years ago

I have been able to merge all the FALCON .las files, but now "Reads_filter" has been running for five days. I understand that my .las file is big (1.5 TB), but how long should I let it run? Here is the content of my working directory.

lrwxrwxrwx  1 stelo stelo            57 Oct 15 15:16 .raw_reads.bps -> /scratch12/stelo/falcon49x_runA/0-rawreads/.raw_reads.bps
-rw-rw-r--  1 stelo stelo             0 Oct 16 07:45 raw_reads.coverage.txt
lrwxrwxrwx  1 stelo stelo            55 Oct 15 15:16 raw_reads.db -> /scratch12/stelo/falcon49x_runA/0-rawreads/raw_reads.db
-rw-rw-r--  1 stelo stelo             0 Oct 16 07:45 raw_reads.filtered.fasta
-rw-rw-r--  1 stelo stelo             0 Oct 16 07:45 raw_reads.hinges.txt
-rw-rw-r--  1 stelo stelo             0 Oct 16 07:45 raw_reads.homologous.txt
lrwxrwxrwx  1 stelo stelo            57 Oct 15 15:16 .raw_reads.idx -> /scratch12/stelo/falcon49x_runA/0-rawreads/.raw_reads.idx
-rw-rw-r--  1 stelo stelo 1553265615538 Oct 15 01:05 raw_reads.las
-rw-rw-r--  1 stelo stelo             0 Oct 16 07:45 raw_reads.mas
-rw-rw-r--  1 stelo stelo      28323280 Oct 15 21:20 .raw_reads.qual.anno
-rw-rw-r--  1 stelo stelo     308007744 Oct 15 21:20 .raw_reads.qual.data
-rw-rw-r--  1 stelo stelo             0 Oct 16 07:45 raw_reads.repeat.txt
JohnUrban commented 7 years ago

I am not sure whether it has been made more efficient since I tried, but I was unable to get past the Reads_filter step even with 512 GB of RAM for a 343 GB .las file. @govinda-kamath told me at the time that:

Both Reads_filter and hinging load the entire las file into memory.

Are you using >1.5 TB RAM?
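Since both steps keep the whole file resident, a quick pre-flight check is to compare the merged .las size against the machine's RAM, e.g.:

ls -lh raw_reads.las   # size of the merged alignment file
free -h                # total and currently available RAM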

ilanshom commented 7 years ago

Yes, unfortunately the current reads_filter and hinging modules (now "hinge filter" and "hinge layout") still load the entire .las file into memory.

We are working on changing our memory handling, and we will update you once we have news on that front.

StefanoLonardi commented 7 years ago

Thanks. I have only 512 GB of RAM. I will stop the process.

StefanoLonardi commented 7 years ago

Any news on the ability of HINGE to handle very large .las files? I would still like to try it.

govinda-kamath commented 7 years ago

Yes, you can. Here is a script you can base yours on: we split the large .las file into smaller blocks and feed them in one at a time.

# Merge the per-block alignments, then re-split into 30 evenly sized blocks that can be streamed one at a time
LAmerge yeast.las $(find . -regex '.*/yeast.[0-9]+\.las' -exec basename {} \;)
rm $(find . -regex '.*/yeast.[0-9]+\.las' -exec basename {} \;)
LAsplit -v yeast.# 30 < yeast.las

# Compute per-read quality values
DASqv -c100 yeast yeast.las

mkdir log

# Filtering, maximal-read marking, and layout; --mlas tells hinge to load the split blocks one at a time
hinge filter --db yeast --las yeast --mlas -x yeast --config nominal.ini
hinge maximal --db yeast --las yeast --mlas -x yeast --config nominal.ini
hinge layout --db yeast --las yeast --mlas -x yeast --config nominal.ini -o yeast

# Prune and clip the graph; "demo" is an identifier appended to the output prefix
hinge clip yeast.edges.hinges yeast.hinge.list demo

# Extract the draft path and build the draft assembly
hinge draft-path $PWD yeast yeastdemo.G3.graphml
hinge draft --db yeast --las yeast.las --prefix yeast --config nominal.ini --out yeast.draft

# Give the draft PacBio-style headers and load it into a DAZZ_DB database
hinge correct-head yeast.draft.fasta yeast.draft.pb.fasta draft_map.txt
fasta2DB draft yeast.draft.pb.fasta

# Map the raw reads back onto the draft
HPC.daligner yeast draft | bash -v

# Drop the intermediate block-pair alignments, then merge the mapping results
rm draft.*.yeast.*.las
LAmerge draft.yeast.las draft.yeast.*.las

# Polish the draft into a consensus, then write the consensus graph as GFA
hinge consensus draft yeast draft.yeast.las yeast.consensus.fasta nominal.ini
hinge gfa $PWD yeast yeast.consensus.fasta

spock commented 7 years ago

I have a similar problem, with a merged .las file that is 2 TB (a combination of PacBio and Nanopore data). hinge filter ran out of 512 GB of RAM in about 10 minutes :)

While merging all the daligner-produced individual .las files (I'm following the ecoli and ecoli-nanopore demo scripts), I also kept the separate .las files (35 × 60 GB); using those, I then tried --mlas: hinge filter --db hinge --las hinge --mlas -x hinge_tmp --config nominal.ini

This appeared to read and process each file separately, but memory was not freed after each file, so after about five files Reads_filter ran out of RAM and was killed.

Am I missing some additional arguments?

I can try re-running daligner with higher -h and -l (as suggested in one of the other issues), but if there is a simple fix for running hinge filter, then I'd rather use the fix instead :)
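For reference, a stricter daligner re-run along those lines might look like the sketch below; the -h and -l values are illustrative only (higher values report fewer, longer alignments, shrinking the .las files):

HPC.daligner -h70 -l2000 hinge | bash -v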

Right now I've installed Hinge from the master branch, and I'm willing to test any memory-freeing patches/fixes on my dataset.

Update: I guess the problem might come from re-using the hinge.xx.las files directly, instead of merging and then re-splitting, as in your example?
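If that is the cause, the merge-then-split step from the script above would translate to something like this (a sketch, assuming the blocks are named hinge.1.las, hinge.2.las, ...; the block count of 35 is illustrative):

LAmerge hinge.las hinge.*.las
LAsplit -v hinge.# 35 < hinge.las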

govinda-kamath commented 7 years ago

It looks like our interface has trouble with a recent daligner change (and our code to free memory is not behaving as we expect). We're working on this and will get back with a fix ASAP. Thanks for pointing this out to us.

spock commented 7 years ago

@govinda-kamath, thank you for the quick response!

Some additional information:

fxia22 commented 7 years ago

Hi @spock, in the latest master branch we have addressed the memory-footprint problem. You should be able to run it with the 35 × 60 GB .las files.

spock commented 7 years ago

Great, thank you! I'll make a test run within the next few days. Hmm, I'll actually start right now :)

fxia22 commented 7 years ago

Thanks 👍, keep us posted :)

spock commented 7 years ago

@fxia22 , thank you for this quick fix!

It does work now: processing is already past the previous "out of RAM" point, and after processing 7 files RES is ~216 GB, growing only a little with each new file. The process hasn't finished yet, but I'm now optimistic about it :)

P.S. I'm slightly concerned about the pattern of RAM allocation and freeing, but I haven't looked at the code long enough to know whether this is how it is supposed to work. I watched the switch to the next file in top (only the %MEM column, so I don't know whether VIRT and RES changed independently) and saw no memory release; it seems that the currently allocated memory is instead marked free internally and re-used for the next file. If this is the expected behavior, then all is fine :)
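For finer-grained monitoring than top's refresh rate allows, the resident set size can be sampled directly; a minimal sketch, where $PID stands for the hinge process id:

# print RSS (in kB) every 0.2 s until the process exits
while kill -0 "$PID" 2>/dev/null; do ps -o rss= -p "$PID"; sleep 0.2; done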

fxia22 commented 7 years ago

Thanks for your feedback @spock.

Yes, this is expected: it frees the memory and re-allocates when switching from file to file. We did some benchmarking using 7 .las blocks, and the memory usage looks like this; each dip is where the free and re-allocation happen.

[pasted image, 2017-06-29: memory-usage plot across the 7 .las blocks]

spock commented 7 years ago

I see, I guess I failed to notice the release-and-reallocate pattern because of top's default refresh rate of 1+ seconds...

P.S. Nice RAM use plot! How did you make it?

fxia22 commented 7 years ago

Thanks, I used massif to plot this.
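For anyone wanting to reproduce such a plot: massif is Valgrind's heap profiler, so a run might look like the sketch below (the hinge invocation and PID are illustrative); ms_print or massif-visualizer renders the output.

valgrind --tool=massif hinge filter --db yeast --las yeast --mlas -x yeast --config nominal.ini
ms_print massif.out.12345   # 12345 = the profiled process's PID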