PacificBiosciences / FALCON_unzip

Making diploid assembly common practice for genomic studies
BSD 3-Clause Clear License

unzip uses too much memory #111

Closed nottwy closed 6 years ago

nottwy commented 6 years ago

Dear developer,

We found that the unzip module 'rr_ctg_track.py' tries to read all .las files into memory, and we have around 20 TB of .las files. It's hard to find a machine with that much memory. Do you have any suggestions for avoiding loading all the data into memory?

Thank you!

nottwy commented 6 years ago

This issue (https://github.com/marbl/canu/issues/838) was created by me and describes the same problem. You can find more information there.

pb-cdunn commented 6 years ago

The memory is probably not consumed by rr_ctg_track.py directly. That program spawns LA4Falcon for each .las file, so you will have a number of LA4Falcon instances running equal to your --n-core argument. (Each runs in a separate sub-process, so the memory used by rr_ctg_track is cloned. That's probably not a problem, but you can look at the forked Python processes on your machine.)
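For intuition, here is a minimal sketch (not FALCON's actual code) of the fan-out pattern described above: one LA4Falcon subprocess per .las file, with concurrency bounded by --n-core. The DB name, .las paths, and LA4Falcon flags are placeholders.

```python
# Illustrative sketch of a per-.las fan-out bounded by --n-core.
# At most n_core LA4Falcon processes (and DB copies) are resident at once.
import multiprocessing
import subprocess

def run_la4falcon(las_path):
    # Each call spawns its own LA4Falcon; that process loads the whole DAZZLER DB.
    # The command line here is a placeholder, not the exact FALCON invocation.
    return subprocess.check_output(["LA4Falcon", "raw_reads", las_path])

def track(las_files, n_core):
    if n_core == 0:
        # Serial path: no multiprocessing module at all, one LA4Falcon at a time.
        return [run_la4falcon(p) for p in las_files]
    with multiprocessing.Pool(n_core) as pool:
        # Forked workers inherit the parent's memory (copy-on-write on Linux),
        # so the cloned Python side is cheap; the cost is n_core concurrent DB loads.
        return pool.map(run_la4falcon, las_files)
```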

Each LA4Falcon loads the entire DAZZLER DB, which is probably your problem. (Look at the file 0-rawreads/.raw_reads.bps.) There are two solutions:

  1. Hack our code to load the DB from /dev/shm. (Non-trivial, but one user has done this.)
  2. Use --n-core=0. (Same as --n-core=1, but simpler, since it avoids the whole "multiprocessing" module.)

You can experiment with various values of --n-core.
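As a rough way to pick a value, you could check the size of the DB file mentioned above and multiply by the number of concurrent LA4Falcon processes. A small, approximate sketch (assumed path; ignores the footprint of the Python processes themselves):

```python
# Back-of-the-envelope peak-memory estimate for a given --n-core:
# the dominant term is roughly n_core * size of the DAZZLER DB file.
import os

def estimate_peak_gb(db_bps_path, n_core):
    db_gb = os.path.getsize(db_bps_path) / 1e9
    # --n-core=0 still runs one LA4Falcon at a time, just without multiprocessing.
    concurrent = max(n_core, 1)
    return concurrent * db_gb

if __name__ == "__main__":
    path = "0-rawreads/.raw_reads.bps"  # path in the assembly working directory
    for n in (0, 4, 8, 16):
        print(n, "->", round(estimate_peak_gb(path, n), 1), "GB (approx.)")
```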

Also, your unzip might be out of date. You could try the Falcon-unzip binary tarball, since the GitHub code is not up to date.

nottwy commented 6 years ago

The explanation is really clear, and I believe the solutions you provided will be useful. I'll try them as you suggested. Thank you.