SamStudio8 / gretel

An algorithm for recovering haplotypes from metagenomes
https://gretel.readthedocs.io/en/latest/
MIT License

Huge temporary files when running gretel on very large contigs #31

Open SamStudio8 opened 4 years ago

SamStudio8 commented 4 years ago

Although Gretel is not designed for recovering large haplotypes, it should at least try its best. Apparently very large contigs will cause Gretel to write very large temporary files and lead to an OSError.

[...]
  File "/home/epi_mher/miniconda2/envs/py3/lib/python3.5/multiprocessing/heap.py", line 231, in malloc
    (arena, start, stop) = self._malloc(size)
  File "/home/epi_mher/miniconda2/envs/py3/lib/python3.5/multiprocessing/heap.py", line 129, in _malloc
    arena = Arena(length)
  File "/home/epi_mher/miniconda2/envs/py3/lib/python3.5/multiprocessing/heap.py", line 81, in __init__
    assert f.tell() == size
OSError: [Errno 28] No space left on device

First reported by @mherold1 in #30.

SamStudio8 commented 4 years ago

Although this is not desired behaviour, it is not high priority as it is off-label use of gretel.

jsgounot commented 3 years ago

Hi. Do you think this behavior will be resolved or managed at some point? At the very least it should be documented: I almost crashed my computer trying Gretel just minutes ago. It would also be good to state that the VCF file has to be bgzipped, otherwise you get an uninformative PyVCF error. Thanks.

SamStudio8 commented 3 years ago

Hi @jsgounot, thanks for the comment, and I'm sorry about locking up all your storage! I don't intend to resolve this any time soon, as Gretel is designed for local haplotyping on "short" regions (intuition here https://www.biorxiv.org/content/10.1101/2020.08.10.244848v1). I would love to find the time in future to reduce the storage requirements of Hansel to help with this problem, but I can't promise anything. Locking up your machine is totally undesired behaviour though, and I should try to catch this use case with a warning (perhaps one that can be overridden with --force or something). Out of interest, what was the size of the region you specified?
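
The kind of guard described above could be sketched roughly as follows. This is only an illustration, not Gretel's actual code: the 10,000 bp threshold, the `--start`/`--end`/`--force` option names, and the `RegionTooLargeError` and `check_region` names are all hypothetical.

```python
import argparse
import sys
import warnings

# Hypothetical cut-off: the thread only says regions should be "short".
MAX_REGION_BP = 10_000


class RegionTooLargeError(ValueError):
    """Raised when a region exceeds the limit and --force was not given."""


def check_region(start, end, force=False, max_bp=MAX_REGION_BP):
    """Return the region length, refusing oversized regions unless forced."""
    length = end - start + 1
    if length <= 0:
        raise ValueError("end must not be smaller than start")
    if length > max_bp:
        message = (
            f"region spans {length} bp (limit {max_bp} bp); "
            "very large temporary files may be written"
        )
        if not force:
            raise RegionTooLargeError(message + "; pass --force to override")
        # The user opted in, so warn and carry on.
        warnings.warn(message)
    return length


def main(argv=None):
    parser = argparse.ArgumentParser(description="Region size guard (sketch)")
    parser.add_argument("--start", type=int, required=True)
    parser.add_argument("--end", type=int, required=True)
    parser.add_argument("--force", action="store_true")
    args = parser.parse_args(argv)
    try:
        length = check_region(args.start, args.end, force=args.force)
    except RegionTooLargeError as err:
        print(f"error: {err}", file=sys.stderr)
        return 1
    print(f"region of {length} bp accepted")
    return 0
```

Calling `main(["--start", "1", "--end", "500000"])` would refuse to run under these assumptions, while adding `"--force"` would downgrade the refusal to a warning.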

On your second point, I note the requirement is stressed in the README, but you are absolutely right that the CLI should raise an error if the file looks like the wrong format. Thanks. (#33)
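
One possible shape for such a format check, assuming the requirement is bgzip (BGZF) compression: a BGZF file starts with a gzip header whose extra field contains a `BC` subfield, so the first few bytes are enough to tell it apart from plain text or ordinary gzip. The function names below are hypothetical and this is not how Gretel or PyVCF currently validate input.

```python
import struct


def has_bgzf_header(data):
    """Return True if `data` begins with a BGZF (bgzip) block header."""
    if len(data) < 12:
        return False
    # gzip magic bytes, DEFLATE method, and the FEXTRA flag bit (0x04).
    if data[0] != 0x1F or data[1] != 0x8B or data[2] != 8:
        return False
    if not data[3] & 0x04:
        return False
    (xlen,) = struct.unpack("<H", data[10:12])
    extra = data[12:12 + xlen]
    # Walk the extra subfields looking for the two-byte 'BC' payload.
    pos = 0
    while pos + 4 <= len(extra):
        subfield_id = extra[pos:pos + 2]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if subfield_id == b"BC" and slen == 2:
            return True
        pos += 4 + slen
    return False


def looks_bgzipped(path):
    """Check the start of a file for a BGZF header (hypothetical helper)."""
    with open(path, "rb") as handle:
        # 12 header bytes plus the largest possible extra field.
        return has_bgzf_header(handle.read(12 + 65535))
```

A CLI could call `looks_bgzipped(vcf_path)` before opening the VCF and exit with a clear message when it returns False, rather than letting a parser error surface later.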

jsgounot commented 3 years ago

Thanks for the reply. Well, I guess it far exceeded what we can call a short region. I will try again with a real, shorter one (I used a random test BAM file that was far too large, covering hundreds of kb).

SamStudio8 commented 3 years ago

No problem - thanks for taking the time to report. Good luck!

kangxiongbin commented 3 years ago

Hi @SamStudio8, how large a haplotype can Gretel recover? Can I use Gretel to recover bacterial genomes from metagenome data? The genomes of these bacteria may be 2 to 7 Mb in size.

SamStudio8 commented 3 years ago

Gretel is a proposed approach to the local MIH problem (defined in our manuscript here https://academic.oup.com/bioinformatics/article/37/10/1360/5988481) and is designed to recover shorter regions of interest within metagenomic data. In theory it could recover genomes, but in practice those regions are probably too large and will lead to very large, intractable matrices. The longest regions I've recovered with it are on the order of kilobases rather than megabases!
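
To give a feel for why the matrices become intractable, here is a back-of-the-envelope sketch. It assumes a dense array with one cell for every pair of symbols at every pair of variant sites, six symbols, and 8-byte cells. These are illustrative assumptions, not details confirmed in this thread, and the function name is hypothetical.

```python
def dense_pairwise_bytes(n_sites, n_symbols=6, bytes_per_cell=8):
    """Bytes needed for a dense (symbol, symbol, site, site) array.

    Assumed layout for illustration only: storage grows with the
    square of the number of variant sites.
    """
    return (n_symbols ** 2) * (n_sites ** 2) * bytes_per_cell


if __name__ == "__main__":
    for n_sites in (100, 1_000, 10_000, 100_000):
        gigabytes = dense_pairwise_bytes(n_sites) / 1e9
        print(f"{n_sites:>7} sites: {gigabytes:,.3f} GB")
```

Under these assumptions, doubling the number of variant sites quadruples the storage, which is why a region with many thousands of sites quickly outgrows memory and temporary disk space.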
