broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
984 stars, 368 forks

FixVcfHeader: Possible Memory Leak #1438

Open KevinDuringWork opened 4 years ago

KevinDuringWork commented 4 years ago

I have an upstream process that produces a VCF that is

  1. Not sorted
  2. Missing FILTER: PASS
  3. Uncompressed

Scenario:

I need to run Picard:FixVcfHeader on an uncompressed VCF (80MB - 130MB) as a prerequisite to Picard:SortVcf and to further annotation with VCFAnno. FixVcfHeader appears to fix the header immediately but stalls on "Writing the output VCF" and eventually exhausts heap space.

Description:

As a quick sanity check, I added the header manually, sorted the file (SortVcf), then removed the header manually and ran FixVcfHeader on the resulting VCF file. It completed in less than a minute.
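For anyone wanting to reproduce the manual header step, here is a minimal sketch of how it could be scripted. It streams the file line by line, so memory use stays flat regardless of record order. The function name `add_pass_filter` and the exact wording of the PASS description are assumptions for illustration, not something taken from Picard, and the sketch only handles the missing FILTER=PASS line, not any other header problem.

```python
import io

# Assumed wording for the PASS definition; adjust to match your pipeline.
PASS_LINE = '##FILTER=<ID=PASS,Description="All filters passed">\n'


def add_pass_filter(src, dst):
    """Copy VCF text from src to dst, adding a FILTER=PASS header line if absent.

    Records are passed through unchanged and in their original order.
    """
    seen_pass = False
    for line in src:
        if line.startswith("##FILTER=<ID=PASS,"):
            seen_pass = True
        elif line.startswith("#CHROM") and not seen_pass:
            # Insert the missing definition just before the column header line.
            dst.write(PASS_LINE)
            seen_pass = True
        dst.write(line)


# Tiny in-memory demonstration with two unsorted records.
demo_in = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr2\t200\t.\tA\tG\t.\tPASS\t.\n"
    "chr1\t100\t.\tC\tT\t.\tPASS\t.\n"
)
demo_out = io.StringIO()
add_pass_filter(demo_in, demo_out)
fixed = demo_out.getvalue()
```

With real files, the same function can be called on two handles opened with `open(path)` and `open(path, "w")`. The output is still unsorted, so SortVcf would still need to run afterwards.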

Analysis:

My best guess: CREATE_INDEX=false may be ignored, and Picard creating an index over variants in sufficiently random order may cause a kind of "zip bomb" effect. My resources are scarce, and any process even approaching 8GB - 16GB is automatically OOM-killed by the OS.

yfarjoun commented 4 years ago

Could you run your command with VERBOSITY=DEBUG and post the log here?

KevinDuringWork commented 4 years ago

image of cmd line https://pbs.twimg.com/media/ELLoHkVXkAUNt_P?format=png&name=4096x4096

CREATE_INDEX=false is the default, so why is IndexingVariantContextWriter invoked?