broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License

ScramblePart1 139 error code #652

Open MattWellie opened 4 months ago

MattWellie commented 4 months ago

Bug Report

Affected module(s) or script(s)

wdl/GatherSampleEvidence

Affected version(s)

This codebase as-of this commit

Description

I've seen a couple of failing jobs recently where ScramblePart1 terminates with exit code 139. It looks like this is a signal from the OS/hypervisor, raised when the tool tries to access memory it has no permission to use. It is always coupled with a Wham failure. Is the 139 likely to be meaningful in its own right, or could it be a kill signal sent because a separate part of the workflow had failed and all jobs needed to be stopped? N.b. these samples had been running for an entire week on this variant-calling stage, which typically takes only a few hours. The trace below is from a re-run that picked up the prior run's results, e.g.

Jobs:
  [#] LocalizeReads (26s)
    Call caching: true
  [!] Whamg (5h:19m:10s)
    stdout: None
    stderr: None
    rc: None
    error: Workflow failed, caused by: Task Whamg.RunWhamgOnCram:NA:2 failed. Job exit code 137. (...)
  [#] CollectCounts (18s)
    Call caching: true
  [#] Manta (33s)
  [#] CollectSVEvidence (26s)
  [!] Scramble (1h:0m:57s)
    stdout: None
    stderr: None
    rc: None
    error: Workflow failed, caused by: Job Scramble.ScramblePart1:NA:2 exited with return code 139 which has not been declared as a valid return code
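
For readers comparing the two exit codes in the trace above, here is a minimal sketch that decodes them. It assumes the usual POSIX shell convention that an exit code above 128 means "terminated by signal (code - 128)", and it assumes a Linux host for the signal numbering. The helper name `describe_exit_code` is made up for illustration and is not part of gatk-sv or Cromwell.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe an exit code, assuming the shell's 128 + N convention for signals."""
    if code > 128:
        try:
            # Look up the signal whose number is (code - 128) on this platform.
            return signal.Signals(code - 128).name
        except ValueError:
            return f"exit {code} (no matching signal)"
    return f"exit {code}"


if __name__ == "__main__":
    # The two codes reported in the trace: Whamg (137) and ScramblePart1 (139).
    for code in (137, 139):
        print(code, describe_exit_code(code))
```

Under those assumptions, 137 maps to SIGKILL and 139 to SIGSEGV, so the two tasks appear to have stopped for different reasons: Whamg was killed from outside, while ScramblePart1 hit a segmentation fault of its own. SIGKILL is often seen when a container exceeds its memory limit, but the trace itself does not confirm that here.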

mwalker174 commented 4 months ago

Hi @MattWellie, that is unusual and not something we've run into before. Is it possible this is an issue with the CRAM? I would maybe try recompressing with the latest samtools and see if that resolves the issue.
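
One way the recompression suggested above could be scripted is sketched below. It assumes `samtools` is installed and on `PATH`, and that the reference FASTA the CRAM was originally written against is available locally. The function names and the file paths in the usage comment are hypothetical. The sketch uses only `samtools view -C -T <ref> -o <out> <in>` to rewrite the CRAM and `samtools quickcheck` as a basic integrity check on the result.

```python
import shutil
import subprocess


def recompress_cmd(in_cram: str, ref_fasta: str, out_cram: str) -> list[str]:
    """Build a samtools command that rewrites in_cram as a fresh CRAM."""
    # -C: write CRAM output, -T: reference FASTA, -o: output path
    return ["samtools", "view", "-C", "-T", ref_fasta, "-o", out_cram, in_cram]


def recompress(in_cram: str, ref_fasta: str, out_cram: str) -> None:
    """Recompress a CRAM, then run a basic integrity check on the new file."""
    if shutil.which("samtools") is None:
        raise RuntimeError("samtools not found on PATH")
    subprocess.run(recompress_cmd(in_cram, ref_fasta, out_cram), check=True)
    # quickcheck exits non-zero if the file looks truncated or malformed.
    subprocess.run(["samtools", "quickcheck", out_cram], check=True)


# Example call (hypothetical paths):
# recompress("sample.cram", "reference.fasta", "sample.recompressed.cram")
```

If the recompressed file passes `quickcheck` and ScramblePart1 still exits with 139 on it, that would point away from the CRAM as the cause.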