broadinstitute / genetic-prevalence-estimator

https://genie.broadinstitute.org/
BSD 3-Clause "New" or "Revised" License

Errors and long queuing #149

Open sambaxter opened 1 year ago

sambaxter commented 1 year ago

I needed to create the variant lists for all the CZI and groups, so I set up 28 lists in rapid succession. The last 8 seem to be stuck: 5 are in the queue and 3 are returning errors.

The 5 stuck in the queue are:

- https://genie.broadinstitute.org/variant-lists/d3d8d871-842d-47d5-8602-c88fbb55b183/
- https://genie.broadinstitute.org/variant-lists/9f811c70-3f18-4557-adc3-dd6f4c2d34af/
- https://genie.broadinstitute.org/variant-lists/82551a9f-3872-43a9-ad85-4dde00e82c83/
- https://genie.broadinstitute.org/variant-lists/deba0245-af4f-46da-a58b-f0311eb9ce68/
- https://genie.broadinstitute.org/variant-lists/466a984a-b7e4-4a0b-9cf6-7955faa80cf5/

The 3 with errors are:

- https://genie.broadinstitute.org/variant-lists/8d727ca5-e593-4f5d-940b-e0166dac6808/
- https://genie.broadinstitute.org/variant-lists/058e3b81-6722-4f36-9390-b615d74403d0/
- https://genie.broadinstitute.org/variant-lists/9d91a313-fc8f-4fad-93c1-786f326e3ec7/

The error text returned is:

Traceback (most recent call last):
  File "/app/worker/src/worker/tasks.py", line 416, in process_variant_list
    _process_variant_list(variant_list)
  File "/app/worker/src/worker/tasks.py", line 353, in _process_variant_list
    gnomad = hl.read_table(
  File "<decorator-gen-1362>", line 2, in read_table
  File "/usr/local/lib/python3.9/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/usr/local/lib/python3.9/site-packages/hail/methods/impex.py", line 2576, in read_table
    ht = Table(ir.TableRead(tr, False))
  File "/usr/local/lib/python3.9/site-packages/hail/table.py", line 343, in __init__
    self._type = self._tir.typ
  File "/usr/local/lib/python3.9/site-packages/hail/ir/base_ir.py", line 339, in typ
    self._compute_type()
  File "/usr/local/lib/python3.9/site-packages/hail/ir/table_ir.py", line 248, in _compute_type
    self._type = Env.backend().table_type(self)
  File "/usr/local/lib/python3.9/site-packages/hail/backend/spark_backend.py", line 288, in table_type
    jir = self._to_java_table_ir(tir)
  File "/usr/local/lib/python3.9/site-packages/hail/backend/spark_backend.py", line 275, in _to_java_table_ir
    return self._to_java_ir(ir, self._parse_table_ir)
  File "/usr/local/lib/python3.9/site-packages/hail/backend/spark_backend.py", line 268, in _to_java_ir
    ir._jir = parse(r(ir), ir_map=r.jirs)
  File "/usr/local/lib/python3.9/site-packages/hail/backend/spark_backend.py", line 243, in _parse_table_ir
    return self._jbackend.parse_table_ir(code, ref_map, ir_map)
  File "/usr/local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/usr/local/lib/python3.9/site-packages/hail/backend/py4j_backend.py", line 29, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: OutOfMemoryError: Java heap space

Java stack trace:
java.lang.OutOfMemoryError: Java heap space
    at 

Hail version: 0.2.85-9b98676b6ad8
Error summary: OutOfMemoryError: Java heap space
sambaxter commented 1 year ago

I tried to create a list via the recommended process (https://genie.broadinstitute.org/variant-lists/3d87a5f8-f350-4c20-a471-104c9c2c2bdf/) and got a similar but slightly different error:

Traceback (most recent call last):
  File "/app/worker/src/worker/tasks.py", line 416, in process_variant_list
    _process_variant_list(variant_list)
  File "/app/worker/src/worker/tasks.py", line 344, in _process_variant_list
    recommended_variants = get_recommended_variants(metadata, transcript)
  File "/app/worker/src/worker/tasks.py", line 197, in get_recommended_variants
    ds = hl.read_table(
  File "<decorator-gen-1362>", line 2, in read_table
  File "/usr/local/lib/python3.9/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/usr/local/lib/python3.9/site-packages/hail/methods/impex.py", line 2576, in read_table
    ht = Table(ir.TableRead(tr, False))
  File "/usr/local/lib/python3.9/site-packages/hail/table.py", line 343, in __init__
    self._type = self._tir.typ
  File "/usr/local/lib/python3.9/site-packages/hail/ir/base_ir.py", line 339, in typ
    self._compute_type()
  File "/usr/local/lib/python3.9/site-packages/hail/ir/table_ir.py", line 248, in _compute_type
    self._type = Env.backend().table_type(self)
  File "/usr/local/lib/python3.9/site-packages/hail/backend/spark_backend.py", line 288, in table_type
    jir = self._to_java_table_ir(tir)
  File "/usr/local/lib/python3.9/site-packages/hail/backend/spark_backend.py", line 275, in _to_java_table_ir
    return self._to_java_ir(ir, self._parse_table_ir)
  File "/usr/local/lib/python3.9/site-packages/hail/backend/spark_backend.py", line 268, in _to_java_ir
    ir._jir = parse(r(ir), ir_map=r.jirs)
  File "/usr/local/lib/python3.9/site-packages/hail/backend/spark_backend.py", line 243, in _parse_table_ir
    return self._jbackend.parse_table_ir(code, ref_map, ir_map)
  File "/usr/local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/usr/local/lib/python3.9/site-packages/hail/backend/py4j_backend.py", line 29, in deco
    raise FatalError('%s\n\nJava stack trace:\n%s\n'
hail.utils.java.FatalError: OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects

Java stack trace:
java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
    at 

Hail version: 0.2.85-9b98676b6ad8
Error summary: OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
nawatts commented 1 year ago

Looking at Error Reporting, I think there are a few things going on here:

  1. Hail may run out of memory processing a variant list. We should look into increasing the amount of memory allocated to the worker, and check that the system can handle TTN and other large genes.
  2. I suspect that when an error like this causes Hail's Java backend to fail, the Python part of the worker keeps running but then can't process any more variant lists. Maybe we should check if Hail is still working after processing a variant list fails and if it's not, restart Hail or the worker.
  3. There are some error logs saying "The request was aborted because there was no available instance". This may be related to many variant lists being created in a short time. We should look into the PubSub configuration and how long requests will wait / how many times they'll be retried. There is a "dead letter" policy configured, but I'm not sure if anything is set up to handle requests sent to that topic. I suspect this is how variant lists ended up stuck in the "Queued" state.
  4. We should also set up a notification channel for Error Reporting, so that we get notified about errors like this instead of relying on users to report them.
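
For item 2, a rough sketch of what the post-failure check could look like. This is not the worker's actual code: `process_with_recovery`, `backend_is_healthy` and `restart_backend` are hypothetical names, and the Hail wiring is only shown in comments as one possible assumption (a trivial `hl.eval` as the probe, `hl.stop()` followed by `hl.init()` as the restart).

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def process_with_recovery(
    task: Callable[[], None],
    backend_is_healthy: Callable[[], bool],
    restart_backend: Callable[[], None],
) -> bool:
    """Run `task`; if it raises, probe the backend and restart it if the probe fails.

    Returns True if the task succeeded, False otherwise.
    """
    try:
        task()
        return True
    except Exception:
        # Log the failure, then fall through to the health probe.
        logger.exception("Processing failed")

    try:
        healthy = backend_is_healthy()
    except Exception:
        # A probe that raises is treated the same as an unhealthy backend.
        healthy = False

    if not healthy:
        logger.warning("Backend unresponsive after failure; restarting")
        restart_backend()
    return False


# Possible wiring for Hail (assumption, not tested against the worker):
#
#   import hail as hl
#
#   def hail_is_healthy() -> bool:
#       return hl.eval(hl.literal(1) + 1) == 2
#
#   def restart_hail() -> None:
#       hl.stop()
#       hl.init()
#
#   process_with_recovery(
#       lambda: _process_variant_list(variant_list), hail_is_healthy, restart_hail
#   )
```

Passing the probe and restart in as callables keeps the recovery logic testable without a JVM. If restarting Hail in-process turns out to be unreliable after an OutOfMemoryError, `restart_backend` could instead exit the process and let the platform start a fresh worker.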