DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP

update_subgraph_counts.py fails for ANVIL_T2T_CHRY dataset #6206

Closed by achave11-ucsc 3 months ago

achave11-ucsc commented 6 months ago

update_subgraph_counts.py fails for the ANVIL_T2T_CHRY dataset when the hammerbox deployment is selected.

To reproduce, apply the following patch to add the ANVIL_T2T_CHRY_20240301_ANV5_202403040508 dataset…

diff --git a/deployments/anvilprod/environment.py b/deployments/anvilprod/environment.py
--- a/deployments/anvilprod/environment.py  (revision 8d5ac3a32d686421d11a50fef9d98e3e97e0f439)
+++ b/deployments/anvilprod/environment.py  (date 1714165556983)
@@ -383,7 +383,7 @@
     # @formatter:on
 ]))

-anvil6_sources = mkdict(anvil5_sources, 255, mkdelta([
+anvil6_sources = mkdict(anvil5_sources, 256, mkdelta([
     # @formatter:off
     mksrc('datarepo-38af6304', 'ANVIL_1000G_PRIMED_data_model_20240410_ANV5_202404101419', 14695),
     mksrc('datarepo-ed4892b5', 'ANVIL_ALSCompute_Collection_GRU_20231016_ANV5_202404081553', 14593),
@@ -616,6 +616,7 @@
     mksrc('datarepo-4a4eec27', 'ANVIL_PAGE_SoL_HMB_WGS_20221220_ANV5_202403040445', 234, pop),
     mksrc('datarepo-a1f917db', 'ANVIL_PAGE_Stanford_Global_Reference_Panel_GRU_WGS_20221128_ANV5_202403040453', 78, pop), # noqa E501
     mksrc('datarepo-6264931f', 'ANVIL_PAGE_WHI_HMB_IRB_WGS_20221019_ANV5_202403040500', 235, pop),
+    mksrc('datarepo-e5b16a5a', 'ANVIL_T2T_CHRY_20240301_ANV5_202403040508', 309979),
     mksrc('datarepo-f3817357', 'ANVIL_ccdg_asc_ndd_daly_talkowski_AGRE_asd_exome_20221102_ANV5_202403040528', 850),
     mksrc('datarepo-23635d1c', 'ANVIL_ccdg_asc_ndd_daly_talkowski_IBIS_asd_exome_20221024_ANV5_202403040537', 241),
     mksrc('datarepo-ecf311e7', 'ANVIL_ccdg_asc_ndd_daly_talkowski_TASC_asd_exome_20221117_ANV5_202403040544', 3175),
diff --git a/deployments/hammerbox/environment.py b/deployments/hammerbox/environment.py
--- a/deployments/hammerbox/environment.py  (revision 8d5ac3a32d686421d11a50fef9d98e3e97e0f439)
+++ b/deployments/hammerbox/environment.py  (date 1714165556983)
@@ -396,7 +396,7 @@
     # @formatter:on
 ]))

-anvil6_sources = mkdict(anvil5_sources, 255, mkdelta([
+anvil6_sources = mkdict(anvil5_sources, 256, mkdelta([
     # @formatter:off
     mksrc('datarepo-38af6304', 'ANVIL_1000G_PRIMED_data_model_20240410_ANV5_202404101419', 14695),
     mksrc('datarepo-ed4892b5', 'ANVIL_ALSCompute_Collection_GRU_20231016_ANV5_202404081553', 14593),
@@ -629,6 +629,7 @@
     mksrc('datarepo-4a4eec27', 'ANVIL_PAGE_SoL_HMB_WGS_20221220_ANV5_202403040445', 234, pop),
     mksrc('datarepo-a1f917db', 'ANVIL_PAGE_Stanford_Global_Reference_Panel_GRU_WGS_20221128_ANV5_202403040453', 78, pop), # noqa E501
     mksrc('datarepo-6264931f', 'ANVIL_PAGE_WHI_HMB_IRB_WGS_20221019_ANV5_202403040500', 235, pop),
+    mksrc('datarepo-e5b16a5a', 'ANVIL_T2T_CHRY_20240301_ANV5_202403040508', 309979),
     mksrc('datarepo-f3817357', 'ANVIL_ccdg_asc_ndd_daly_talkowski_AGRE_asd_exome_20221102_ANV5_202403040528', 850),
     mksrc('datarepo-23635d1c', 'ANVIL_ccdg_asc_ndd_daly_talkowski_IBIS_asd_exome_20221024_ANV5_202403040537', 241),
     mksrc('datarepo-ecf311e7', 'ANVIL_ccdg_asc_ndd_daly_talkowski_TASC_asd_exome_20221117_ANV5_202403040544', 3175),

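The patch is in git's unified diff format, so assuming it is saved to a file (the filename here is arbitrary), it should apply with:

git apply add-t2t-chry.patch
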
… then run the following command with hammerbox selected:

python scripts/update_subgraph_counts.py --old-catalog anvil5 --catalog anvil6

The following is an excerpt of the failed execution (the successful subgraph counts are omitted, except for the last one, which is kept for reference):

python scripts/update_subgraph_counts.py --old-catalog anvil5 --catalog anvil6
anvil6
------
…
'ANVIL_NIMH_CIRM_FCDI_ConvergentNeuro_McCarroll_Eggan_GRU_Arrays_20230109_ANV5_202402292215', 6510)
BigQuery job error during attempt 1/5. Retrying in 10s.
Traceback (most recent call last):
  File "/Users/achave11/Pycharm/Azul/azul.stable/src/azul/terra.py", line 493, in run_sql
    result = job.result()
             ^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/query.py", line 1581, in result
    do_get_result()
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/api_core/retry.py", line 366, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/api_core/retry.py", line 204, in retry_target
    return target()
           ^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/query.py", line 1571, in do_get_result
    super(QueryJob, self).result(retry=retry, timeout=timeout)
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 922, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.InternalServerError: 500 An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 76907651

Location: us-central1
Job ID: a9848d65-24f0-4e93-8c7b-36ba2cdd532f

BigQuery job error during attempt 2/5. Retrying in 20s.
[traceback identical to attempt 1; Location: us-central1, Job ID: c32c98cf-45a5-4da9-ad17-e35d2f665a04]

BigQuery job error during attempt 3/5. Retrying in 40s.
[traceback identical to attempt 1; Location: us-central1, Job ID: 9452ece6-2562-4dba-a3e7-178a7d389429]

BigQuery job error during attempt 4/5. Retrying in 80s.
[traceback identical to attempt 1; Location: us-central1, Job ID: 6ea60378-8bb8-40e5-be3a-448363fd3162]

Traceback (most recent call last):
  File "/Users/achave11/Pycharm/Azul/azul.stable/scripts/update_subgraph_counts.py", line 187, in <module>
    main(sys.argv[1:])
  File "/Users/achave11/Pycharm/Azul/azul.stable/scripts/update_subgraph_counts.py", line 170, in main
    for spec_args in generate_sources(args.catalog, args.old_catalog):
  File "/Users/achave11/Pycharm/Azul/azul.stable/scripts/update_subgraph_counts.py", line 148, in generate_sources
    yield from tpe.map(generate_source, sources)
  File "/Users/achave11/.pyenv/versions/3.11.8/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/.pyenv/versions/3.11.8/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/.pyenv/versions/3.11.8/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/.pyenv/versions/3.11.8/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/achave11/.pyenv/versions/3.11.8/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/scripts/update_subgraph_counts.py", line 139, in generate_source
    prefixed_counter = SubgraphCounter.for_source(plugin, source, counter_prefix)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/scripts/update_subgraph_counts.py", line 71, in for_source
    return cls(partition_sizes=plugin.list_partitions(source))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/src/azul/plugins/repository/tdr_anvil/__init__.py", line 252, in list_partitions
    rows = self._run_sql(f'''
           ^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/src/azul/plugins/repository/tdr.py", line 225, in _run_sql
    return self.tdr.run_sql(query)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/src/azul/terra.py", line 496, in run_sql
    raise e
  File "/Users/achave11/Pycharm/Azul/azul.stable/src/azul/terra.py", line 493, in run_sql
    result = job.result()
             ^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/query.py", line 1581, in result
    do_get_result()
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/api_core/retry.py", line 366, in retry_wrapped_func
    return retry_target(
           ^^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/api_core/retry.py", line 204, in retry_target
    return target()
           ^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/query.py", line 1571, in do_get_result
    super(QueryJob, self).result(retry=retry, timeout=timeout)
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/cloud/bigquery/job/base.py", line 922, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/achave11/Pycharm/Azul/azul.stable/.venv/lib/python3.11/site-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.InternalServerError: 500 An internal error occurred and the request could not be completed. This is usually caused by a transient issue. Retrying the job with back-off as described in the BigQuery SLA should solve the problem: https://cloud.google.com/bigquery/sla. If the error continues to occur please contact support at https://cloud.google.com/support. Error: 76907651

Location: us-central1
Job ID: f57726da-0b27-463f-bc08-6f504e8eb545
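
The log above shows run_sql retrying with exponential back-off (10 s, 20 s, 40 s, 80 s) before giving up on the fifth attempt. A minimal sketch of that pattern, using hypothetical names (run_query, max_attempts) rather than Azul's actual implementation:

import logging
import time

from google.api_core.exceptions import InternalServerError
from google.cloud import bigquery

log = logging.getLogger(__name__)


def run_query(client: bigquery.Client, query: str, max_attempts: int = 5):
    delay = 10
    for attempt in range(1, max_attempts + 1):
        job = client.query(query)
        try:
            return job.result()  # blocks until the job succeeds or fails
        except InternalServerError:
            if attempt == max_attempts:
                raise  # out of attempts; surface the 500 to the caller
            log.warning('BigQuery job error during attempt %d/%d. Retrying in %ds.',
                        attempt, max_attempts, delay)
            time.sleep(delay)
            delay *= 2  # exponential back-off, per the BigQuery SLA guidance
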
achave11-ucsc commented 6 months ago

Assignee to provide complete reproduction in the description of this issue.

dsotirho-ucsc commented 6 months ago

Spike for estimate and optionally design.

nadove-ucsc commented 6 months ago

The actual counting of subgraphs occurs in the RepositoryPlugin.list_partitions method. The current implementation of this method is egregiously inefficient due to a JOIN between every bundle in the snapshot and every possible prefix. For the failing snapshot, that's 309,979 bundles × 65,536 possible prefixes (presumably all 16⁴ four-character hex prefixes), a total of 20,314,783,744 string comparisons.

This appears to be more than the BQ servers can handle. I couldn't find any more evidence as to the underlying cause of the error, but I re-implemented list_partitions to avoid this JOIN and the error now appears to be resolved.
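
For illustration, a hypothetical sketch of the two query shapes (the table name, column name, and query parameters below are invented; this is not Azul's actual SQL):

# Inefficient shape: cross-join every bundle against every candidate prefix and
# filter with STARTS_WITH -- O(bundles x prefixes) string comparisons.
slow_query = '''
    SELECT prefix, COUNT(*) AS subgraph_count
    FROM `snapshot.bundles`
    CROSS JOIN UNNEST(@prefixes) AS prefix  -- e.g. all 65,536 hex prefixes
    WHERE STARTS_WITH(bundle_id, prefix)
    GROUP BY prefix
'''

# Join-free shape: derive each bundle's prefix directly and aggregate;
# a single linear pass over the table.
fast_query = '''
    SELECT SUBSTR(bundle_id, 1, @prefix_length) AS prefix,
           COUNT(*) AS subgraph_count
    FROM `snapshot.bundles`
    GROUP BY prefix
'''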

The estimate for the implementation is covered by the spike.

nadove-ucsc commented 6 months ago

I've confirmed that the script's output has not changed as a result of the re-implementation.

hannes-ucsc commented 6 months ago

For demo, attempt to reproduce.