The PostCommit Python job is flaky

github-actions[bot] commented 8 months ago

The PostCommit Python job is failing over 50% of the time. Please visit https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python.yml?query=is%3Afailure+branch%3Amaster to see the logs.

shunping commented 8 months ago

It first failed on https://github.com/apache/beam/actions/runs/8210266873.

The failed task is :sdks:python:test-suites:portable:py38:portableWordCountSparkRunnerBatch.

Traceback:

INFO:apache_beam.utils.subprocess_server:Starting service with ('java' '-jar' '/runner/_work/beam/beam/runners/spark/3/job-server/build/libs/beam-runners-spark-3-job-server-2.56.0-SNAPSHOT.jar' '--spark-master-url' 'local[4]' '--artifacts-dir' '/tmp/beam-temp8q8022zi/artifactsg6e8usou' '--job-port' '56313' '--artifact-port' '0' '--expansion-port' '0')
INFO:apache_beam.utils.subprocess_server:Error: A JNI error has occurred, please check your installation and try again
INFO:apache_beam.utils.subprocess_server:Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/beam/vendor/grpc/v1p60p1/io/grpc/BindableService
INFO:apache_beam.utils.subprocess_server:   at java.lang.ClassLoader.defineClass1(Native Method)
INFO:apache_beam.utils.subprocess_server:   at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
INFO:apache_beam.utils.subprocess_server:   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
INFO:apache_beam.utils.subprocess_server:   at java.net.URLClassLoader.defineClass(URLClassLoader.java:473)
INFO:apache_beam.utils.subprocess_server:   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
INFO:apache_beam.utils.subprocess_server:   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
INFO:apache_beam.utils.subprocess_server:   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
INFO:apache_beam.utils.subprocess_server:   at java.security.AccessController.doPrivileged(Native Method)
INFO:apache_beam.utils.subprocess_server:   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
INFO:apache_beam.utils.subprocess_server:   at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
INFO:apache_beam.utils.subprocess_server:   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
INFO:apache_beam.utils.subprocess_server:   at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
INFO:apache_beam.utils.subprocess_server:   at java.lang.Class.getDeclaredMethods0(Native Method)
INFO:apache_beam.utils.subprocess_server:   at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
INFO:apache_beam.utils.subprocess_server:   at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
INFO:apache_beam.utils.subprocess_server:   at java.lang.Class.getMethod0(Class.java:3018)
INFO:apache_beam.utils.subprocess_server:   at java.lang.Class.getMethod(Class.java:1784)
INFO:apache_beam.utils.subprocess_server:   at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:670)
INFO:apache_beam.utils.subprocess_server:   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:652)
INFO:apache_beam.utils.subprocess_server:Caused by: java.lang.ClassNotFoundException: org.apache.beam.vendor.grpc.v1p60p1.io.grpc.BindableService
INFO:apache_beam.utils.subprocess_server:   at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
INFO:apache_beam.utils.subprocess_server:   at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
INFO:apache_beam.utils.subprocess_server:   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
INFO:apache_beam.utils.subprocess_server:   at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
INFO:apache_beam.utils.subprocess_server:   ... 19 more
ERROR:apache_beam.utils.subprocess_server:Started job service with ('java', '-jar', '/runner/_work/beam/beam/runners/spark/3/job-server/build/libs/beam-runners-spark-3-job-server-2.56.0-SNAPSHOT.jar', '--spark-master-url', 'local[4]', '--artifacts-dir', '/tmp/beam-temp8q8022zi/artifactsg6e8usou', '--job-port', '56313', '--artifact-port', '0', '--expansion-port', '0')
ERROR:apache_beam.utils.subprocess_server:Error bringing up service
Traceback (most recent call last):
  File "/runner/_work/beam/beam/sdks/python/apache_beam/utils/subprocess_server.py", line 175, in start
    raise RuntimeError(
RuntimeError: Service failed to start up with error 1
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/examples/wordcount.py", line 111, in <module>
    run()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/examples/wordcount.py", line 106, in run
    output | 'Write' >> WriteToText(known_args.output)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/pipeline.py", line 612, in __exit__
    self.result = self.run()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/pipeline.py", line 586, in run
    return self.runner.run_pipeline(self, self._options)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/runner.py", line 192, in run_pipeline
    return self.run_portable_pipeline(
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/portable_runner.py", line 381, in run_portable_pipeline
    job_service_handle = self.create_job_service(options)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/portable_runner.py", line 296, in create_job_service
    return self.create_job_service_handle(server.start(), options)
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/job_server.py", line 81, in start
    self._endpoint = self._job_server.start()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/runners/portability/job_server.py", line 110, in start
    return self._server.start()
  File "/runner/_work/beam/beam/sdks/python/apache_beam/utils/subprocess_server.py", line 175, in start
    raise RuntimeError(
RuntimeError: Service failed to start up with error 1
> Task :sdks:python:test-suites:portable:py38:portableWordCountSparkRunnerBatch FAILED
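
A `NoClassDefFoundError` at startup usually means the class is missing from the jar that was launched. A minimal sketch for checking that directly, assuming the jar path from the log above is available locally (the helper name `jar_contains_class` is made up for illustration):

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has an entry for the fully qualified class name."""
    # A class a.b.C is stored inside the jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, calling it with the job-server jar and `org.apache.beam.vendor.grpc.v1p60p1.io.grpc.BindableService` would show whether the vendored gRPC class was actually packaged.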

shunping commented 8 months ago

Adding the owner of the commit on which the post-commit job first failed: @damccorm

damccorm commented 8 months ago

I think we can pretty comfortably rule out that change: it was to the YAML SDK, which is unrelated to portableWordCountSparkRunnerBatch. Note that this workflow runs on a schedule, not on commits, though none of the commits in that scheduled window look particularly harmful.

shunping commented 8 months ago

I see. It was red for the last two weeks and flaky before that too.

kennknowles commented 6 months ago

Permared right now

damccorm commented 6 months ago

Only sorta - each component job is actually not permared - e.g. there are 2 successes here, https://github.com/apache/beam/actions/runs/8873798546

The whole workflow is permared just because our flake percentage is so high
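
A rough illustration of why the top-level signal stays red even when each component job mostly passes. This is only a sketch that assumes the jobs fail independently; the pass rates and the job count are hypothetical, not measured:

```python
def workflow_pass_rate(job_pass_rate, num_jobs):
    """Probability that every job passes, assuming independent jobs."""
    return job_pass_rate ** num_jobs


# Hypothetical per-job pass rates, for illustration only.
for rate in (0.99, 0.95, 0.90):
    print(rate, round(workflow_pass_rate(rate, 4), 3))
```

With four jobs that each pass 90% of the time, the whole workflow would pass only about 66% of the time under this assumption.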

kennknowles commented 6 months ago

Yea, let's work out how to get top-level signal.

Abacn commented 6 months ago

The lowest and highest Python versions (3.8, 3.11) run more tests than the middle ones (3.9, 3.10), so it could be those extra tests or tasks that are permared.

kennknowles commented 6 months ago

Could make sense to find a way to get separate top-level signal for Python versions, assuming we can use software engineering to share everything necessary so they don't get out of sync.

Abacn commented 6 months ago

Yeah, we used to have this for Jenkins where each Python PostCommit had its own task

liferoad commented 5 months ago

A Vertex AI package version issue (we do not import this package directly, so it should be fine):


../../build/gradleenv/-1734967050/lib/python3.9/site-packages/vertexai/preview/developer/__init__.py:33
/runner/_work/beam/beam/build/gradleenv/-1734967050/lib/python3.9/site-packages/vertexai/preview/developer/__init__.py:33: DeprecationWarning:
After May 30, 2024, importing any code below will result in an error.
Please verify that you are explicitly pinning to a version of `google-cloud-aiplatform`
(e.g., google-cloud-aiplatform==[1.32.0, 1.49.0]) if you need to continue using this
library.

from vertexai.preview import (
    init,
    remote,
    VertexModel,
    register,
    from_pretrained,
    developer,
    hyperparameter_tuning,
    tabular_models,
)

liferoad commented 5 months ago

A new flaky test in py39, related to https://github.com/apache/beam/issues/29617:

https://ge.apache.org/s/hb7syztoolfhu/console-log?page=17


=================================== FAILURES ===================================
_______________ BigQueryQueryToTableIT.test_big_query_legacy_sql _______________
[gw3] linux -- Python 3.9.19 /runner/_work/beam/beam/build/gradleenv/1398941893/bin/python3.9

self = <apache_beam.io.gcp.big_query_query_to_table_it_test.BigQueryQueryToTableIT testMethod=test_big_query_legacy_sql>

    @pytest.mark.it_postcommit
    def test_big_query_legacy_sql(self):
      verify_query = DIALECT_OUTPUT_VERIFY_QUERY % self.output_table
      expected_checksum = test_utils.compute_hash(DIALECT_OUTPUT_EXPECTED)
      pipeline_verifiers = [
          PipelineStateMatcher(),
          BigqueryMatcher(
              project=self.project,
              query=verify_query,
              checksum=expected_checksum)
      ]

      extra_opts = {
          'query': LEGACY_QUERY,
          'output': self.output_table,
          'output_schema': DIALECT_OUTPUT_SCHEMA,
          'use_standard_sql': False,
          'wait_until_finish_duration': WAIT_UNTIL_FINISH_DURATION_MS,
          'on_success_matcher': all_of(*pipeline_verifiers),
      }
      options = self.test_pipeline.get_full_options_as_args(**extra_opts)
>     big_query_query_to_table_pipeline.run_bq_pipeline(options)

apache_beam/io/gcp/big_query_query_to_table_it_test.py:178:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
apache_beam/io/gcp/big_query_query_to_table_pipeline.py:103: in run_bq_pipeline
    result = p.run()
apache_beam/testing/test_pipeline.py:115: in run
    result = super().run(
apache_beam/pipeline.py:560: in run
    return Pipeline.from_runner_api(
apache_beam/pipeline.py:587: in run
    return self.runner.run_pipeline(self, self._options)
apache_beam/runners/direct/test_direct_runner.py:42: in run_pipeline
    self.result = super().run_pipeline(pipeline, options)
apache_beam/runners/direct/direct_runner.py:117: in run_pipeline
    from apache_beam.runners.portability.fn_api_runner import fn_runner
apache_beam/runners/portability/fn_api_runner/__init__.py:18: in <module>
    from apache_beam.runners.portability.fn_api_runner.fn_runner import FnApiRunner
apache_beam/runners/portability/fn_api_runner/fn_runner.py:68: in <module>
    from apache_beam.runners.portability.fn_api_runner import execution
apache_beam/runners/portability/fn_api_runner/execution.py:62: in <module>
    from apache_beam.runners.portability.fn_api_runner import translations
apache_beam/runners/portability/fn_api_runner/translations.py:55: in <module>
    from apache_beam.runners.worker import bundle_processor
apache_beam/runners/worker/bundle_processor.py:69: in <module>
    from apache_beam.runners.worker import operations
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   KeyError: '__pyx_vtable__'

apache_beam/runners/worker/operations.py:1: KeyError

liferoad commented 5 months ago

The last three runs are green now.

Closing this for now.

shunping commented 5 months ago

Great. Thanks @liferoad

github-actions[bot] commented 5 months ago

Reopening since the workflow is still flaky

liferoad commented 5 months ago

New error:


==================================== ERRORS ====================================
________________ ERROR at setup of ReadTests.test_native_source ________________
[gw5] linux -- Python 3.9.19 /runner/_work/beam/beam/build/gradleenv/1398941893/bin/python3.9

self = <apache_beam.io.gcp.bigquery_tools.BigQueryWrapper object at 0x7f248f59baf0>
project_id = 'apache-beam-testing'
dataset_id = 'python_read_table_17178042710ffd3b', location = None
labels = None

    @retry.with_exponential_backoff(
        num_retries=MAX_RETRIES,
        retry_filter=retry.retry_on_server_errors_and_timeout_filter)
    def get_or_create_dataset(
        self, project_id, dataset_id, location=None, labels=None):
      # Check if dataset already exists otherwise create it
      try:
>       dataset = self.client.datasets.Get(
            bigquery.BigqueryDatasetsGetRequest(
                projectId=project_id, datasetId=dataset_id))

apache_beam/io/gcp/bigquery_tools.py:809:

kennknowles commented 5 months ago

I looked at a couple flakes and could not discern if they represented anything that should be release blocking, so I am moving this to the next release milestone.

liferoad commented 5 months ago

Green for last two days.

github-actions[bot] commented 5 months ago

Reopening since the workflow is still flaky

liferoad commented 5 months ago

_______ ERROR collecting apache_beam/runners/worker/log_handler_test.py ________
apache_beam/runners/worker/log_handler_test.py:34: in <module>
    from apache_beam.runners.worker import bundle_processor
apache_beam/runners/worker/bundle_processor.py:69: in <module>
    from apache_beam.runners.worker import operations
apache_beam/runners/worker/operations.py:1: in init apache_beam.runners.worker.operations
    ???
E   KeyError: '__pyx_vtable__'
________ ERROR collecting apache_beam/runners/worker/opcounters_test.py ________
apache_beam/runners/worker/opcounters_test.py:27: in <module>
    from apache_beam.runners.worker import opcounters
apache_beam/runners/worker/opcounters.py:1: in init apache_beam.runners.worker.opcounters
    ???
E   ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject

https://ge.apache.org/s/w6kem3hrdnwii/console-log/task/:sdks:python:test-suites:direct:py38:tensorflowInferenceTest?anchor=1334&page=2

=========================== short test summary info ============================
ERROR apache_beam/dataframe/transforms_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/dataframe/transforms_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/render_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/render_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/trivial_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/trivial_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/dataflow/dataflow_job_service_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/dataflow/dataflow_job_service_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/interactive/interactive_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/interactive/interactive_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/interactive/utils_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/interactive/utils_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/flink_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/flink_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/flink_uber_jar_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/flink_uber_jar_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/local_job_service_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/local_job_service_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/portable_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/portable_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/samza_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/samza_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_java_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_java_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_uber_jar_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/spark_uber_jar_job_server_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/fn_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/fn_runner_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/translations_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/translations_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/portability/fn_api_runner/trigger_manager_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/bundle_processor_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/log_handler_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/opcounters_test.py - ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject
ERROR apache_beam/runners/portability/fn_api_runner/trigger_manager_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/bundle_processor_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/log_handler_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/opcounters_test.py - ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject
ERROR apache_beam/runners/worker/sdk_worker_main_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/sdk_worker_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/sideinputs_test.py - ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject
ERROR apache_beam/runners/worker/sdk_worker_main_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/sdk_worker_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/runners/worker/sideinputs_test.py - ValueError: apache_beam.utils.counters.Counter size changed, may indicate binary incompatibility. Expected 56 from C header, got 32 from PyObject
ERROR apache_beam/testing/load_tests/microbenchmarks_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/transforms/combinefn_lifecycle_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/testing/load_tests/microbenchmarks_test.py - KeyError: '__pyx_vtable__'
ERROR apache_beam/transforms/combinefn_lifecycle_test.py - KeyError: '__pyx_vtable__'
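
The `KeyError: '__pyx_vtable__'` and `size changed` errors above come from Cython-compiled modules. One possible cause is compiled extensions that are out of sync with their sources. A heuristic sketch for spotting that, assuming compiled files sit next to their `.pyx`/`.pxd` sources (the helper `find_stale_extensions` is hypothetical and not part of Beam):

```python
import os


def find_stale_extensions(root):
    """Return compiled extension files older than a matching Cython source.

    Heuristic only: compares modification times of foo.pyx / foo.pxd with
    any foo.*.so or foo.*.pyd file in the same directory.
    """
    stale = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            base, ext = os.path.splitext(name)
            if ext not in (".pyx", ".pxd"):
                continue
            source_mtime = os.path.getmtime(os.path.join(dirpath, name))
            for other in filenames:
                # Compiled modules look like foo.cpython-39-x86_64-linux-gnu.so
                if other.startswith(base + ".") and other.endswith((".so", ".pyd")):
                    binary = os.path.join(dirpath, other)
                    if os.path.getmtime(binary) < source_mtime:
                        stale.add(binary)
    return sorted(stale)
```

Running it over the `apache_beam` source tree on a runner would list candidates to rebuild. An empty result does not rule out a binary mismatch.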

jrmccluskey commented 4 months ago

No cython issues in recent runs, just a number of flakes for tests with external connections (GCSIO, RRIO) that aren't consistent across Python versions or different runs

Abacn commented 3 months ago

Currently the Python 3.12 Dataflow test suite has two tests failing consistently:

apache_beam/ml/inference/sklearn_inference_it_test.py::SklearnInference::test_sklearn_mnist_classification 

apache_beam/ml/inference/sklearn_inference_it_test.py::SklearnInference::test_sklearn_mnist_classification_large_model

Error:

 subprocess.CalledProcessError: Command '['/runner/_work/beam/beam/build/gradleenv/2050596100/bin/python3.12', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/tmp/tmpoq1ebvgy/tmp_requirements.txt', '--exists-action', 'i', '--no-deps', '--implementation', 'cp', '--abi', 'cp312', '--platform', 'manylinux2014_x86_64']' returned non-zero exit status 1.

Error compiling Cython file:

sklearn/utils/_vector_sentinel.pyx:31:9: Previous declaration is here

Cannot install sklearn from source using cython

This happened as early as https://github.com/apache/beam/commits/5b2bfe96f83a5631c3a8d5c3b92a0f695ffe2d7d

Abacn commented 3 months ago

We need to bump the sklearn requirements here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/inference/sklearn_examples_requirements.txt

github-actions[bot] commented 3 months ago

Reopening since the workflow is still flaky

github-actions[bot] commented 2 months ago

Reopening since the workflow is still flaky

liferoad commented 2 months ago

2024-08-30T07:28:39.6571287Z if setup_options.setup_file is not None:
2024-08-30T07:28:39.6571763Z if not os.path.isfile(setup_options.setup_file):
2024-08-30T07:28:39.6572227Z > raise RuntimeError(
2024-08-30T07:28:39.6572923Z 'The file %s cannot be found. It was specified in the '
2024-08-30T07:28:39.6573578Z '--setup_file command line option.' % setup_options.setup_file)
2024-08-30T07:28:39.6574970Z E RuntimeError: The file /runner/_work/beam/beam/sdks/python/apache_beam/examples/complete/juliaset/src/setup.py cannot be found. It was specified in the --setup_file command line option.

https://productionresultssa6.blob.core.windows.net/actions-results/9f18d66f-dabf-46e8-8b29-ae50d075f3dd/workflow-job-run-912db29d-d57b-5850-6efb-b125ca814b95/logs/job/job-logs.txt?rsct=text%2Fplain&se=2024-08-30T14%3A06%3A43Z&sig=aqESnfP68oo0sF7TUtpq%2BNFgdgfCbq8Ey3q%2BFMLZtvI%3D&ske=2024-08-31T00%3A21%3A54Z&skoid=ca7593d4-ee42-46cd-af88-8b886a2f84eb&sks=b&skt=2024-08-30T12%3A21%3A54Z&sktid=398a6654-997b-47e9-b12b-9515b896b4de&skv=2024-05-04&sp=r&spr=https&sr=b&st=2024-08-30T13%3A56%3A38Z&sv=2024-05-04

tvalentyn commented 2 months ago

Currently failing test:

gradlew :sdks:python:test-suites:portable:py312:portableLocalRunnerJuliaSetWithSetupPy

damccorm commented 2 weeks ago

This is red again - https://github.com/apache/beam/actions/workflows/beam_PostCommit_Python.yml?query=branch%3Amaster

It looks like there are currently 2 issues:

The Python 3.9 job is failing, probably because of the mypy changes (example failure).
The TensorRT tests are failing. Originally, they were failing because of a mismatch between container and local Python versions, but now they seem to be running into CUDA issues with the new container (example failure and corresponding failing Dataflow job).

damccorm commented 2 weeks ago

@jrmccluskey would you mind taking a look at these?

jrmccluskey commented 2 weeks ago

The failure in the 3.9 postcommit is apache_beam/examples/fastavro_it_test.py::FastavroIT::test_avro_it. I will dive deeper into that shortly.

jrmccluskey commented 2 weeks ago

The problem in the TensorRT container is that we seem to have two different versions of CUDA installed, one at version 11.8 and the other at 12.1 (we want everything at 12.1)

damccorm commented 1 week ago

Looks like after sickbaying TensorRT tests, there are still failures. https://ge.apache.org/s/27igat7sfmcsu/console-log/task/:sdks:python:test-suites:portable:py310:portableWordCountSparkRunnerBatch?anchor=60&page=1 is an example, it looks like we're failing because we're missing a class in the spark runner.

@Abacn would you mind taking a look? It's unclear why this is happening now, but I'm guessing it may be related to https://github.com/apache/beam/pull/32976 (and maybe some caching kept it from showing up?)

Abacn commented 1 week ago

> Looks like after sickbaying TensorRT tests, there are still failures. https://ge.apache.org/s/27igat7sfmcsu/console-log/task/:sdks:python:test-suites:portable:py310:portableWordCountSparkRunnerBatch?anchor=60&page=1 is an example, it looks like we're failing because we're missing a class in the spark runner.
>
> @Abacn would you mind taking a look? It's unclear why this is happening now, but I'm guessing it may be related to #32976 (and maybe some caching kept it from showing up?)

It's a bad Gradle cache. I cannot reproduce it locally on the master branch. I also inspected the expansion jar.

For some reason, the Gradle cache for shadowJar has been breaking more frequently recently.

shunping commented 5 days ago

It started to fail last week again (Friday days ago) since the distroless python sdk PR: https://github.com/apache/beam/commit/81f35ab62298a2ec9fadeded82461b363b6401db (@damondouglas)

#21 [distroless 5/6] COPY --from=base /usr/lib/python3.9 /usr/lib/python3.9
#21 ERROR: failed to calculate checksum of ref 21e0551f-9179-41a9-b6c7-d487e40b7288::4b5lek0fokkw0omzyb94t5h7y: "/usr/lib/python3.9": not found

shunping commented 5 days ago

There is no /usr/lib/python3.9 in the image python:3.9-bookworm. I can only see the python3 and python3.11 folders there, and I think we may need to copy the python3 one.

$ docker run -it python:3.9-bookworm bash
root@b730cccba5a8:/# ls -d /usr/lib/python*
/usr/lib/python3  /usr/lib/python3.11

root@b730cccba5a8:/# ls -d /usr/local/lib/python*
/usr/local/lib/python3.11  /usr/local/lib/python3.9
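
To double-check which directory the image's interpreter really uses for its standard library, a small sketch that could be run with the container's `python` (the function name `stdlib_dir` is only for illustration):

```python
import sysconfig


def stdlib_dir():
    """Return the directory holding this interpreter's pure-Python stdlib."""
    return sysconfig.get_paths()["stdlib"]


print(stdlib_dir())
```

Whatever path this prints inside python:3.9-bookworm is the one a `COPY --from=base` step would need to carry over. The listing above suggests it is under /usr/local/lib rather than /usr/lib, but that should be verified in the image.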

@damondouglas , could you confirm that?

damccorm commented 5 days ago

@shunping I think Damon is on vacation. If there is a quick fix, please go ahead and apply it. Otherwise, could you please revert, and we can try again when Damon is back or after the 2.61.0 release?

cc/ @Abacn

shunping commented 5 days ago

Sounds good, I will see if the fix I have in mind works.

shunping commented 5 days ago

OK, I took another look at this. The test started to fail at 11/06 6:32PM (https://github.com/apache/beam/actions/runs/11713650994), and the last successful run was at 11/06 12:33PM (https://github.com/apache/beam/actions/runs/11708854671). There are two commits in this time interval:

Distro Python SDK: https://github.com/apache/beam/commit/81f35ab62298a2ec9fadeded82461b363b6401db, which causes the previously mentioned error during docker image building (:sdks:python:container:py39:docker)
Kafka: https://github.com/apache/beam/commit/eeebae1bda6b211463e53a4e4ca469bfa9763399, which seems to be the reason for failure (:sdks:python:test-suites:portable:py39:postCommitPy39IT).

The Kafka error message is shown below:

FAILED apache_beam/io/external/xlang_kafkaio_it_test.py::CrossLanguageKafkaIOTest::test_local_kafkaio_populated_key - RuntimeError: Pipeline BeamApp-runner-1111115329-514dd26a_03822608-80d0-4037-bc13-11d632204f46 failed in state FAILED: java.lang.RuntimeException: Error received from SDK harness for instruction 3: org.apache.beam.sdk.util.UserCodeException: java.io.IOException: KafkaWriter : failed to send 1 records (since last report)
    at org.apache.beam.sdk.util.UserCodeException.wrap(UserCodeException.java:39)
    at org.apache.beam.sdk.io.kafka.KafkaWriter$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:810)
    at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:348)
    at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:275)
    at org.apache.beam.fn.harness.FnApiDoFnRunner.outputTo(FnApiDoFnRunner.java:1837)
    at org.apache.beam.fn.harness.FnApiDoFnRunner.access$3100(FnApiDoFnRunner.java:145)
    at org.apache.beam.fn.harness.FnApiDoFnRunner$NonWindowObservingProcessBundleContext.output(FnApiDoFnRunner.java:2695)
    at org.apache.beam.sdk.transforms.MapElements$2.processElement(MapElements.java:151)
    at org.apache.beam.sdk.transforms.MapElements$2$DoFnInvoker.invokeProcessElement(Unknown Source)
    at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:810)
    at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:348)
    at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:275)
    at org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:213)
    at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.multiplexElements(BeamFnDataInboundObserver.java:172)
    at org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:136)
    at org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:550)
    at org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:150)
    at org.apache.beam.fn.harness.control.BeamFnControlClient$InboundObserver.lambda$onNext$0(BeamFnControlClient.java:115)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at org.apache.beam.sdk.util.UnboundedScheduledExecutorService$ScheduledFutureTask.run(UnboundedScheduledExecutorService.java:163)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: KafkaWriter : failed to send 1 records (since last report)
    at org.apache.beam.sdk.io.kafka.KafkaWriter.checkForFailures(KafkaWriter.java:183)
    at org.apache.beam.sdk.io.kafka.KafkaWriter.processElement(KafkaWriter.java:66)
Caused by: org.apache.kafka.common.errors.TimeoutException: Topic xlang_kafkaio_test_populated_key_e9df3a07-037f-45a1-afde-7cea599f9570 not present in metadata after 60000 ms.

@Abacn , could you check this and see if we need to roll it back?

Abacn commented 5 days ago

Thanks for taking care of it. I am +1 for rollback. The first distroless PR was expected to be a no-op for 2.61.0 release. Good to know it broke something before release cut.

The PostCommit Python job is flaky #30513