Open nkemnitz opened 9 months ago
Looks like it!
Getting this now extremely often when cold-starting a k8s cluster:
(/home/nkemnitz/zetta/zetta_utils/venv-3.11) nkemnitz@Eriador:~/zetta/zetta_utils$ kubectl logs --previous hissing-piquant-bear-of-honeydew-554dfdd88f-zqjl9
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/bin/zetta:5 in <module> │
│ │
│ 2 # -*- coding: utf-8 -*- │
│ 3 import re │
│ 4 import sys │
│ ❱ 5 from zetta_utils.cli.main import cli │
│ 6 if __name__ == '__main__': │
│ 7 │ sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0]) │
│ 8 │ sys.exit(cli()) │
│ │
│ /opt/zetta_utils/zetta_utils/__init__.py:3 in <module> │
│ │
│ 1 # pylint: disable=unused-import, import-outside-toplevel │
│ 2 """Zetta AI Computational Connectomics Toolkit.""" │
│ ❱ 3 from . import log, typing, parsing, builder, common │
│ 4 from . import geometry, distributions, layer, ng │
│ 5 │
│ 6 builder.registry.MUTLIPROCESSING_INCOMPATIBLE_CLASSES.add("mazepa") │
│ │
│ /opt/zetta_utils/zetta_utils/parsing/__init__.py:2 in <module> │
│ │
│ 1 from . import cue │
│ ❱ 2 from . import ngl_state │
│ 3 from . import json │
│ 4 │
│ │
│ /opt/zetta_utils/zetta_utils/parsing/ngl_state.py:16 in <module> │
│ │
│ 13 │ make_layer, │
│ 14 ) │
│ 15 │
│ ❱ 16 from zetta_utils.geometry import BBox3D, Vec3D │
│ 17 from zetta_utils.log import get_logger │
│ 18 │
│ 19 logger = get_logger("zetta_utils") │
│ │
│ /opt/zetta_utils/zetta_utils/geometry/__init__.py:2 in <module> │
│ │
│ 1 from .vec import Vec3D, IntVec3D, RawVec3D │
│ ❱ 2 from .bbox import BBox3D │
│ 3 from .bbox_strider import BBoxStrider │
│ 4 │
│ │
│ /opt/zetta_utils/zetta_utils/geometry/bbox.py:10 in <module> │
│ │
│ 7 import attrs │
│ 8 from typeguard import typechecked │
│ 9 │
│ ❱ 10 from zetta_utils import builder │
│ 11 from zetta_utils.geometry.vec import VEC3D_PRECISION │
│ 12 │
│ 13 from . import Vec3D │
│ │
│ /opt/zetta_utils/zetta_utils/builder/__init__.py:11 in <module> │
│ │
│ 8 │ get_initial_builder_spec, │
│ 9 │ UnpicklableDict, │
│ 10 ) │
│ ❱ 11 from . import built_in_registrations │
│ 12 │
│ 13 PARALLEL_BUILD_ALLOWED: bool = False │
│ 14 │
│ │
│ /opt/zetta_utils/zetta_utils/builder/built_in_registrations.py:5 in <module> │
│ │
│ 2 │
│ 3 from typing import Any, Callable, Optional │
│ 4 │
│ ❱ 5 import torch # pylint: disable=unused-import │
│ 6 │
│ 7 from .building import BuilderPartial │
│ 8 from .registry import register │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/__init__.py:1465 in <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/_meta_registrations.py:7 in │
│ <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/_decomp/__init__.py:169 in │
│ <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/_decomp/decompositions.py:10 │
│ in <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/_prims/__init__.py:33 in │
│ <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/_subclasses/__init__.py:3 in │
│ <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py:13 │
│ in <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/torch/_guards.py:14 in <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/sympy/__init__.py:73 in <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/__init__.py:75 in │
│ <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/polyfuncs.py:11 in │
│ <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/specialpolys.py:297 in │
│ <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/rings.py:30 in <module> │
│ │
│ /opt/conda/lib/python3.10/site-packages/sympy/printing/__init__.py:25 in │
│ <module> │
│ in _find_and_load:1027 │
│ in _find_and_load_unlocked:1006 │
│ in _load_unlocked:688 │
│ in exec_module:879 │
│ in get_code:1016 │
│ in get_data:1074 │
╰──────────────────────────────────────────────────────────────────────────────╯
OSError: [Errno 5] Input/output error
Bus error (core dumped)
How do we fix this? It is very annoying.
A fix would be to add a corresponding transient error condition to mazepa.
@supersergiy Is this the list? https://github.com/ZettaAI/zetta_utils/blob/9295d11ce0d53c48fe1ce4b49b5901ce4f8b5838/zetta_utils/mazepa/transient_errors.py
Should it be something like this?
TransientErrorCondition(
    exception_type=OSError,
    text_signature="[Errno 5] Input/output error",
),
Yup, that's the one!
Sorry, I should've linked it in my comment to begin with.
Made a branch for the fix: https://github.com/ZettaAI/zetta_utils/tree/tri/transient-errors. I still need to test it before opening a PR.
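For readers following along, here is a minimal sketch of how a condition like the one proposed above could be matched against a raised exception. The `TransientErrorCondition` class below is a hypothetical stand-in that only borrows the two field names from the snippet; `does_match` and `is_transient` are names made up for illustration, and the real class in `zetta_utils/mazepa/transient_errors.py` may work differently.

```python
from dataclasses import dataclass
from typing import Type


@dataclass(frozen=True)
class TransientErrorCondition:
    # Hypothetical stand-in: only the two field names come from the thread.
    exception_type: Type[BaseException]
    text_signature: str = ""

    def does_match(self, exc: BaseException) -> bool:
        # Match on the exception class and on a substring of its message.
        return isinstance(exc, self.exception_type) and self.text_signature in str(exc)


# The condition proposed in this thread for the cold-start failure.
IO_ERROR_CONDITION = TransientErrorCondition(
    exception_type=OSError,
    text_signature="[Errno 5] Input/output error",
)


def is_transient(exc: BaseException) -> bool:
    # Illustrative helper: True if the exception matches the condition above.
    return IO_ERROR_CONDITION.does_match(exc)
```

Python renders `OSError(5, "Input/output error")` as `[Errno 5] Input/output error`, so a substring match on that text is enough to distinguish it from other `OSError`s.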
Another one:
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/universe/universe_domain from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable
Full traceback:
Task traceback: Traceback (most recent call last):
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 61, in get_connection
conn = self.pool.get(block=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/queue.py", line 168, in get
raise Empty
_queue.Empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 99, in __call__
return_value = self._call_task_fn(debug=debug)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
return_value = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/volumetric_callable_operation.py", line 90, in __call__
task_kwargs = _process_callable_kwargs(idx_input_padded, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/callable_operation.py", line 23, in _process_callable_kwargs
result[k] = v.read_with_procs(idx)
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/layer/layer_base.py", line 53, in read_with_procs
data_backend = self.backend.read(idx=idx_proced)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/layer/volumetric/cloudvol/backend.py", line 198, in read
data_raw = cvol[idx.to_slices()]
~~~~^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/frontends/precomputed.py", line 551, in __getitem__
img = self.download(requested_bbox, self.mip)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/frontends/precomputed.py", line 731, in download
tup = self.image.download(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/__init__.py", line 200, in download
return rx.download(
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 295, in download
download_chunks_threaded(
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 599, in download_chunks_threaded
schedule_jobs(
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 150, in schedule_jobs
return schedule_threaded_jobs(fns, concurrency, progress, total)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 37, in schedule_threaded_jobs
with ThreadedQueue(n_threads=concurrency) as tq:
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 257, in __exit__
self.wait(progress=self.with_progress)
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 227, in wait
self._check_errors()
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 191, in _check_errors
raise err
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 153, in _consume_queue
self._consume_queue_execution(fn)
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 180, in _consume_queue_execution
fn()
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 32, in realupdatefn
res = fn()
^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 554, in process
labels, bbox = download_chunk(
^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 510, in download_chunk
).get([ filename ], raw=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 100, in inner_decor
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 441, in get
ret = download(first(paths))
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 426, in download
raise error
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 410, in download
with self._get_connection() as conn:
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 311, in _get_connection
return self._interface_cls(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/interfaces.py", line 503, in __init__
self._bucket = GC_POOL[GCloudBucketPoolParams(self._path.bucket, self._request_payer)].get_connection(secrets, None)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 64, in get_connection
conn = self._create_connection(secrets, endpoint)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 326, in wrapped_f
return self(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 406, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 362, in iter
raise retry_exc.reraise()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 195, in reraise
raise self.last_attempt.result()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 409, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 150, in _create_connection
client = Client(
^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/google/cloud/storage/client.py", line 235, in __init__
if self._credentials.universe_domain != self.universe_domain:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/credentials.py", line 154, in universe_domain
self._universe_domain = _metadata.get_universe_domain(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/_metadata.py", line 284, in get_universe_domain
universe_domain = get(
^^^^
File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/_metadata.py", line 217, in get
raise exceptions.TransportError(
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/universe/universe_domain from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable
And another one:
mazepa /home/nkemnitz/zetta_utils/zetta_utils/mazepa/execution_state.py: 138
Task traceback: Traceback (most recent call last):
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 99, in __call__
return_value = self._call_task_fn(debug=debug)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
return_value = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/subchunkable_apply_flow.py", line 73, in __call__
mazepa.Executor(
File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 47, in __call__
return execute(
^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 129, in execute
_execute_from_state(
File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 179, in _execute_from_state
submit_ready_tasks(
File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 216, in submit_ready_tasks
task_outcomes = outcome_queue.pull(max_num=100)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/autoexecute_task_queue.py", line 43, in pull
results.append(execute_task(task, self.debug, self.handle_exceptions))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/autoexecute_task_queue.py", line 56, in execute_task
finished_processing, outcome = process_task_message(
^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/worker.py", line 107, in process_task_message
outcome = task(debug=debug, handle_exceptions=handle_exceptions)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 110, in __call__
return_value = self._call_task_fn(debug=debug)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
return_value = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/compute_field_flow.py", line 147, in __call__
src_data, src_field_data, src_translation = translation_adjusted_download(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/common.py", line 45, in translation_adjusted_download
xy_translation_raw = alignment.field.profile_field2d_percentile(field_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/zetta_utils/zetta_utils/internal/alignment/field.py", line 26, in profile_field2d_percentile
if nonzero_field.sum() == 0 or len(nonzero_field.shape) == 1:
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
This was part of a longer run that crashed last night.
Not sure what causes it, but restarting the same flow worked without a problem, so it must be another retriable error?
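If this one is treated as retriable, the retry could look roughly like the sketch below. Everything here is an assumption for illustration: `RETRIABLE` and `run_with_retries` are made-up names, not mazepa's API, and only the message text is copied from the traceback above. One caveat: an illegal memory access often leaves the CUDA context of the process unusable, so retrying inside the same process may not help. What worked here was restarting the flow.

```python
import time
from typing import Callable, Sequence, Tuple, Type, TypeVar

T = TypeVar("T")

# (exception type, message substring) pairs; hypothetical list for illustration.
RETRIABLE: Sequence[Tuple[Type[BaseException], str]] = (
    (RuntimeError, "CUDA error: an illegal memory access was encountered"),
)


def run_with_retries(fn: Callable[[], T], max_attempts: int = 3, delay_sec: float = 0.0) -> T:
    """Call fn, retrying only when the raised exception matches RETRIABLE."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            retriable = any(isinstance(exc, t) and s in str(exc) for t, s in RETRIABLE)
            # Re-raise anything that is not retriable, or the last failed attempt.
            if not retriable or attempt == max_attempts:
                raise
            time.sleep(delay_sec)
    raise AssertionError("unreachable")
```

Errors that do not match any entry propagate immediately, so genuine bugs are not hidden behind retries.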
Update (2024-07-30): I have not seen this error in a while, which makes me think it is indeed related to Image Streaming on GKE. It never happened before we enabled it, and now that Image Streaming no longer seems to work, the error is gone too.