ZettaAI / zetta_utils

MIT License

TransientError Condition? OSError: [Errno 5] Input/output error #591

Open nkemnitz opened 9 months ago

nkemnitz commented 9 months ago

Part of a longer run that crashed last night with:

2023-12-24 03:28:09.049 ERROR    mazepa /home/nkemnitz/zetta_utils/zetta_utils/mazepa/execution_state.py: 129
                                 Task traceback: Traceback (most recent call last):
                                   File "/opt/zetta_utils/zetta_utils/mazepa/worker.py", line 45, in run_worker
                                     task_msgs = task_queue.pull(max_num=max_pull_num)
                                   File "/opt/zetta_utils/zetta_utils/message_queues/sqs/queue.py", line 89, in pull
                                     payload = serialization.deserialize(tq_task.task_ser)
                                   File "/opt/zetta_utils/zetta_utils/message_queues/serialization.py", line 31, in deserialize
                                     result = _deserialize(s, pickle)
                                   File "/opt/zetta_utils/zetta_utils/message_queues/serialization.py", line 25, in _deserialize
                                     result = module.loads(zlib.decompress(codecs.decode(s.encode(), "base64")))
                                   File "/opt/conda/lib/python3.10/encodings/__init__.py", line 99, in search_function
                                   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
                                   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
                                   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
                                   File "<frozen importlib._bootstrap_external>", line 879, in exec_module
                                   File "<frozen importlib._bootstrap_external>", line 1016, in get_code
                                   File "<frozen importlib._bootstrap_external>", line 1074, in get_data
                                 OSError: [Errno 5] Input/output error

Not sure what causes it, but restarting the same flow worked without a problem, so it must be another retriable error?

Update (2024-07-30): I have not seen this error in a while, which makes me think it is indeed related to Image Streaming on GKE. It never happened before we enabled it, and now that Image Streaming no longer seems to work, the error is gone too.

supersergiy commented 9 months ago

Looks like it!

nkemnitz commented 7 months ago

Getting this now extremely often when cold-starting a k8s cluster:

(/home/nkemnitz/zetta/zetta_utils/venv-3.11) nkemnitz@Eriador:~/zetta/zetta_utils$ kubectl logs --previous hissing-piquant-bear-of-honeydew-554dfdd88f-zqjl9
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/conda/bin/zetta:5 in <module>                                           │
│                                                                              │
│   2 # -*- coding: utf-8 -*-                                                  │
│   3 import re                                                                │
│   4 import sys                                                               │
│ ❱ 5 from zetta_utils.cli.main import cli                                     │
│   6 if __name__ == '__main__':                                               │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])     │
│   8 │   sys.exit(cli())                                                      │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/__init__.py:3 in <module>                       │
│                                                                              │
│    1 # pylint: disable=unused-import, import-outside-toplevel                │
│    2 """Zetta AI Computational Connectomics Toolkit."""                      │
│ ❱  3 from . import log, typing, parsing, builder, common                     │
│    4 from . import geometry, distributions, layer, ng                        │
│    5                                                                         │
│    6 builder.registry.MUTLIPROCESSING_INCOMPATIBLE_CLASSES.add("mazepa")     │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/parsing/__init__.py:2 in <module>               │
│                                                                              │
│   1 from . import cue                                                        │
│ ❱ 2 from . import ngl_state                                                  │
│   3 from . import json                                                       │
│   4                                                                          │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/parsing/ngl_state.py:16 in <module>             │
│                                                                              │
│    13 │   make_layer,                                                        │
│    14 )                                                                      │
│    15                                                                        │
│ ❱  16 from zetta_utils.geometry import BBox3D, Vec3D                         │
│    17 from zetta_utils.log import get_logger                                 │
│    18                                                                        │
│    19 logger = get_logger("zetta_utils")                                     │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/geometry/__init__.py:2 in <module>              │
│                                                                              │
│   1 from .vec import Vec3D, IntVec3D, RawVec3D                               │
│ ❱ 2 from .bbox import BBox3D                                                 │
│   3 from .bbox_strider import BBoxStrider                                    │
│   4                                                                          │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/geometry/bbox.py:10 in <module>                 │
│                                                                              │
│     7 import attrs                                                           │
│     8 from typeguard import typechecked                                      │
│     9                                                                        │
│ ❱  10 from zetta_utils import builder                                        │
│    11 from zetta_utils.geometry.vec import VEC3D_PRECISION                   │
│    12                                                                        │
│    13 from . import Vec3D                                                    │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/builder/__init__.py:11 in <module>              │
│                                                                              │
│    8 │   get_initial_builder_spec,                                           │
│    9 │   UnpicklableDict,                                                    │
│   10 )                                                                       │
│ ❱ 11 from . import built_in_registrations                                    │
│   12                                                                         │
│   13 PARALLEL_BUILD_ALLOWED: bool = False                                    │
│   14                                                                         │
│                                                                              │
│ /opt/zetta_utils/zetta_utils/builder/built_in_registrations.py:5 in <module> │
│                                                                              │
│    2                                                                         │
│    3 from typing import Any, Callable, Optional                              │
│    4                                                                         │
│ ❱  5 import torch  # pylint: disable=unused-import                           │
│    6                                                                         │
│    7 from .building import BuilderPartial                                    │
│    8 from .registry import register                                          │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/__init__.py:1465 in <module>   │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_meta_registrations.py:7 in    │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_decomp/__init__.py:169 in     │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_decomp/decompositions.py:10   │
│ in <module>                                                                  │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_prims/__init__.py:33 in       │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_subclasses/__init__.py:3 in   │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py:13  │
│ in <module>                                                                  │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/torch/_guards.py:14 in <module>      │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/__init__.py:73 in <module>     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/__init__.py:75 in        │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/polyfuncs.py:11 in       │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/specialpolys.py:297 in   │
│ <module>                                                                     │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/polys/rings.py:30 in <module>  │
│                                                                              │
│ /opt/conda/lib/python3.10/site-packages/sympy/printing/__init__.py:25 in     │
│ <module>                                                                     │
│ in _find_and_load:1027                                                       │
│ in _find_and_load_unlocked:1006                                              │
│ in _load_unlocked:688                                                        │
│ in exec_module:879                                                           │
│ in get_code:1016                                                             │
│ in get_data:1074                                                             │
╰──────────────────────────────────────────────────────────────────────────────╯
OSError: [Errno 5] Input/output error
Bus error (core dumped)

trivoldus28 commented 7 months ago

How do we fix this? It is very annoying.

supersergiy commented 7 months ago

A fix would be to add a corresponding transient error condition to mazepa

trivoldus28 commented 7 months ago

@supersergiy Is this the list? https://github.com/ZettaAI/zetta_utils/blob/9295d11ce0d53c48fe1ce4b49b5901ce4f8b5838/zetta_utils/mazepa/transient_errors.py

Should it be something like this?

TransientErrorCondition(
    exception_type=OSError,
    text_signature="[Errno 5] Input/output error",
),

supersergiy commented 7 months ago

Yup, that's the one!
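
For readers unfamiliar with the mechanism, here is a minimal sketch of how such a condition could classify the error from this issue. The class below is a stand-in written for illustration; it reuses the field names from the snippet above, but the `does_match` method and its matching rule (exception type plus substring of the message) are assumptions, and the real class in `transient_errors.py` may differ.

```python
import errno
from dataclasses import dataclass
from typing import Type


@dataclass(frozen=True)
class TransientErrorCondition:
    """Illustrative stand-in; not the actual mazepa definition."""

    exception_type: Type[BaseException]
    text_signature: str = ""

    def does_match(self, exc: BaseException) -> bool:
        # Assumed rule: the exception has the right type and its message
        # contains the signature text.
        return isinstance(exc, self.exception_type) and self.text_signature in str(exc)


io_error_condition = TransientErrorCondition(
    exception_type=OSError,
    text_signature="[Errno 5] Input/output error",
)

# An OSError built from errno.EIO (5) renders as "[Errno 5] Input/output error".
matches_io = io_error_condition.does_match(OSError(errno.EIO, "Input/output error"))

# A different OSError subclass has the right type but the wrong message.
matches_missing = io_error_condition.does_match(
    FileNotFoundError(errno.ENOENT, "No such file or directory")
)
```

Matching on the message text as well as the type keeps unrelated `OSError`s, such as a missing file, from being retried.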

supersergiy commented 7 months ago

Sorry, I should've linked it in my comment to begin with.

trivoldus28 commented 7 months ago

Made a branch for the fix: https://github.com/ZettaAI/zetta_utils/tree/tri/transient-errors. I still need to test it before making a PR.

nkemnitz commented 2 months ago

Another one: google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/universe/universe_domain from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable

Full traceback:


Task traceback: Traceback (most recent call last):          
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 61, in get_connection                    
    conn = self.pool.get(block=False)                       
           ^^^^^^^^^^^^^^^^^^^^^^^^^^                       
  File "/usr/lib/python3.11/queue.py", line 168, in get     
    raise Empty            
_queue.Empty               

During handling of the above exception, another exception occurred:                          

Traceback (most recent call last):                          
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 99, in __call__                  
    return_value = self._call_task_fn(debug=debug)          
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^          
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn             
    return_value = self.fn(*self.args, **self.kwargs)       
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       
  File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/volumetric_callable_operation.py", line 90, in __call__   
    task_kwargs = _process_callable_kwargs(idx_input_padded, kwargs)                         
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                         
  File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/callable_operation.py", line 23, in _process_callable_kwargs                               
    result[k] = v.read_with_procs(idx)                      
                ^^^^^^^^^^^^^^^^^^^^^^                      
  File "/opt/zetta_utils/zetta_utils/layer/layer_base.py", line 53, in read_with_procs       
    data_backend = self.backend.read(idx=idx_proced)        
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^        
  File "/opt/zetta_utils/zetta_utils/layer/volumetric/cloudvol/backend.py", line 198, in read
    data_raw = cvol[idx.to_slices()]                        
               ~~~~^^^^^^^^^^^^^^^^^                        
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/frontends/precomputed.py", line 551, in __getitem__               
    img = self.download(requested_bbox, self.mip)           
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^           
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/frontends/precomputed.py", line 731, in download                  
    tup = self.image.download(                              
          ^^^^^^^^^^^^^^^^^^^^                              
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/__init__.py", line 200, in download  
    return rx.download(    
           ^^^^^^^^^^^^    
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 295, in download        
    download_chunks_threaded(                               
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 599, in download_chunks_threaded                         
    schedule_jobs(         
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 150, in schedule_jobs                         
    return schedule_threaded_jobs(fns, concurrency, progress, total)                         
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                         
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 37, in schedule_threaded_jobs                 
    with ThreadedQueue(n_threads=concurrency) as tq:        
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 257, in __exit__                         
    self.wait(progress=self.with_progress)                  
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 227, in wait                             
    self._check_errors()   
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 191, in _check_errors                    
    raise err              
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 153, in _consume_queue                   
    self._consume_queue_execution(fn)                       
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/threaded_queue.py", line 180, in _consume_queue_execution         
    fn()                   
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/scheduler.py", line 32, in realupdatefn                           
    res = fn()             
          ^^^^             
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 554, in process         
    labels, bbox = download_chunk(                          
                   ^^^^^^^^^^^^^^^                          
  File "/usr/local/lib/python3.11/dist-packages/cloudvolume/datasource/precomputed/image/rx.py", line 510, in download_chunk  
    ).get([ filename ], raw=True)                           
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^                           
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 100, in inner_decor                           
    return fn(*args, **kwargs)                              
           ^^^^^^^^^^^^^^^^^^^                              
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 441, in get  
    ret = download(first(paths))                            
          ^^^^^^^^^^^^^^^^^^^^^^                            
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 426, in download                              
    raise error            
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 410, in download                              
    with self._get_connection() as conn:                    
         ^^^^^^^^^^^^^^^^^^^^^^                             
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/cloudfiles.py", line 311, in _get_connection                       
    return self._interface_cls(                             
           ^^^^^^^^^^^^^^^^^^^^                             
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/interfaces.py", line 503, in __init__                              
    self._bucket = GC_POOL[GCloudBucketPoolParams(self._path.bucket, self._request_payer)].get_connection(secrets, None)      
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^      
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 64, in get_connection                    
    conn = self._create_connection(secrets, endpoint)       
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 326, in wrapped_f
    return self(f, *args, **kw)                             
           ^^^^^^^^^^^^^^^^^^^^                             
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 406, in __call__ 
    do = self.iter(retry_state=retry_state)                 
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                 
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 362, in iter     
    raise retry_exc.reraise()                               
          ^^^^^^^^^^^^^^^^^^^                               
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 195, in reraise  
    raise self.last_attempt.result()                        
          ^^^^^^^^^^^^^^^^^^^^^^^^^^                        
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 449, in result                
    return self.__get_result()                              
           ^^^^^^^^^^^^^^^^^^^                              
  File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result          
    raise self._exception  
  File "/usr/local/lib/python3.11/dist-packages/tenacity/__init__.py", line 409, in __call__ 
    result = fn(*args, **kwargs)                            
             ^^^^^^^^^^^^^^^^^^^                            
  File "/usr/local/lib/python3.11/dist-packages/cloudfiles/connectionpools.py", line 150, in _create_connection               
    client = Client(       
             ^^^^^^^       
  File "/usr/local/lib/python3.11/dist-packages/google/cloud/storage/client.py", line 235, in __init__                        
    if self._credentials.universe_domain != self.universe_domain:                            
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                    
  File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/credentials.py", line 154, in universe_domain      
    self._universe_domain = _metadata.get_universe_domain(  
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^  
  File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/_metadata.py", line 284, in get_universe_domain    
    universe_domain = get( 
                      ^^^^ 
  File "/usr/local/lib/python3.11/dist-packages/google/auth/compute_engine/_metadata.py", line 217, in get                    
    raise exceptions.TransportError(                        
google.auth.exceptions.TransportError: Failed to retrieve http://metadata.google.internal/computeMetadata/v1/universe/universe_domain from the Google Compute Engine metadata service. Compute Engine Metadata server unavailable
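
If the metadata server outage above is temporary, one way to treat it as retriable is to match on the message text and retry with backoff. The following is a rough sketch only: `retry_transient` is a hypothetical helper, `TransportError` is a local stand-in for `google.auth.exceptions.TransportError`, and none of this reflects how mazepa actually schedules retries.

```python
import time


class TransportError(Exception):
    """Local stand-in for google.auth.exceptions.TransportError (assumption)."""


# Substring taken from the error message reported above.
SIGNATURE = "Compute Engine Metadata server unavailable"


def retry_transient(fn, max_attempts=3, base_delay_s=0.0):
    """Hypothetical helper: call fn, retrying only on the matching error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransportError as exc:
            # Re-raise if the message does not match or attempts are exhausted.
            if SIGNATURE not in str(exc) or attempt == max_attempts:
                raise
            # Exponential backoff; a zero base delay keeps the demo instant.
            time.sleep(base_delay_s * 2 ** (attempt - 1))


calls = {"n": 0}


def flaky_read():
    """Simulated task: fails twice with the reported message, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransportError("Compute Engine Metadata server unavailable")
    return "chunk"


result = retry_transient(flaky_read)
```

In a real deployment a non-zero `base_delay_s` would give the metadata server time to recover between attempts.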

nkemnitz commented 1 month ago

And another one:

mazepa /home/nkemnitz/zetta_utils/zetta_utils/mazepa/execution_state.py: 138
Task traceback: Traceback (most recent call last):
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 99, in __call__
    return_value = self._call_task_fn(debug=debug)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
    return_value = self.fn(*self.args, **self.kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/subchunkable_apply_flow.py", line 73, in __call__
    mazepa.Executor(
  File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 47, in __call__
    return execute(
           ^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 129, in execute
    _execute_from_state(
  File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 179, in _execute_from_state
    submit_ready_tasks(
  File "/opt/zetta_utils/zetta_utils/mazepa/execution.py", line 216, in submit_ready_tasks
    task_outcomes = outcome_queue.pull(max_num=100)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/autoexecute_task_queue.py", line 43, in pull
    results.append(execute_task(task, self.debug, self.handle_exceptions))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/autoexecute_task_queue.py", line 56, in execute_task
    finished_processing, outcome = process_task_message(
                                   ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/worker.py", line 107, in process_task_message
    outcome = task(debug=debug, handle_exceptions=handle_exceptions)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 110, in __call__
    return_value = self._call_task_fn(debug=debug)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 79, in _call_task_fn
    return_value = self.fn(*self.args, **self.kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/compute_field_flow.py", line 147, in __call__
    src_data, src_field_data, src_translation = translation_adjusted_download(
                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/internal/alignment/flows/common.py", line 45, in translation_adjusted_download
    xy_translation_raw = alignment.field.profile_field2d_percentile(field_data)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/zetta_utils/zetta_utils/internal/alignment/field.py", line 26, in profile_field2d_percentile
    if nonzero_field.sum() == 0 or len(nonzero_field.shape) == 1:
       ^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.