ZettaAI / zetta_utils

Allow option to treat all errors as "transient" #434

Open nkemnitz opened 1 year ago

nkemnitz commented 1 year ago

A long-running CPU job kept failing with SSL errors (timeouts?) every 30 minutes or so. The transient error logic we have right now works, but because the transient error conditions are part of the task payload itself, the tasks are getting more and more bloated. And we will probably keep running into more of these temporary errors over time:

Task traceback: Traceback (most recent call last):                           
  File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen              
    httplib_response = self._make_request(  
  File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request        
    self._validate_conn(conn)               
  File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn      
    conn.connect()                          
  File "/opt/conda/lib/python3.10/site-packages/urllib3/connection.py", line 414, in connect                  
    self.sock = ssl_wrap_socket(            
  File "/opt/conda/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket           
    ssl_sock = _ssl_wrap_socket_impl(       
  File "/opt/conda/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl     
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)    
  File "/opt/conda/lib/python3.10/ssl.py", line 513, in wrap_socket          
    return self.sslsocket_class._create(    
  File "/opt/conda/lib/python3.10/ssl.py", line 1071, in _create             
    self.do_handshake()                     
  File "/opt/conda/lib/python3.10/ssl.py", line 1342, in do_handshake        
    self._sslobj.do_handshake()             
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:997)          

During handling of the above exception, another exception occurred:          

Traceback (most recent call last):          
  File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 486, in send                      
    resp = conn.urlopen(                    
  File "/opt/conda/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen              
    retries = retries.increment(            
  File "/opt/conda/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment                
    raise MaxRetryError(_pool, url, error or ResponseError(cause))           
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/***redacted***/o?uploadType=resumable (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))

During handling of the above exception, another exception occurred:          

Traceback (most recent call last):          
  File "/opt/zetta_utils/zetta_utils/mazepa/tasks.py", line 188, in _call_without_upkeep                      
    return_value = self.fn(*self.args, **self.kwargs)                        
  File "/opt/zetta_utils/zetta_utils/mazepa_layer_processing/common/volumetric_callable_operation.py", line 86, in __call__                    
    dst[idx] = dst_data                     
  File "/opt/zetta_utils/zetta_utils/layer/volumetric/layer.py", line 37, in __setitem__                      
    self.write_with_procs(idx=idx_backend, data=data_backend)                
  File "/opt/zetta_utils/zetta_utils/layer/layer_base.py", line 78, in write_with_procs                       
    self.backend.write(idx=idx_proced, data=data_proced)                     
  File "/opt/zetta_utils/zetta_utils/layer/volumetric/cloudvol/backend.py", line 186, in write                
    cvol[slices] = data_final               
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/frontends/precomputed.py", line 961, in __setitem__                                
    self.image.upload(img, bbox.minpt, self.mip, parallel=self.parallel)     
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/datasource/__init__.py", line 53, in guardfn      
    return fn(self, *args, **kwargs)        
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/datasource/precomputed/image/__init__.py", line 345, in upload                     
    return tx.upload(                       
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/datasource/precomputed/image/tx.py", line 95, in upload                            
    upload_aligned(                         
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/datasource/precomputed/image/tx.py", line 169, in upload_aligned                   
    threaded_upload_chunks(                 
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/datasource/precomputed/image/tx.py", line 365, in threaded_upload_chunks           
    schedule_jobs(                          
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/scheduler.py", line 104, in schedule_jobs         
    return schedule_threaded_jobs(fns, concurrency, progress, total)         
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/scheduler.py", line 28, in schedule_threaded_jobs 
    with ThreadedQueue(n_threads=concurrency) as tq:                         
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/threaded_queue.py", line 257, in __exit__         
    self.wait(progress=self.with_progress)  
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/threaded_queue.py", line 234, in wait             
    self._check_errors()                    
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/threaded_queue.py", line 191, in _check_errors    
    raise err                               
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/threaded_queue.py", line 153, in _consume_queue   
    self._consume_queue_execution(fn)       
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/threaded_queue.py", line 180, in _consume_queue_execution                          
    fn()   
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/scheduler.py", line 23, in realupdatefn           
    res = fn()                              
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/datasource/precomputed/image/tx.py", line 359, in process_and_update               
    process(*args, **kwargs)                
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/datasource/precomputed/image/tx.py", line 356, in process                          
    do_upload(imgchunk, cloudpath)          
  File "/opt/conda/lib/python3.10/site-packages/cloudvolume/datasource/precomputed/image/tx.py", line 306, in do_upload                        
    remote.put(                             
  File "/opt/conda/lib/python3.10/site-packages/cloudfiles/cloudfiles.py", line 586, in put                   
    return self.puts({                      
  File "/opt/conda/lib/python3.10/site-packages/cloudfiles/cloudfiles.py", line 98, in inner_decor            
    return fn(*args, **kwargs)              
  File "/opt/conda/lib/python3.10/site-packages/cloudfiles/cloudfiles.py", line 550, in puts                  
    uploadfn(first(files))                  
  File "/opt/conda/lib/python3.10/site-packages/cloudfiles/cloudfiles.py", line 533, in uploadfn              
    conn.put_file(                          
  File "/opt/conda/lib/python3.10/site-packages/tenacity/__init__.py", line 326, in wrapped_f                 
    return self(f, *args, **kw)             
  File "/opt/conda/lib/python3.10/site-packages/tenacity/__init__.py", line 406, in __call__                  
    do = self.iter(retry_state=retry_state) 
  File "/opt/conda/lib/python3.10/site-packages/tenacity/__init__.py", line 362, in iter                      
    raise retry_exc.reraise()               
  File "/opt/conda/lib/python3.10/site-packages/tenacity/__init__.py", line 195, in reraise                   
    raise self.last_attempt.result()        
  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 451, in result                           
    return self.__get_result()              
  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result                     
    raise self._exception                   
  File "/opt/conda/lib/python3.10/site-packages/tenacity/__init__.py", line 409, in __call__                  
    result = fn(*args, **kwargs)            
  File "/opt/conda/lib/python3.10/site-packages/cloudfiles/interfaces.py", line 468, in put_file              
    blob.upload_from_string(content, content_type)                           
  File "/opt/conda/lib/python3.10/site-packages/google/cloud/storage/blob.py", line 2815, in upload_from_string                                
    self.upload_from_file(                  
  File "/opt/conda/lib/python3.10/site-packages/google/cloud/storage/blob.py", line 2540, in upload_from_file 
    created_json = self._do_upload(         
  File "/opt/conda/lib/python3.10/site-packages/google/cloud/storage/blob.py", line 2371, in _do_upload       
    response = self._do_resumable_upload(   
  File "/opt/conda/lib/python3.10/site-packages/google/cloud/storage/blob.py", line 2198, in _do_resumable_upload                              
    upload, transport = self._initiate_resumable_upload(                     
  File "/opt/conda/lib/python3.10/site-packages/google/cloud/storage/blob.py", line 2079, in _initiate_resumable_upload                        
    upload.initiate(                        
  File "/opt/conda/lib/python3.10/site-packages/google/resumable_media/requests/upload.py", line 420, in initiate                              
    return _request_helpers.wait_and_retry( 
  File "/opt/conda/lib/python3.10/site-packages/google/resumable_media/requests/_request_helpers.py", line 178, in wait_and_retry              
    raise error                             
  File "/opt/conda/lib/python3.10/site-packages/google/resumable_media/requests/_request_helpers.py", line 155, in wait_and_retry              
    response = func()                       
  File "/opt/conda/lib/python3.10/site-packages/google/resumable_media/requests/upload.py", line 412, in retriable_request                     
    result = transport.request(             
  File "/opt/conda/lib/python3.10/site-packages/google/auth/transport/requests.py", line 549, in request      
    response = super(AuthorizedSession, self).request(                       
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 589, in request                   
    resp = self.send(prep, **send_kwargs)   
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 703, in send                      
    r = adapter.send(request, **kwargs)     
  File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 517, in send                      
    raise SSLError(e, request=request)      
requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /upload/storage/v1/b/***redacted***/o?uploadType=resumable (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))

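
For reference, the traceback above is an implicitly chained exception: each "During handling of the above exception, another exception occurred" line corresponds to Python's `__context__` link. A minimal sketch of how a transient check could look through the whole chain rather than only at the outermost error is shown below. The helper names `iter_exception_chain` and `chain_matches` are hypothetical and are not part of zetta_utils; matching by class name and substring is an assumption made for illustration.

```python
from typing import Iterator, Optional


def iter_exception_chain(exc: BaseException) -> Iterator[BaseException]:
    """Yield exc, then each exception it was raised from or while handling."""
    seen = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        yield current
        # Explicit `raise ... from` sets __cause__; implicit chaining sets __context__.
        current = current.__cause__ or current.__context__


def chain_matches(exc: BaseException, type_name: str, text_signature: str = "") -> bool:
    """Return True if any exception in the chain has the given class name
    and contains text_signature in its message (hypothetical matching rule)."""
    return any(
        type(e).__name__ == type_name and text_signature in str(e)
        for e in iter_exception_chain(exc)
    )
```

With a helper like this, a single condition on `SSLEOFError` could in principle cover the failure regardless of which library re-wrapped it on the way up.
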
supersergiy commented 1 year ago

The downside of treating all errors as transient is that real errors will get swallowed, and the user will have to dig into the k8s logs to figure out why the tasks are not getting completed.

What is driving you toward this feature more -- the fact that the tasks get bloated with more transient conditions, or that it's annoying to update the transient condition list when encountering them?
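
To make the trade-off concrete, here is a rough sketch of what such an option might look like. Everything in it is hypothetical: `TransientCondition`, `run_with_retries`, `treat_all_as_transient`, and `max_attempts` are illustrative names, not the existing mazepa API. The sketch assumes a retry cap and a logged warning per failed attempt, so that a real error is retried a few times but still surfaces instead of being swallowed indefinitely.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class TransientCondition:
    """Hypothetical condition: an exception type plus an optional message substring."""

    exception_type: type
    text_signature: str = ""

    def matches(self, exc: BaseException) -> bool:
        return isinstance(exc, self.exception_type) and self.text_signature in str(exc)


def run_with_retries(
    fn: Callable[[], T],
    conditions: Sequence[TransientCondition] = (),
    treat_all_as_transient: bool = False,
    max_attempts: int = 3,
) -> T:
    """Call fn, retrying on transient errors up to max_attempts times in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            transient = treat_all_as_transient or any(c.matches(exc) for c in conditions)
            # Non-transient errors and the final failed attempt propagate to the caller.
            if not transient or attempt == max_attempts:
                raise
            logger.warning("Attempt %d/%d failed, retrying: %r", attempt, max_attempts, exc)
    raise AssertionError("unreachable")
```

With `treat_all_as_transient=True` the condition list can stay empty, which would keep the payload small, at the cost of retrying genuine bugs `max_attempts` times before they are reported.
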

nkemnitz commented 1 year ago

I haven't measured how much bigger the tasks are with the transient error list, or whether it impacts task retrieval. For now, the main issue for me is definitely the second part, partly because I am often unsure whether the problem is on my end or not:

Over the weekend I copied a cutout from a public dataset, and the job almost immediately crashed with a JPEG decompression error. I thought maybe I had messed up the bounding box, or that the source dataset was corrupt, but I couldn't find anything. Eventually I re-ran the same code without any changes and it went through just fine. It must have been an incomplete file transfer that somehow made it through all the other checks in boto and cloudvolume.