YUE-FAN opened this issue 3 months ago
+1
My current workaround is to catch the FileNotFoundError and recreate the data_iterator. To get back to the batch you were on, you can then run load_next_batch multiple times:
import datetime

# create_data_iterator and load_next_batch are MaxText's training helpers.
def create_skipped_iterator(config, mesh, step):
    while True:
        print(f'Starting skipping: step={step}, time={datetime.datetime.now()}')
        try:
            data_iterator, _ = create_data_iterator(config, mesh)
            # Replay the first `step` batches to get back to where training stopped.
            for _ in range(step):
                _ = load_next_batch(data_iterator, None, config)
            break
        except FileNotFoundError:
            print("Encountered FileNotFoundError during skipping :(, will create a new data_iterator")
            continue
    print(f'Finished skipping: step={step}, time={datetime.datetime.now()}')
    return data_iterator
but that's an error-prone and very slow solution.
That's a nice workaround :) thanks a lot!
I made a small modification so that you don't have to recreate the data iterator every time and skip all the data from the beginning. If the data iterator's state is checkpointed with Grain, your function can be made quite efficient with the change below. This code currently works for me :)
import datetime
import json

import jax
from google.cloud import storage

def create_skipped_iterator(config, mesh, step):
    while True:
        print(f'>>> Starting skipping: step={step}, time={datetime.datetime.now()}')
        try:
            # Last step at which the data iterator state was checkpointed.
            start_step = int(step // config.checkpoint_period * config.checkpoint_period)
            storage_client = storage.Client()
            bucket = storage_client.get_bucket(config.gcp_bucket)
            path = f'{config.run_name}/checkpoints/{start_step}/iter/process_{int(jax.process_index())}-of-{int(jax.process_count())}.json'
            print(f'>>> load dataloader state from path {path}')
            blob = bucket.blob(path)
            # Round-trip through json so set_state receives JSON-encoded bytes.
            state = json.loads(blob.download_as_string(client=None))
            state = json.dumps(state, indent=4).encode()
            storage_client.close()
            data_iterator, _ = create_data_iterator(config, mesh)
            # Restore the Grain iterator to the checkpointed position, see
            # https://github.com/google/grain/blob/main/grain/_src/python/data_loader_test.py#L496C28-L496C77
            data_iterator.local_iterator.set_state(state)
            print(f'>>> skipping from step={start_step + 1} to step={step - 1}')
            for _ in range(start_step + 1, step):
                _ = load_next_batch(data_iterator, None, config)
            break
        except FileNotFoundError:
            print(">>> Encountered FileNotFoundError during skipping :(, will create a new data_iterator")
            continue
    print(f'>>> Finished skipping: step={step}, time={datetime.datetime.now()}')
    return data_iterator
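For completeness, here is a minimal sketch of the save side that the function above assumes: at every checkpoint_period step, each process writes its Grain iterator state to the same GCS path the restore reads from. save_iterator_state is a hypothetical helper name, not part of MaxText:

import jax
from google.cloud import storage

def save_iterator_state(config, data_iterator, step):
    # Hypothetical helper: persist this process's Grain iterator state so it can
    # later be restored by create_skipped_iterator above.
    if step % config.checkpoint_period != 0:
        return
    state = data_iterator.local_iterator.get_state()  # JSON-encoded bytes from Grain
    path = (f'{config.run_name}/checkpoints/{step}/iter/'
            f'process_{int(jax.process_index())}-of-{int(jax.process_count())}.json')
    client = storage.Client()
    bucket = client.get_bucket(config.gcp_bucket)
    bucket.blob(path).upload_from_string(state)
    client.close()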
PS: it just occurred to me while writing this reply that maybe a better workaround is to keep a copy of the data iterator's state from the previous step inside the train loop (state = data_iterator.local_iterator.get_state()). In case of a FileNotFoundError, we just repeat this single step from that state until it passes:
example_batch = load_next_batch(data_iterator, example_batch, config)

becomes:
try:
    # Snapshot the iterator state so this single step can be replayed.
    state = data_iterator.local_iterator.get_state()
    example_batch = load_next_batch(data_iterator, example_batch, config)
except FileNotFoundError:
    while True:
        try:
            # Rewind to the snapshot and retry the step until it succeeds.
            data_iterator.local_iterator.set_state(state)
            example_batch = load_next_batch(data_iterator, None, config)
            break
        except FileNotFoundError:
            data_iterator.local_iterator.set_state(state)
            continue
Though I haven't tested this code yet...
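If it turns out to work, the same retry could be wrapped in a small helper so the train loop stays a one-liner. load_next_batch_with_retry below is just a hypothetical name for a sketch of the logic above:

def load_next_batch_with_retry(data_iterator, example_batch, config):
    # Snapshot the Grain iterator state so this single step can be replayed.
    state = data_iterator.local_iterator.get_state()
    try:
        return load_next_batch(data_iterator, example_batch, config)
    except FileNotFoundError:
        while True:
            try:
                # Rewind to the snapshot and retry just this one step.
                data_iterator.local_iterator.set_state(state)
                return load_next_batch(data_iterator, None, config)
            except FileNotFoundError:
                continue

# In the train loop:
# example_batch = load_next_batch_with_retry(data_iterator, example_batch, config)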
Hi,
I was testing multi-host training on a v4-16 TPU VM. Training normally runs smoothly, but it sometimes crashes at load_next_batch with a FileNotFoundError from process 0. The command for running the job is python3 MaxText/train.py MaxText/configs/gpt2.yml run_name=gpt2 base_output_directory=gs://maxtext_multihost_job steps=120000 dataset_type=hf hf_path=YUE-FAN/openwebtext_gcp hf_data_dir=data tokenizer_path=EleutherAI/gpt-neox-20b eval_interval=4000 hf_eval_split=validation enable_checkpointing=True eval_batch_num=558 per_device_batch_size=32 eval_per_device_batch_size=32 checkpoint_period=10000 logits_via_embedding=True normalize_embedding_logits=True.
I have very limited knowledge of Python multiprocessing, but it seems to be a problem with reading shared memory? The problem does not always occur, only from time to time. Any assistance here would be appreciated. Thanks!