genmoai / models

The best OSS video generation models

Error at the end of generation: Error while creating shared memory segment #36

ichernev opened this issue 1 week ago

ichernev commented 1 week ago

Running on 4xH100 as specified in the README:

Traceback (most recent call last):                                                                                                                                                         
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/gradio/queueing.py", line 624, in process_events                                                                         
    response = await route_utils.call_process_api(
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/gradio/route_utils.py", line 323, in call_process_api                                                                    
    output = await app.get_blocks().process_api(
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/gradio/blocks.py", line 2018, in process_api                                                                             
    result = await self.call_function(
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/gradio/blocks.py", line 1567, in call_function                                                                           
    prediction = await anyio.to_thread.run_sync(  # type: ignore
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync                                                                                
    return await get_async_backend().run_sync_in_worker_thread(
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2441, in run_sync_in_worker_thread                                                    
    return await future
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 943, in run                                                                           
    result = context.run(func, *args)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/gradio/utils.py", line 846, in wrapper                                                                                   
    response = f(*args, **kwargs)
  File "/root/workdir/models/demos/cli.py", line 94, in generate_video                                                                                                                     
    final_frames = pipeline(**args)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/genmo/mochi_preview/pipelines.py", line 656, in __call__                                                                 
    return ray.get([ctx.run.remote(fn=sample, **kwargs, show_progress=i == 0) for i, ctx in enumerate(self.ctxs)])[
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper                                                           
    return fn(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper                                                                  
    return func(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/ray/_private/worker.py", line 2745, in get                                                                               
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/ray/_private/worker.py", line 901, in get_objects                                                                        
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(DistBackendError): ray::MultiGPUContext.run() (pid=16621, ip=172.17.0.2, actor_id=16422e9e73186d307c970b1801000000, repr=<genmo.mochi_preview.pipelines.MultiGPUContext object at 0x7fe21adbebc0>)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/genmo/mochi_preview/pipelines.py", line 611, in run
    return fn(self, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/genmo/mochi_preview/pipelines.py", line 653, in sample
    frames = decode_latents(ctx.decoder, latents)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/genmo/mochi_preview/pipelines.py", line 379, in decode_latents
    samples = decoder(z)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/genmo/mochi_preview/vae/model.py", line 663, in forward
    x = block(x)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/container.py", line 250, in forward
    input = module(input)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/genmo/mochi_preview/vae/model.py", line 287, in forward
    x = self.stack(x)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/container.py", line 250, in forward
    input = module(input)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/genmo/mochi_preview/vae/model.py", line 155, in forward
    x = cp_pass_frames(x, context_size)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/genmo/mochi_preview/vae/cp_conv.py", line 44, in cp_pass_frames
    dist.recv(recv_buffer, global_rank - 1, group=group)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
    return func(*args, **kwargs)
  File "/root/workdir/models/.venv-2/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2200, in recv
    pg.recv([tensor], src, tag).wait()
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
Last error:
Error while creating shared memory segment /dev/shm/nccl-i6uXpH (size 7340384)

I ran it with NCCL_DEBUG=INFO; let me know if you need the full output (I didn't see anything interesting in it), apart from maybe the end:

(MultiGPUContext pid=28257) dfd351ec56be:28257:39328 [0] NCCL INFO Connected all rings
(MultiGPUContext pid=28257) dfd351ec56be:28257:39328 [0] NCCL INFO Connected all trees                                                                                                     
(MultiGPUContext pid=28257) dfd351ec56be:28257:39328 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512                                                                           
(MultiGPUContext pid=28257) dfd351ec56be:28257:39328 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer                       
(MultiGPUContext pid=28257) dfd351ec56be:28257:39328 [0] NCCL INFO ncclCommInitRank comm 0x559a62fef0f0 rank 0 nranks 2 cudaDev 0 nvmlDev 2 busId 4c000 commId 0x7c7cabf286d74373 - Init COMPLETE
(MultiGPUContext pid=28253)                                                                                                                                                                
(MultiGPUContext pid=28253) dfd351ec56be:28253:39351 [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-o3UGjI to 7340388 bytes                                       
(MultiGPUContext pid=28253)                                                                                                                                                                
(MultiGPUContext pid=28253) dfd351ec56be:28253:39351 [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-o3UGjI (size 7340384)                     
(MultiGPUContext pid=28253) dfd351ec56be:28253:39351 [0] NCCL INFO proxy.cc:1257 -> 2                                                                                                      
(MultiGPUContext pid=28253) dfd351ec56be:28253:39351 [0] NCCL INFO proxy.cc:1320 -> 2                                                                                                      
(MultiGPUContext pid=28253) dfd351ec56be:28253:39332 [0] NCCL INFO proxy.cc:1068 -> 2                                                                                                      
(MultiGPUContext pid=28253) dfd351ec56be:28253:39332 [0] NCCL INFO init.cc:1369 -> 2                                                                                                       
(MultiGPUContext pid=28253) dfd351ec56be:28253:39332 [0] NCCL INFO init.cc:1548 -> 2                                                                                                       
(MultiGPUContext pid=28253) dfd351ec56be:28253:39332 [0] NCCL INFO group.cc:64 -> 2 [Async thread]                                                                                         
(MultiGPUContext pid=28253) dfd351ec56be:28253:28253 [0] NCCL INFO group.cc:418 -> 2                                                                                                       
(MultiGPUContext pid=28253) dfd351ec56be:28253:28253 [0] NCCL INFO init.cc:1929 -> 2                                                                                      
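
My guess: the failing allocation is only ~7 MiB, so this looks less like a GPU problem and more like /dev/shm itself being full or too small. The container-style hostname (dfd351ec56be) and the 172.17.0.2 address suggest this is running inside Docker, where /dev/shm defaults to 64 MiB. A quick sketch of how I'd check the tmpfs from inside the worker environment (my own snippet, not part of the repo):

# Diagnostic sketch: report how much of the /dev/shm tmpfs is left, assuming
# the workers see the same mount that NCCL writes its nccl-* segments into.
import shutil

total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: total={total / 2**20:.0f} MiB, "
      f"used={used / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")

# NCCL failed to extend one of its ~7 MiB segments (7340388 bytes); with
# several ranks and 32 p2p channels per peer those segments add up, so a
# 64 MiB default tmpfs fills quickly.
print("additional ~7 MiB segments that would still fit:", free // 7_340_388)
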
ichernev commented 1 week ago

This is after the crash:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 80GB HBM3          On  | 00000000:19:00.0 Off |                    0 |
| N/A   36C    P0             115W / 700W |  26911MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  | 00000000:3B:00.0 Off |                    0 |
| N/A   29C    P0             110W / 700W |  27457MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  | 00000000:4C:00.0 Off |                    0 |
| N/A   30C    P0             114W / 700W |  48119MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  | 00000000:5D:00.0 Off |                    0 |
| N/A   34C    P0             115W / 700W |  55873MiB / 81559MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

Before the crash, memory usage was roughly the same across the GPUs, but on a couple of them (GPUs 2 and 3 above) it explodes at the end.
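
If it helps to pin down where that jump comes from, a hypothetical bit of instrumentation (not part of genmo/mochi_preview) would be to log each rank's CUDA allocator stats right before the VAE decode, i.e. just before frames = decode_latents(ctx.decoder, latents) in sample():

# Hypothetical instrumentation: print per-rank allocator stats so the memory
# jump at the end of generation can be attributed to specific ranks.
import torch
import torch.distributed as dist

def log_memory(tag: str) -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    allocated = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[rank {rank}] {tag}: allocated={allocated:.1f} GiB, peak={peak:.1f} GiB")

# e.g. call log_memory("before VAE decode") in pipelines.py's sample(), right
# before the decode_latents call shown in the traceback above.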