QuarkContainer / Quark

A secure container runtime with CRI/OCI interface
Apache License 2.0

MPI Not Working in Quark #1281

Open chengchen666 opened 1 month ago

chengchen666 commented 1 month ago

Issue log:

root@5b2cd4b2aca7:/cchen/Quark/test# mpirun -np 2  --allow-run-as-root python3 pytorch_minimal.py
[5b2cd4b2aca7:00372] opal_ifinit: ioctl(SIOCGIFADDR) failed with errno=19
[5b2cd4b2aca7:00372] *** Process received signal ***
[5b2cd4b2aca7:00372] Signal: Floating point exception (8)
[5b2cd4b2aca7:00372] Signal code: Integer divide-by-zero (1)
[5b2cd4b2aca7:00372] Failing at address: 0x2afa05c7472e
[5b2cd4b2aca7:00372] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x2afa05e42520]
[5b2cd4b2aca7:00372] [ 1] /opt/hpcx/ompi/lib/libopen-pal.so.40(+0xb772e)[0x2afa05c7472e]
[5b2cd4b2aca7:00372] [ 2] /opt/hpcx/ompi/lib/libopen-pal.so.40(+0xb85d6)[0x2afa05c755d6]
[5b2cd4b2aca7:00372] [ 3] /opt/hpcx/ompi/lib/libopen-pal.so.40(+0xb8b0e)[0x2afa05c75b0e]
[5b2cd4b2aca7:00372] [ 4] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_hwloc201_hwloc_topology_load+0xdb)[0x2afa05c866eb]
[5b2cd4b2aca7:00372] [ 5] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_hwloc_base_get_topology+0x1116)[0x2afa05c53e66]
[5b2cd4b2aca7:00372] [ 6] /opt/hpcx/ompi/lib/openmpi/mca_ess_hnp.so(+0x6686)[0x2afa0605f686]
[5b2cd4b2aca7:00372] [ 7] /opt/hpcx/ompi/lib/libopen-rte.so.40(orte_init+0x2b8)[0x2afa05b94fd8]
[5b2cd4b2aca7:00372] [ 8] /opt/hpcx/ompi/lib/libopen-rte.so.40(orte_submit_init+0x8e5)[0x2afa05b45535]
[5b2cd4b2aca7:00372] [ 9] mpirun(+0x13a3)[0x5565f56533a3]
[5b2cd4b2aca7:00372] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x2afa05e29d90]
[5b2cd4b2aca7:00372] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x2afa05e29e40]
[5b2cd4b2aca7:00372] [12] mpirun(+0x11f5)[0x5565f56531f5]
[5b2cd4b2aca7:00372] *** End of error message ***
Floating point exception

sudo docker run -it --runtime=quark_d -v /home/Quark:/Quark --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:24.01-py3 bash

To reproduce: run mpirun -np 2 --allow-run-as-root python3 /Quark/test/pytorch_minimal.py, and change the code of pytorch_minimal.py to switch the device from GPU to CPU, to make sure the CPU version of the program works first.

# device = torch.device("cuda:0")
device = torch.device("cpu")
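For reference, a minimal script of roughly this shape (just a sketch; the actual test/pytorch_minimal.py may differ) is enough to exercise the failure, since the crash happens in mpirun/hwloc before the Python code even runs:

import torch

device = torch.device("cpu")  # switched from "cuda:0" as described above

# A tiny tensor computation, just so mpirun has something to launch.
x = torch.randn(4, 4, device=device)
y = torch.randn(4, 4, device=device)
print((x @ y).sum().item())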
chengchen666 commented 1 month ago

https://github.com/QuarkContainer/Quark/blob/gpu-multiprocessing/test/multiprocess_torchminimal.py also fails. The test program is in the gpu-multiprocessing branch. Just running python3 multiprocess_torchminimal.py launches two CPU programs via Ray, so it requires an image with Ray preinstalled.
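Roughly, the script spawns two CPU-only worker tasks through Ray. A hypothetical sketch of that shape (the real multiprocess_torchminimal.py in the gpu-multiprocessing branch may differ) would be:

import ray
import torch

ray.init()

@ray.remote
def worker(seed):
    # CPU-only work, mirroring the "change device from GPU to CPU" step above.
    torch.manual_seed(seed)
    x = torch.randn(4, 4)
    return float((x @ x.T).sum())

# Launch two tasks and wait for both results.
print(ray.get([worker.remote(0), worker.remote(1)]))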

QuarkContainer commented 1 month ago

The repro could be simplified as below.

sudo docker run -it --runtime=quark_d --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:24.01-py3 bash -c "mpirun -np 2 --allow-run-as-root ls"

As the crash is in /opt/hpcx/ompi/lib/libopen-pal.so, I downloaded it, ran objdump on it, and got the assembly code below. It crashes at the last line.

b7701: 48 8b 7c 24 08    mov    0x8(%rsp),%rdi
b7706: 8b 77 20          mov    0x20(%rdi),%esi
b7709: 85 f6             test   %esi,%esi
b770b: 75 69             jne    b7776 <look_proc.isra.0+0x326>
b770d: 83 c1 01          add    $0x1,%ecx
b7710: 41 89 4f 2c       mov    %ecx,0x2c(%r15)
b7714: 85 db             test   %ebx,%ebx
b7716: 75 2a             jne    b7742 <look_proc.isra.0+0x2f2>
b7718: c1 e8 1a          shr    $0x1a,%eax
b771b: 31 d2             xor    %edx,%edx
b771d: 8d 48 01          lea    0x1(%rax),%ecx
b7720: 8b 44 24 1c       mov    0x1c(%rsp),%eax
b7724: f7 f1             div    %ecx
b7726: 31 d2             xor    %edx,%edx
b7728: 89 c1             mov    %eax,%ecx
b772a: 8b 44 24 10       mov    0x10(%rsp),%eax
b772e: f7 f1             div    %ecx

The code at b7718 looks very much like the following source line: https://github.com/open-mpi/hwloc/blob/63a8288d31a1baf67a909466aba9a022c78ca7b1/hwloc/topology-x86.c#L727

But I can't map the assembly to the C source code :-(

It might be related to https://github.com/open-mpi/hwloc/issues/525.

The issue seems to be related to cpuid with ax=4, cx=0. The following is a test which shows that when that leaf is disabled, the issue can be skipped.

https://github.com/QuarkContainer/Quark/commit/e65898c2942f70e6b7d47a9efaf509d18dac50fe
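If the mapping to topology-x86.c is right, the two div instructions above correspond to hwloc deriving the max threads per core and then a thread ID from it. Below is a rough Python paraphrase of that arithmetic (variable names follow the hwloc source; the cpuid values are made up for illustration, not measured inside Quark), showing how a guest that misreports cpuid can make the second division fault:

# Rough paraphrase of the suspected arithmetic in hwloc's topology-x86.c.
# Names follow the hwloc source; the cpuid values below are illustrative only.
def derive_thread_id(leaf4_eax, legacy_max_log_proc, legacy_log_proc_id):
    max_nbcores = ((leaf4_eax >> 26) & 0x3F) + 1         # shr $0x1a / lea 0x1(...)
    max_nbthreads = legacy_max_log_proc // max_nbcores   # first div %ecx
    return legacy_log_proc_id % max_nbthreads            # second div %ecx (faults if 0)

# A guest that reports many cores via cpuid leaf 0x04 but few logical
# processors via leaf 0x01 makes max_nbthreads == 0, so the modulo divides by zero.
try:
    derive_thread_id(leaf4_eax=63 << 26, legacy_max_log_proc=2, legacy_log_proc_id=1)
except ZeroDivisionError:
    print("max_nbthreads == 0: the same divide-by-zero hwloc hits natively as SIGFPE")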

chengchen666 commented 1 month ago

> https://github.com/QuarkContainer/Quark/blob/gpu-multiprocessing/test/multiprocess_torchminimal.py also fails. The test program is in the gpu-multiprocessing branch. Just running python3 multiprocess_torchminimal.py launches two CPU programs via Ray, so it requires an image with Ray preinstalled.

Sounds like making Ray work is easier. To make an env with ray and reproduce, follow this commit https://github.com/QuarkContainer/Quark/commit/a711f35b6706f004cc0f579a09926af821e84131

shrik3 commented 1 month ago

> The code at b7718 looks very much like the following source line: https://github.com/open-mpi/hwloc/blob/63a8288d31a1baf67a909466aba9a022c78ca7b1/hwloc/topology-x86.c#L727
>
> But I can't map the assembly to the C source code :-(
>
> It might be related to open-mpi/hwloc#525.
>
> The issue seems to be related to cpuid with ax=4, cx=0. The following is a test which shows that when that leaf is disabled, the issue can be skipped.
>
> e65898c

Their code does check for a zero value, so the div %ecx at b772e (f7 f1) should not happen ... I'm confused...

https://github.com/open-mpi/hwloc/blob/63a8288d31a1baf67a909466aba9a022c78ca7b1/hwloc/topology-x86.c#L730

shrik3 commented 1 month ago

More verbose logging on mpirun:

mpirun -n 3 --prtemca rmaps_base_verbose 10 --display alloc --output tag ls

shrik3 commented 1 month ago

https://github.com/open-mpi/hwloc/wiki/The-shape-of-VM-to-come

shrik3 commented 1 month ago

For the record, I tested the mfisherman/openmpi container; its mpirun (perhaps a newer library version than the PyTorch image's?) doesn't hit the divide-by-zero condition.

However, both Quark and gVisor fail to run any program using mpirun, because the CPU topology is not correctly detected (see #1291). There is no easy fix at the moment.

@chengchen666 could you test running the same mpirun command with --map-by slot:oversubscribe? This works for me (not the correct way, but at least there is no error). @QuarkContainer has a workaround for the div-by-zero condition.
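Applied to the repro above, that would look something like the following (same script path assumed):

mpirun -np 2 --allow-run-as-root --map-by slot:oversubscribe python3 /Quark/test/pytorch_minimal.py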

QuarkContainer commented 1 month ago

@chengchen666 I added some workarounds in the GPUVirtMPI branch.

To make MPI runnable, we also need to use the following command line.

mpirun --host localhost,localhost -np 2 --allow-run-as-root python3 /Quark/test/pytorch_minimal.py

chengchen666 commented 1 month ago

GPUVirtMPI branch works for me. Thanks

QuarkContainer commented 1 month ago

@chengchen666 I tried the Ray issue in the GPUVirtNew branch and can't repro the "Address" issue. The log is as below.

root@brad-MS-7D46:/var/log/quark# rm quark.log; docker run -it --runtime=quark_d -v /home/brad/rust/Quark/test:/test --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 rayllm bash -c "python3 /test/multiprocess_torchminimal.py"

=============
== PyTorch ==
=============

NVIDIA Release 23.09 (build 69180607)
PyTorch Version 2.1.0a0+32f93b1

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2023 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available. Use the NVIDIA Container Toolkit to start this container with GPU support; see https://docs.nvidia.com/datacenter/cloud-native/ .

2024-05-28 13:20:33,634 INFO worker.py:1749 -- Started a local Ray instance.
Traceback (most recent call last):
  File "/test/multiprocess_torchminimal.py", line 13, in
    @ray.remote()
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 3431, in remote
    assert len(args) == 0 and len(kwargs) > 0, ray_option_utils.remote_args_error_string
AssertionError: The @ray.remote decorator must be applied either with no arguments and no parentheses, for example '@ray.remote', or it must be applied using some of the arguments in the list ['max_calls', 'max_retries', 'num_cpus', 'num_returns', 'object_store_memory', 'retry_exceptions', '_generator_backpressure_num_objects', 'concurrency_groups', 'lifetime', 'max_concurrency', 'max_restarts', 'max_task_retries', 'max_pending_calls', 'namespace', 'get_if_exists', 'accelerator_type', 'memory', 'name', 'num_gpus', 'placement_group', 'placement_group_bundle_index', 'placement_group_capture_child_tasks', 'resources', 'runtime_env', 'scheduling_strategy', '_metadata', 'enable_task_events'], for example '@ray.remote(num_returns=2, resources={"CustomResource": 1})'.
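Incidentally, that final AssertionError is Ray rejecting the decorator form used on line 13 of the test script rather than anything Quark-specific: ray.remote accepts either a bare decorator or keyword arguments, but not an empty call. A small illustrative sketch (not the actual test file):

import ray

@ray.remote              # accepted: bare decorator
def f():
    return 1

@ray.remote(num_cpus=1)  # accepted: keyword arguments only
def g():
    return 2

# @ray.remote()          # rejected: empty parentheses raise the AssertionError above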

QuarkContainer commented 1 month ago

@chengchen666 the mpirun bug fix has been merged via https://github.com/QuarkContainer/Quark/pull/1298.

Now we don't need to add "--host localhost,localhost" to the mpirun command line.