getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Error when using DDP with more than one process #289

Open · flbbb opened this issue 1 year ago

flbbb commented 1 year ago

Hi, I have already used PyKeOps with PyTorch DistributedDataParallel, but now I am running my script on a different remote server, managed with SLURM (which I did not have for the previous experiment).

Even though I am not sure the problem is linked to SLURM, I will describe everything I did to run the script.

  1. I ran srun ... --gpus-per-node=4 --pty bash to get an interactive session on a single node.
  2. In the interactive shell I launched torchrun --nproc_per_node=1 script_training.py. It worked.
  3. I then tried with more GPUs, torchrun --nproc_per_node=4 script_training.py, and got this error:
    Traceback (most recent call last):
      File "/home/flbbb/projects/project/models.py", line 12, in <module>
        from pykeops.torch import LazyTensor
      File "/home/flbbb/.conda/envs/my_env/lib/python3.10/site-packages/pykeops/__init__.py", line 3, in <module>
        import keopscore
      File "/home/flbbb/.conda/envs/my_env/lib/python3.10/site-packages/keopscore/__init__.py", line 14, in <module>
        from .config.config import set_build_folder, get_build_folder
      File "/home/flbbb/.conda/envs/my_env/lib/python3.10/site-packages/keopscore/config/config.py", line 207, in <module>
        cuda_include_path = get_cuda_include_path()
      File "/home/flbbb/.conda/envs/my_env/lib/python3.10/site-packages/keopscore/utils/gpu_utils.py", line 71, in get_cuda_include_path
        path_cudah = get_include_file_abspath("cuda.h")
      File "/home/flbbb/.conda/envs/my_env/lib/python3.10/site-packages/keopscore/utils/gpu_utils.py", line 95, in get_include_file_abspath
        strings = open(tmp_file).read().split()
    FileNotFoundError: [Errno 2] No such file or directory: '/home/flbbb/.cache/keops2.1/build_CUDA_VISIBLE_DEVICES_0_1_2_3/tmp.txt'
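
Note that the crash happens at import time, before any of my model code runs. A stripped-down script in the same spirit (hypothetical, much simpler than my actual script_training.py) would be:

    # Hypothetical minimal reproducer, not the real script_training.py:
    # each torchrun process imports PyKeOps and runs a tiny LazyTensor
    # reduction on its own GPU.
    import os
    import torch
    from pykeops.torch import LazyTensor  # with --nproc_per_node=4 it already fails here

    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    x = torch.randn(1000, 3, device="cuda")
    y = torch.randn(2000, 3, device="cuda")
    x_i = LazyTensor(x[:, None, :])  # (1000, 1, 3), indexed by i
    y_j = LazyTensor(y[None, :, :])  # (1, 2000, 3), indexed by j
    D_ij = ((x_i - y_j) ** 2).sum(-1)  # symbolic pairwise squared distances
    print(D_ij.argmin(dim=1).shape)  # the first reduction triggers the KeOps build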

I think the error might come from my side, but I would appreciate any help. Thank you!

(BTW this library is awesome)

bcharlier commented 1 year ago

Hi @flbbb,

Since commit b08e0af1a47adb51083b24be8ece24fff5808099, the PyKeOps cache directory encodes both the hostname of the machine and the CUDA_VISIBLE_DEVICES environment variable. These separate cache directories are meant to avoid conflicts when several instances of KeOps run concurrently.

Is your home folder shared across all your nodes? Does the mentioned commit help?
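
If the race persists, a possible stopgap (an untested sketch, not an official fix) is to give each process its own build folder before the first formula is compiled, using pykeops.set_build_folder from the 2.x API:

    # Untested sketch: isolate the KeOps cache per torchrun process,
    # so concurrent first-time builds cannot race on the same folder.
    import os
    import pykeops

    rank = os.environ.get("LOCAL_RANK", "0")  # set by torchrun
    pykeops.set_build_folder(os.path.expanduser(f"~/.cache/keops_rank_{rank}"))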

geraseva commented 1 year ago

I also have problems running PyKeOps with DDP, but in my case it fails even on a single process.

[pyKeOps] Compiling libKeOpstorchb9f42b1246 in /home/domain/geraseva/.cache/pykeops-1.5-cpython-37:
       formula: Sum_Reduction(TensorProd(Exp((Minus(Sum(Square((Var(0,3,0) - Var(1,3,1))))) / (IntCst(2) * Square(Var(2,5,2))))), Var(3,3,1)),0)
       aliases: Var(0,3,0); Var(1,3,1); Var(2,5,2); Var(3,3,1); 
       dtype  : float32
... 
No such file or directory
CMake Error: Generator: execution of make failed. Make command was: /usr/bin/gmake -f Makefile VERBOSE=1 KeOps_formula && 

--------------------- MAKE DEBUG -----------------
Command '['cmake', '--build', '.', '--target', 'KeOps_formula', '--', 'VERBOSE=1']' returned non-zero exit status 1.

--------------------- ----------- -----------------
  0%|                                                                                                | 0/2139 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "ddp_training.py", line 245, in <module>
    mp.spawn(main, args=(rank_list, args, net_args), nprocs=len(rank_list))
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/mnt/storage/geraseva/dMaSIF/ddp_training.py", line 222, in main
    trainer.train(starting_epoch)
  File "/mnt/storage/geraseva/dMaSIF/ddp_training.py", line 109, in train
    self._run_epoch(i)
  File "/mnt/storage/geraseva/dMaSIF/ddp_training.py", line 75, in _run_epoch
    epoch_number=epoch,
  File "/mnt/storage/geraseva/dMaSIF/data_iteration.py", line 245, in iterate
    outputs = net(P1_batch, P2_batch)
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/storage/geraseva/dMaSIF/model.py", line 570, in forward
    conv_time, memory_usage = self.embed(P1P2)
  File "/mnt/storage/geraseva/dMaSIF/model.py", line 508, in embed
    features = self.dropout(self.features(P))
  File "/mnt/storage/geraseva/dMaSIF/model.py", line 494, in features
    batch=P["xyz_batch"],
  File "/mnt/storage/geraseva/dMaSIF/geometry_processing.py", line 484, in curvatures
    vertices, triangles=triangles, normals=normals, scale=scales, batch=batch
  File "/mnt/storage/geraseva/dMaSIF/geometry_processing.py", line 399, in mesh_normals_areas
    U = (K_ij.tensorprod(v_j)).sum(dim=1)  # (N, S*3)
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 1773, in sum
    return self.reduction("Sum", axis=axis, **kwargs)
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 744, in reduction
    return res()
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 929, in __call__
    return self.callfun(*args, *self.variables, **self.kwargs)
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/torch/generic/generic_red.py", line 579, in __call__
    *args
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/torch/generic/generic_red.py", line 48, in forward
    formula, aliases, dtype, "torch", optional_flags, include_dirs
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/keops_io.py", line 48, in __init__
    self._safe_compile()
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/utils.py", line 75, in wrapper_filelock
    func_res = func(*args, **kwargs)
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/keops_io.py", line 63, in _safe_compile
    self.build_folder,
  File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/compile_routines.py", line 260, in compile_generic_routine
    template_build_folder + os.path.sep + fname,
FileNotFoundError: [Errno 2] No such file or directory: '/home/domain/geraseva/.cache/pykeops-1.5-cpython-37//KeOps_formula.o' -> '/home/domain/geraseva/.cache/pykeops-1.5-cpython-37//build-pybind11_template-libKeOps_template_eebada125f/KeOps_formula.o'
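
For context, the reduction being compiled here (geometry_processing.py, mesh_normals_areas) is a multiscale Gaussian kernel applied to the normals. A simplified LazyTensor sketch of the same formula (made-up shapes, not the actual dMaSIF code):

    # Simplified sketch of the formula in the log above (shapes made up):
    # K_ij = exp(-|x_i - x_j|^2 / (2 s^2)) for S scales s, then sum_j K_ij ⊗ v_j.
    import torch
    from pykeops.torch import LazyTensor

    N, S = 5000, 5
    x = torch.randn(N, 3).cuda()  # points,  Var(0,3,0) and Var(1,3,1)
    v = torch.randn(N, 3).cuda()  # normals, Var(3,3,1)
    s = torch.rand(S).cuda()      # scales,  Var(2,5,2) (a parameter)

    x_i = LazyTensor(x[:, None, :])        # (N, 1, 3)
    y_j = LazyTensor(x[None, :, :])        # (1, N, 3)
    v_j = LazyTensor(v[None, :, :])        # (1, N, 3)
    s2 = LazyTensor(s) ** 2                # 5-dim parameter

    D_ij = ((x_i - y_j) ** 2).sum(-1)      # squared distances
    K_ij = (-D_ij / (2 * s2)).exp()        # (N, N, 5) multiscale Gaussian
    U = (K_ij.tensorprod(v_j)).sum(dim=1)  # (N, S*3), as at geometry_processing.py:399

If the cause is the same cache-folder race as above, pykeops 1.x also lets you relocate the cache per process with pykeops.set_bin_folder, if I read the docs correctly.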