Open flbbb opened 1 year ago
Hi @flbbb
Now (as of commit b08e0af1a47adb51083b24be8ece24fff5808099), the PyKeOps cache directory encodes both the hostname of the machine and the CUDA_VISIBLE_DEVICES environment variable. These separate cache directories are meant to avoid conflicts when running multiple instances of KeOps.
Is your home folder shared across all your nodes? Does the mentioned commit help?
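For reference, the idea behind that commit can be sketched with the standard library alone. This is a rough illustration, not the exact naming scheme PyKeOps uses: the point is only that the hostname and the CUDA_VISIBLE_DEVICES setting are encoded in the cache path, so concurrent instances compile into separate folders.

```python
import os
import socket

def per_process_cache_dir(base="~/.cache"):
    # Encode the hostname and GPU visibility in the cache path, so that
    # KeOps instances on different nodes / GPU sets never share a build folder.
    host = socket.gethostname()
    gpus = os.environ.get("CUDA_VISIBLE_DEVICES", "all")
    return os.path.expanduser(os.path.join(base, f"pykeops-{host}-gpus{gpus}"))

print(per_process_cache_dir())
```

With this scheme, two ranks on the same node that see the same CUDA_VISIBLE_DEVICES value would still collide, which is relevant to the multi-process failure reported below.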
I also have problems running PyKeOps with DDP, but it fails even on a single process.
[pyKeOps] Compiling libKeOpstorchb9f42b1246 in /home/domain/geraseva/.cache/pykeops-1.5-cpython-37:
formula: Sum_Reduction(TensorProd(Exp((Minus(Sum(Square((Var(0,3,0) - Var(1,3,1))))) / (IntCst(2) * Square(Var(2,5,2))))), Var(3,3,1)),0)
aliases: Var(0,3,0); Var(1,3,1); Var(2,5,2); Var(3,3,1);
dtype : float32
...
No such file or directory
CMake Error: Generator: execution of make failed. Make command was: /usr/bin/gmake -f Makefile VERBOSE=1 KeOps_formula &&
--------------------- MAKE DEBUG -----------------
Command '['cmake', '--build', '.', '--target', 'KeOps_formula', '--', 'VERBOSE=1']' returned non-zero exit status 1.
--------------------- ----------- -----------------
0%| | 0/2139 [00:00<?, ?it/s]
Traceback (most recent call last):
File "ddp_training.py", line 245, in <module>
mp.spawn(main, args=(rank_list, args, net_args), nprocs=len(rank_list))
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/mnt/storage/geraseva/dMaSIF/ddp_training.py", line 222, in main
trainer.train(starting_epoch)
File "/mnt/storage/geraseva/dMaSIF/ddp_training.py", line 109, in train
self._run_epoch(i)
File "/mnt/storage/geraseva/dMaSIF/ddp_training.py", line 75, in _run_epoch
epoch_number=epoch,
File "/mnt/storage/geraseva/dMaSIF/data_iteration.py", line 245, in iterate
outputs = net(P1_batch, P2_batch)
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 963, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/storage/geraseva/dMaSIF/model.py", line 570, in forward
conv_time, memory_usage = self.embed(P1P2)
File "/mnt/storage/geraseva/dMaSIF/model.py", line 508, in embed
features = self.dropout(self.features(P))
File "/mnt/storage/geraseva/dMaSIF/model.py", line 494, in features
batch=P["xyz_batch"],
File "/mnt/storage/geraseva/dMaSIF/geometry_processing.py", line 484, in curvatures
vertices, triangles=triangles, normals=normals, scale=scales, batch=batch
File "/mnt/storage/geraseva/dMaSIF/geometry_processing.py", line 399, in mesh_normals_areas
U = (K_ij.tensorprod(v_j)).sum(dim=1) # (N, S*3)
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 1773, in sum
return self.reduction("Sum", axis=axis, **kwargs)
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 744, in reduction
return res()
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/lazy_tensor.py", line 929, in __call__
return self.callfun(*args, *self.variables, **self.kwargs)
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/torch/generic/generic_red.py", line 579, in __call__
*args
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/torch/generic/generic_red.py", line 48, in forward
formula, aliases, dtype, "torch", optional_flags, include_dirs
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/keops_io.py", line 48, in __init__
self._safe_compile()
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/utils.py", line 75, in wrapper_filelock
func_res = func(*args, **kwargs)
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/keops_io.py", line 63, in _safe_compile
self.build_folder,
File "/home/domain/data/prog/miniconda3/envs/dmasif/lib/python3.7/site-packages/pykeops/common/compile_routines.py", line 260, in compile_generic_routine
template_build_folder + os.path.sep + fname,
FileNotFoundError: [Errno 2] No such file or directory: '/home/domain/geraseva/.cache/pykeops-1.5-cpython-37//KeOps_formula.o' -> '/home/domain/geraseva/.cache/pykeops-1.5-cpython-37//build-pybind11_template-libKeOps_template_eebada125f/KeOps_formula.o'
Hi, I have already used PyKeOps with PyTorch DistributedDataParallel, but now I am running my script on another remote server, managed with SLURM (which I did not have for the previous experiment).
Even though I am unsure the problem is linked to SLURM, I will describe everything I did to run the script.
I used
srun ... --gpus-per-node=4 --pty bash
to get an interactive session with only 1 node. I first ran
torchrun --nproc_per_node=1 script_training.py
and it worked. Then, with
torchrun --nproc_per_node=4 script_training.py
I got an error. I think the error might come from my side, but I would appreciate any help, thank you!
(BTW this library is awesome)
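Since single-process runs work and only the 4-process run fails, this points to a compilation race. A common workaround (a sketch, not something prescribed in this thread; in a real DDP job the synchronization primitive would be `torch.distributed.barrier()`) is to let rank 0 trigger the KeOps compilation alone, and have the other ranks wait until the cached binaries exist:

```python
import threading

def warm_up_compile(rank, compile_fn, barrier):
    # Hypothetical pattern: rank 0 runs the KeOps formula once, populating
    # the shared cache; the other ranks wait at the barrier, then run the
    # same code, which is now served from the cache without a rebuild.
    if rank == 0:
        compile_fn()
    barrier.wait()  # in a real DDP job: torch.distributed.barrier()

# Demo with threads standing in for DDP processes:
calls = []
barrier = threading.Barrier(4)
threads = [
    threading.Thread(target=warm_up_compile,
                     args=(r, lambda: calls.append("compiled"), barrier))
    for r in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(calls)  # only rank 0 compiled
```

Here `compile_fn` stands for any small call that exercises the same LazyTensor formula as the training loop, so the expensive build happens exactly once before the workers race on the cache.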