ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

cuda device with a number ignored #301

Open bernstei opened 5 months ago

bernstei commented 5 months ago

From what we can tell from playing with it, passing a device such as cuda:2 to run_train.py doesn't seem to work: it appears to still use device 0. (Note that I had to patch the CLI argument parser to accept strings like cuda:N, which I'd be happy to share.) I'd have expected to see a call to torch.cuda.set_device(N) somewhere, e.g. in https://github.com/ACEsuit/mace/blob/6df88277a2971a819b1d6177e9acbd7dc76b7c54/mace/tools/torch_tools.py#L51 Instead, it looks like the device string, including the :N, is passed directly to various torch calls throughout the code.

Has anyone actually tested this functionality?

Note that setting CUDA_VISIBLE_DEVICES before running run_train is sufficient for us, so maybe it's not important and this issue can be closed, but having code that does the wrong thing seems bad.
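For reference, a minimal sketch of the fix suggested above might look like the following. The helper names (`parse_cuda_index`, `init_device`) are illustrative, not MACE's actual code; the key point is calling `torch.cuda.set_device` with the parsed index so that bare `cuda` tensors land on the requested device:

```python
# Illustrative sketch (not the actual MACE implementation): parse a device
# string like "cuda:2" and make that index the default CUDA device.

def parse_cuda_index(device_str: str) -> int:
    """Return the device index from strings like 'cuda' or 'cuda:2'."""
    if not device_str.startswith("cuda"):
        raise ValueError(f"not a CUDA device string: {device_str!r}")
    _, _, index = device_str.partition(":")
    return int(index) if index else 0  # bare "cuda" means device 0

def init_device(device_str: str):
    import torch  # deferred so the parser is testable without a GPU
    torch.cuda.set_device(parse_cuda_index(device_str))
    return torch.device(device_str)
```

The `CUDA_VISIBLE_DEVICES` workaround mentioned above sidesteps all of this by remapping the chosen physical GPU to index 0 before the process starts, which is why it works without any code changes.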

chiku-parida commented 3 months ago

Could you please share the specific tags and modifications needed to run multi-GPU training?