JonathanSchmidt1 closed this issue 10 months ago
Thanks for submitting this issue! We'll have this fixed in an upcoming PR.
I still have some problems getting matgl to run. To get it to run on multiple GPUs, I had to change some PyTorch code to force some generators onto the GPU; however, this way I cannot get it to run with multiple workers for the data loading. Do you manage to train matgl on multiple GPUs with a non-zero worker count? If so, which torch, dgl, and matgl versions are you using?
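For reference, the kind of change I mean looks roughly like this (a minimal sketch, not the exact patch; the real change sits inside PyTorch/DGL internals, and the seed and op here are only illustrative):

```python
import torch

# Hypothetical: create the RNG on the GPU instead of the default CPU device,
# so sampling ops that consume it match the device of the surrounding tensors.
generator = torch.Generator(device="cuda")
generator.manual_seed(42)

# Ops that accept a generator then run on the same device without a mismatch.
perm = torch.randperm(10, generator=generator, device="cuda")
```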
@JonathanSchmidt1 can you open a new issue and share the script/code changes you needed to make to get it running? In the issue, please share the error messages as well.
@melo-gonzo and I can help diagnose.
@JonathanSchmidt1 Thanks for bringing this up. This is an issue I have come across as well, and there are a few ways I've been able to get around it. For starters, the matgl team has recommended adding `torch.set_default_device("cuda")` at the top of m3gnet training scripts, so we're putting together a PR to include that in the example. Unfortunately, I have not been able to resolve the num_workers issue using this method.
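As a minimal sketch, the recommended change is just a one-liner placed before any model or tensor construction (assuming a standard m3gnet training script; everything after the call is unchanged):

```python
import torch

# Recommended workaround: make CUDA the default device so all newly created
# tensors and modules land on the GPU without explicit .to(...) calls.
torch.set_default_device("cuda")

# ... rest of the m3gnet training script (dataset, model, trainer) follows.
# Note: as mentioned above, this does not play well with DataLoader workers
# (num_workers > 0), since worker processes are CPU-side.
```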
To get training running with multiple GPUs and num_workers > 0, I had to create a personal branch and add some tensor placement calls where needed. Going this route lets GPU training run without the extra `torch.set_default_device("cuda")` call and lets you set num_workers.
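Roughly, the tensor placement calls I mean look like this (a simplified sketch with dummy data, not the actual matgl code): batches are built on the CPU by the workers and moved to the GPU explicitly in the training step.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy dataset standing in for the real graph dataset.
dataset = TensorDataset(torch.randn(128, 8), torch.randn(128, 1))

# Workers run as CPU processes, so batch construction stays on the CPU.
loader = DataLoader(dataset, batch_size=16, num_workers=2)

for x, y in loader:
    # Explicit per-batch placement instead of a global default device.
    x, y = x.to(device), y.to(device)
    # ... forward/backward pass on the GPU ...
```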
Happy to keep looking for more solutions. @laserkelvin may have some ideas to try out as well.
**Expected behavior**

The m3gnet_dgl example runs without error.

**Actual behavior**

The example crashes during the first epoch.