ORNL / HydraGNN

Distributed PyTorch implementation of multi-headed graph convolutional neural networks
BSD 3-Clause "New" or "Revised" License

OC2020 check forces #253

Closed: allaffa closed this 3 months ago

allaffa commented 3 months ago

Adds functionality to check the values of atomic forces in the example that uses the Open Catalyst 2020 dataset.

frobnitzem commented 3 months ago

LGTM.

allaffa commented 3 months ago

@jychoi-hpc I am trying to regenerate the ADIOS2 binary file with the changes applied by this PR. However, the code on OLCF-Frontier seems to stall at the saving stage and never completes, even when I reserve 50 nodes for 6 hours. Could you please try to run the PR yourself and tell me if you notice the same behavior?

jychoi-hpc commented 3 months ago

I see. I will run from my side and check.

jychoi-hpc commented 3 months ago

@allaffa I found a bug in the code. It is in the ADIOS write routine that handles the case where one process has no data to write but others do. It was introduced by a fix made during the GB run (#241).

Can you add one line in hydragnn/utils/adiosdataset.py as follows and try again?

diff --git a/hydragnn/utils/adiosdataset.py b/hydragnn/utils/adiosdataset.py
index 324bb40..79d2214 100644
--- a/hydragnn/utils/adiosdataset.py
+++ b/hydragnn/utils/adiosdataset.py
@@ -107,6 +107,7 @@ class AdiosWriter:
                         break

                 for k in keys:
+                    vdim = self.comm.allreduce(0, op=MPI.MAX)
                     shape_list = self.comm.allgather([])

                 continue

jychoi-hpc commented 3 months ago

This is a new bug caused by #241. The fix above has not been committed anywhere yet. You can add the line to your local code. If it works, just commit it, and it will show up here.
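
For readers following the thread, here is a minimal sketch of why a missing collective call can stall a run like this. It does not use MPI or the HydraGNN code. It only models the order in which each rank would issue collectives, on the assumption (inferred from the one-line diff, not confirmed against the full source) that ranks with data call `allreduce` and then `allgather` once per key. The names `collective_calls` and `first_mismatch` and the example keys are hypothetical.

```python
# Toy model (no MPI): list the collectives each rank would issue, in order.
# Assumption: a rank WITH data calls allreduce then allgather for every key.

def collective_calls(has_data, keys, patched):
    """Return the ordered collectives one rank issues (hypothetical model)."""
    calls = []
    for k in keys:
        if has_data:
            calls.append(("allreduce", k))  # agree on a max value across ranks
            calls.append(("allgather", k))  # exchange per-rank shapes
        else:
            if patched:
                # The one-line fix: the empty rank joins the allreduce too.
                calls.append(("allreduce", k))
            calls.append(("allgather", k))
    return calls


def first_mismatch(per_rank_calls):
    """Return the first step at which ranks disagree, or None if they match."""
    longest = max(len(calls) for calls in per_rank_calls)
    for step in range(longest):
        ops = {calls[step] if step < len(calls) else None
               for calls in per_rank_calls}
        if len(ops) > 1:
            return step  # with real MPI, ranks could block forever here
    return None


keys = ["x", "pos"]  # hypothetical variable names
before = [collective_calls(True, keys, False), collective_calls(False, keys, False)]
after = [collective_calls(True, keys, True), collective_calls(False, keys, True)]
print("mismatch before fix at step:", first_mismatch(before))
print("mismatch after fix at step:", first_mismatch(after))
```

In this model the unpatched empty rank skips the `allreduce`, so the ranks disagree at the very first step. With real MPI collectives that kind of disagreement typically shows up as a hang rather than an error, which matches the stall reported above.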

allaffa commented 3 months ago

@jychoi-hpc PR #255 (already merged) fixed this! Thank you :) I think we are good.