allaffa closed this pull request 7 months ago.
@pzhanggit The fixes provided by this PR allow us to have a working baseline graph auto encoder in HydraGNN
I changed line 225 of the file `hydragnn/utils/distributed.py` as follows:

```python
model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
```

Setting `find_unused_parameters=True` does not make the code crash, and should allow us to track the parameters.
I also added the following lines to the `train()` function inside `train_validate_test.py`:

```python
with record_function("forward"):
    data = data.to(get_device())
    pred = model(data)
    # Print unused parameters
    unused_params = [name for name, param in model.module.named_parameters() if not param.requires_grad]
    if unused_params:
        print("Unused Parameters:")
        for name in unused_params:
            print(name)
    else:
        print("No unused parameters.")
    loss, tasks_loss = model.module.loss(pred, data.y, head_index)
```
However, no unused parameters are tracked.
https://discuss.pytorch.org/t/how-to-find-the-unused-parameters-in-network/63948/5
@pzhanggit thanks for the link.
If you do not use DDP, then you need to call `model.parameters()` as in the link you provided. If we use DDP, then replacing `model` with `model.module` (as we already do in other parts of the code) and calling `model.module.parameters()` should do the same.
Am I missing something?
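For reference, a minimal sketch of the gradient-based check described in the linked thread, adapted to the DDP snippet above (the placement within `train()` is illustrative):

```python
# requires_grad is True for every trainable parameter whether or not it
# participates in the forward pass, which is why the check above never
# finds anything. A parameter that was not used in the forward/backward
# pass still has .grad == None after backward, on the first iteration or
# after optimizer.zero_grad(set_to_none=True).
loss.backward()
unused_params = [
    name
    for name, param in model.module.named_parameters()
    if param.requires_grad and param.grad is None
]
print("Unused parameters:", unused_params if unused_params else "none")
```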
@JustinBakerMath
This PR fixes some bugs introduced by successive code developments and re-establishes the capability to create a graph auto encoder in HydraGNN that relies solely on message passing layers for node-to-node mapping predictions. This is not meant to replace the shared MLP; it just aims at providing the user with an additional (optional) architecture for nodal predictions. This way, the user can either choose to build (1) a stack of message passing layers with a shared MLP on top, or (2) use message passing layers all the way to the end.
The performance of the graph auto encoder using only message passing layers is pretty disappointing on the unit tests. However, this is in line with previous runs observed by Pei a long time ago on the FePt dataset. Since we want to further develop generative graph diffusion models on top of HydraGNN, I would like to see how graph auto encoders behave in that context.
Would you mind helping me make sure that my changes do not mess up the SchNet layer? I remember that for some equivariant models you turn off the batch norm. I would like your help to make sure that I did not add bugs when the graph auto encoder uses an equivariant message passing backend.
Thanks,
Thank you for your patience.
These updates have not introduced any errors into the implementation of EGCL or SchNet.
They have added batch normalization to the convolutional "head". The performance of the batch normalization can be assessed by overriding the `_init_node_conv()` function in the children of `Base`. This would be done in the same fashion as the existing overrides of the `_init_conv()` function (see the sketch below).
This isn't necessary for this PR. However, going forward, I would be happy to assist in assessing the batch normalization performance.
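A hypothetical sketch of what such an override might look like; apart from `Base`, `_init_node_conv()`, `_init_conv()`, and `get_conv` (all mentioned in this thread), every name below is an assumption for illustration, not HydraGNN's actual code:

```python
import torch
from hydragnn.models.Base import Base  # import path assumed

class MyStack(Base):
    def _init_node_conv(self):
        # Build the convolutional "head" for nodal predictions, mirroring
        # the existing _init_conv() overrides in children of Base.
        self.node_convs = torch.nn.ModuleList()
        self.node_batch_norms = torch.nn.ModuleList()
        for _ in range(self.num_node_conv_layers):  # assumed attribute
            self.node_convs.append(self.get_conv(self.hidden_dim, self.hidden_dim))
            # Candidate knob for the assessment: disable this for
            # equivariant backends, where batch norm can break equivariance.
            self.node_batch_norms.append(torch.nn.BatchNorm1d(self.hidden_dim))
```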
The tests failed because the changes introduced require torch-geometric>=2.4.0, which stops supporting Python 3.7, as in https://github.com/pyg-team/pytorch_geometric/pull/7939. We will come back to the PR later.
@pzhanggit Do you think that the new upgraded versions of PR#210 (now merged into the master) would allow the tests of this PR to pass after rebasing?
Yes, I will rebase now
PyG released a new version (https://github.com/pyg-team/pytorch_geometric/releases/tag/2.5.2), which includes the fix for the problem. Can we check whether pyg 2.5.2 works?

```
torch_geometric==2.5.2
```
The capability to use a full stack of convolutional layers when only nodal predictions are needed was never tested in the code. This PR:

- adds `type="conv"` as a choice for nodal heads of HydraGNN
- makes sure that `get_conv` is called with the proper number of arguments. All models except `SCFStack` take two input arguments; `SCFStack` also takes `last_layer`, which is a Boolean variable (see the sketch after this list).
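A hypothetical sketch of the call-site guard this implies; only `get_conv`, `SCFStack`, and `last_layer` come from the description above, the wrapper itself is illustrative:

```python
from hydragnn.models.SCFStack import SCFStack  # import path assumed

def build_conv(stack, input_dim, output_dim, is_last_layer):
    # SCFStack.get_conv takes a third, Boolean argument (last_layer);
    # every other model's get_conv takes only the two dimensions.
    if isinstance(stack, SCFStack):
        return stack.get_conv(input_dim, output_dim, last_layer=is_last_layer)
    return stack.get_conv(input_dim, output_dim)
```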