ORNL / HydraGNN

Distributed PyTorch implementation of multi-headed graph convolutional neural networks
BSD 3-Clause "New" or "Revised" License

Graph heads ci tests #208

Closed: allaffa closed this 7 months ago

allaffa commented 10 months ago

The capability to use a full stack of convolutional layers when only nodal predictions are needed was never tested in the code. This PR adds CI tests for that capability.

allaffa commented 10 months ago

@pzhanggit The fixes provided by this PR give us a working baseline graph autoencoder in HydraGNN.

allaffa commented 10 months ago

I changed line 225 of the file hydragnn/utils/distributed.py as follows:

model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)

Setting find_unused_parameters=True does not make the code crash, and it should allow us to track the unused parameters.

I also added the following lines to the train() function inside train_validate_test.py:

        with record_function("forward"):
            data = data.to(get_device())
            pred = model(data)
            # Report parameters flagged as not trainable (requires_grad == False)
            unused_params = [
                name
                for name, param in model.module.named_parameters()
                if not param.requires_grad
            ]
            if unused_params:
                print("Unused Parameters:")
                for name in unused_params:
                    print(name)
            else:
                print("No unused parameters.")
            loss, tasks_loss = model.module.loss(pred, data.y, head_index)

However, this check reports no unused parameters.

pzhanggit commented 10 months ago

> I changed line 225 of the file hydragnn/utils/distributed.py as follows: model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True) [...] However, no unused parameters are tracked.

https://discuss.pytorch.org/t/how-to-find-the-unused-parameters-in-network/63948/5
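
The thread linked above identifies unused parameters by checking param.grad after the backward pass, rather than requires_grad. A minimal sketch of that check; its placement right after loss.backward() and before the first optimizer step is an assumption:

        loss.backward()
        # Parameters that took no part in forward/backward still have
        # param.grad is None at this point (checked on the first step,
        # before any optimizer update has populated .grad).
        unused_params = [
            name
            for name, param in model.module.named_parameters()
            if param.grad is None
        ]
        if unused_params:
            print("Unused Parameters:")
            for name in unused_params:
                print(name)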

allaffa commented 10 months ago

> I changed line 225 of the file hydragnn/utils/distributed.py as follows: [...] However, no unused parameters are tracked.

> https://discuss.pytorch.org/t/how-to-find-the-unused-parameters-in-network/63948/5

@pzhanggit thanks for the link.

If you do not use DDP, then you need to call model.parameters() as in the link you provided. If we use DDP, then replacing model with model.module (as we already do in other parts of the code) and calling model.module.parameters() should do the same.

Am I missing something?
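
For what it's worth, DDP registers the wrapped network as its .module attribute, so both views expose the same parameter tensors. A minimal sketch, assuming an initialized process group; this is not code from the repo:

    import torch

    net = torch.nn.Linear(4, 2)
    # DistributedDataParallel stores the wrapped network as .module, so
    # parameters() on the wrapper and on .module yield the very same tensors.
    ddp = torch.nn.parallel.DistributedDataParallel(net)
    assert all(p is q for p, q in zip(ddp.parameters(), ddp.module.parameters()))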

allaffa commented 10 months ago

@JustinBakerMath

This PR fixes some bugs introduced by successive code developments and re-establishes the capability to create a graph autoencoder in HydraGNN that relies solely on message passing layers for node-to-node mapping predictions. This is not meant to replace the shared MLP; it just aims at providing the user with an additional (optional) architecture for nodal predictions. This way, the user can choose either to (1) build a stack of message passing layers with a shared MLP on top, or (2) use message passing layers all the way to the end. A sketch of the two options follows below.
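
A minimal sketch of the two nodal-prediction options, in plain PyTorch Geometric rather than HydraGNN's actual classes; all names below are illustrative:

    import torch
    from torch_geometric.nn import GCNConv

    # Illustrative nodal-prediction model, not HydraGNN code.
    class NodeHead(torch.nn.Module):
        def __init__(self, in_dim, hidden_dim, out_dim, conv_head=False):
            super().__init__()
            self.conv = GCNConv(in_dim, hidden_dim)
            self.conv_head = conv_head
            if conv_head:
                # Option 2: message passing all the way to the output.
                self.head = GCNConv(hidden_dim, out_dim)
            else:
                # Option 1: a shared MLP applied to every node embedding.
                self.head = torch.nn.Sequential(
                    torch.nn.Linear(hidden_dim, hidden_dim),
                    torch.nn.ReLU(),
                    torch.nn.Linear(hidden_dim, out_dim),
                )

        def forward(self, x, edge_index):
            h = self.conv(x, edge_index).relu()
            return self.head(h, edge_index) if self.conv_head else self.head(h)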

The performance of the graph autoencoder using only message passing layers is pretty disappointing on the unit tests. However, this is in line with previous runs observed by Pei a long time ago on the FePt dataset. Since we want to further develop generative graph diffusion models on top of HydraGNN, I would like to see how graph autoencoders behave in that context.

Would you mind helping me make sure that my changes do not mess up the SchNet layer? I remember that for some equivariant models you turn off the batch norm. I would like your help in making sure that I did not add bugs when the graph autoencoder uses an equivariant message passing backend.

Thanks,

JustinBakerMath commented 10 months ago

> This PR fixes some bugs introduced by successive code developments and re-establishes the capability to create a graph autoencoder in HydraGNN that relies solely on message passing layers [...] Would you mind helping me make sure that my changes do not mess up the SchNet layer? [...]

Thank you for your patience.

These updates have not introduced any errors into the implementation of EGCL or SchNet.

It has added batch normalization to the convolutional "head". The effect of that batch normalization can be assessed by overriding the _init_node_conv() function in the children of Base, in the same fashion as the existing overrides of the _init_conv() function; see the sketch below.
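
A hypothetical sketch of such an override; apart from Base, _init_conv(), and _init_node_conv(), every name here is an assumption made for illustration:

    # Hypothetical sketch, not actual HydraGNN code: a child of Base overrides
    # _init_node_conv() to build the node-level convolutional head without
    # batch normalization, mirroring how children already override _init_conv().
    class EquivariantStack(Base):  # assumed class name
        def _init_node_conv(self):
            # Assumed body: construct the head layers while skipping BatchNorm,
            # so the equivariance of the message passing backend is preserved.
            ...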

This isn't necessary for this PR. However, going forward, I would be happy to assist in assessing the batch normalization performance.

pzhanggit commented 9 months ago

The tests failed because the changes require torch-geometric>=2.4.0, which dropped support for Python 3.7, as in https://github.com/pyg-team/pytorch_geometric/pull/7939. We will come back to this PR later.

allaffa commented 8 months ago

@pzhanggit Do you think that the version upgrades from PR #210 (now merged into master) would allow the tests of this PR to pass after rebasing?

pzhanggit commented 8 months ago

> @pzhanggit Do you think that the version upgrades from PR #210 (now merged into master) would allow the tests of this PR to pass after rebasing?

Yes, I will rebase now.

jychoi-hpc commented 8 months ago

PyG released a new version (https://github.com/pyg-team/pytorch_geometric/releases/tag/2.5.2), which includes the fix for the problem. Can we check whether pyg 2.5.2 works?

torch_geometric==2.5.2