ORNL / HydraGNN

Distributed PyTorch implementation of multi-headed graph convolutional neural networks
BSD 3-Clause "New" or "Revised" License

Bug Fix: Preprocessing and Training with the OGB PCQM4Mv2 Dataset #262

Closed LemonAndRabbit closed 3 months ago

LemonAndRabbit commented 3 months ago

We have identified several issues with the original implementation in examples/ogb/train_gap.py that require modifications:

  1. Incompatibility with the current OGB PCQM4Mv2 dataset: The current release of the dataset includes atom types that are not listed in ogb_node_types, as well as entries with empty labels. The preprocessing code now skips these incompatible entries.
  2. Failed instantiation of AdiosDataset: The code instantiates AdiosDataset with incompatible parameters. The opt dictionary should be unpacked before its entries are passed as arguments.
  3. Broadcasting more than 2 GB of data with MPI: The Adios_writer class occasionally attempts to broadcast more than 2 GB of data in one call, which exceeds the MPI message count limit. We have implemented a chunk-based broadcasting function to address this.
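The chunk-based broadcast in item 3 can be illustrated with a minimal sketch. This is not the code from this PR: the function name `chunked_bcast` and the `chunk_bytes` parameter are hypothetical, and the sketch assumes a communicator that exposes mpi4py-style `Get_rank()` and `bcast(obj, root=...)` methods. The idea is to serialize the object once on the root rank and broadcast it in pieces that each stay well below 2 GB.

```python
import pickle


def chunked_bcast(comm, obj, root=0, chunk_bytes=2**30):
    """Broadcast `obj` from `root` in pieces of at most `chunk_bytes` bytes.

    Hypothetical helper: `comm` is assumed to provide mpi4py-style
    Get_rank() and bcast(obj, root=...) methods. On non-root ranks
    the value of `obj` is ignored.
    """
    is_root = comm.Get_rank() == root

    # Serialize once on the root; other ranks only receive.
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL) if is_root else None

    # Every rank must know the total size to run the same number of rounds.
    nbytes = comm.bcast(len(payload) if is_root else None, root=root)

    parts = []
    for start in range(0, nbytes, chunk_bytes):
        # Each round sends one slice, so no single message exceeds chunk_bytes
        # (plus a small serialization overhead).
        chunk = payload[start:start + chunk_bytes] if is_root else None
        parts.append(comm.bcast(chunk, root=root))

    # Reassemble the byte stream and rebuild the object on every rank.
    return pickle.loads(b"".join(parts))
```

With mpi4py, every rank would call something like `data = chunked_bcast(MPI.COMM_WORLD, data if rank == 0 else None)`. The default chunk size of 1 GiB is an arbitrary choice that leaves headroom below the 2 GB limit. The sketch holds an extra serialized copy of the data in memory, so a buffer-based variant may be preferable for very large arrays.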

These bug fixes are needed before we can integrate our DeepSpeed and pipeline-parallelism implementations, which use the OGB PCQM4Mv2 dataset as an example.