ORNL / HydraGNN

Distributed PyTorch implementation of multi-headed graph convolutional neural networks
BSD 3-Clause "New" or "Revised" License

gather_deg_dist needs to map 'deg' variable to GPUs before calling NCCL gathering operations #180

Closed allaffa closed 1 year ago

allaffa commented 1 year ago

All the data loaders use pin_memory=True, which works only if the data is stored on the CPU. If the data were stored entirely on the GPU, pin_memory=True would make the code crash.
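For illustration, here is a minimal sketch of the pattern described above: the dataset stays on the CPU, the loader pins its batches, and each batch is copied to the GPU afterwards. The toy dataset and the names use_pin and batches are assumptions made for this example, not HydraGNN code. Pinning is enabled only when CUDA is available so that the sketch also runs on a CPU-only machine.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset kept on the CPU (an assumption for this example).
dataset = TensorDataset(torch.arange(8, dtype=torch.float32).reshape(4, 2))

# Pinned (page-locked) host memory is only meaningful when a GPU is present.
use_pin = torch.cuda.is_available()
loader = DataLoader(dataset, batch_size=2, pin_memory=use_pin)

device = torch.device("cuda" if use_pin else "cpu")

# Batches leave the loader on the CPU and are moved to the device afterwards.
batches = [batch[0].to(device, non_blocking=True) for batch in loader]
```

Any tensor derived directly from the dataset, rather than from a batch that has already been moved, therefore also lives on the CPU.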

Since all our data is stored on the CPU, the max_deg and deg variables are also stored on the CPU. The NCCL gathering operations therefore crash, because NCCL requires the tensors to be on the GPU.

This PR fixes this problem.
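A minimal sketch of this kind of fix, and not necessarily the exact change made in this PR: move the CPU-side tensors to the device the communication backend expects before calling the collective, then bring the result back to the CPU. The helper names comm_device and gather_deg_sketch are hypothetical, and using all_reduce for both the maximum degree and the histogram sum is an assumption about how the gathering is done.

```python
import torch
import torch.distributed as dist


def comm_device() -> torch.device:
    """Device that collectives expect tensors on (hypothetical helper)."""
    if dist.is_available() and dist.is_initialized() and dist.get_backend() == "nccl":
        # NCCL only operates on CUDA tensors.
        return torch.device("cuda", torch.cuda.current_device())
    return torch.device("cpu")


def gather_deg_sketch(deg: torch.Tensor) -> torch.Tensor:
    """Sum a per-rank degree histogram across ranks; the result is on the CPU."""
    if not (dist.is_available() and dist.is_initialized()):
        # Single-process run: nothing to gather.
        return deg.clone()

    device = comm_device()

    # Ranks may see different maximum degrees, so agree on a common length.
    max_deg = torch.tensor([deg.numel() - 1], dtype=torch.long, device=device)
    dist.all_reduce(max_deg, op=dist.ReduceOp.MAX)

    # Pad the local histogram on the communication device, then sum over ranks.
    padded = torch.zeros(int(max_deg.item()) + 1, dtype=deg.dtype, device=device)
    padded[: deg.numel()] = deg.to(device)
    dist.all_reduce(padded, op=dist.ReduceOp.SUM)

    return padded.cpu()
```

Returning the result on the CPU keeps the rest of the CPU-side data pipeline unchanged, and only the collective calls see GPU tensors.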

allaffa commented 1 year ago

@pzhanggit @jychoi-hpc this PR is ready for review.