dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
13.41k stars 3k forks source link

[DistPart] Fix corner case in dist partition which always led to an assertion error being triggered. #7395

Closed thvasilo closed 4 months ago

thvasilo commented 4 months ago

Description

In some edge cases where workers end up with no rows of features to send over the network (e.g. in range partition) the existing code was creating tensors of shape (0, 0) and trying to communicate/aggregate those over the network, whereas workers that had non-zero rows assigned would communicate (num_rows, feature_dimension), leading to an assertion error being triggered in https://github.com/dmlc/dgl/blob/6f2ccbff3c94cb3f5767bdfef88f5b535d6843d3/tools/distpartitioning/gloo_wrapper.py#L144-L146

This PR handles the 2D tensor case as a special case, ensuring the correct shape for the tensors being sent over the network, (0, feature_dimension).

This PR also removes an array that was being created but never used, reducing the memory footprint of the function.

Ping @Rhett-Ying for review

Checklist

Please feel free to remove inapplicable items for your PR.

Changes

dgl-bot commented 4 months ago

To trigger regression tests:

dgl-bot commented 4 months ago

Commit ID: 8971a87fc323285a29ef38fa6b185519209a6a22

Build ID: 1

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 4 months ago

Commit ID: 5fc034371a9263198b37a20578aea1bf0da6c6fd

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented 4 months ago

Commit ID: b7fb8403de6d46185981396770f507d20cfc2c12

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

thvasilo commented 4 months ago

Hi @Rhett-Ying fixed the lint errors, is this fine to merge now?