Closed thvasilo closed 4 months ago
To trigger regression tests:
@dgl-bot run [instance-type] [which tests] [compare-with-branch]
For example: @dgl-bot run g4dn.4xlarge all dmlc/master
or @dgl-bot run c5.9xlarge kernel,api dmlc/master
Hi @Rhett-Ying fixed the lint errors, is this fine to merge now?
Description
In some edge cases, workers end up with no rows of features to send over the network (e.g. with range partitioning). In that case the existing code created tensors of shape (0, 0) and tried to communicate/aggregate them over the network, whereas workers that had non-zero rows assigned communicated tensors of shape (num_rows, feature_dimension). This shape mismatch triggered an assertion error in https://github.com/dmlc/dgl/blob/6f2ccbff3c94cb3f5767bdfef88f5b535d6843d3/tools/distpartitioning/gloo_wrapper.py#L144-L146
This PR handles the 2D tensor case as a special case, so that a worker with no rows sends a tensor of the correct shape, (0, feature_dimension), over the network.
This PR also removes an array that was being created but never used, reducing the memory footprint of the function.
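To illustrate the shape fix described above, here is a minimal pure-Python sketch that works on shape tuples only. It is not the code from this PR or from gloo_wrapper.py: the helpers `trailing_dims_agree` and `shape_for_send` are hypothetical names, and the consistency check is only an assumed stand-in for the kind of condition the assertion enforces.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def trailing_dims_agree(shapes: List[Shape]) -> bool:
    """Return True if all shapes share the same dimensions after the first.

    Hypothetical stand-in for a cross-worker consistency check; the real
    assertion in gloo_wrapper.py may test something different.
    """
    return len({shape[1:] for shape in shapes}) <= 1


def shape_for_send(num_rows: int, feature_dim: int) -> Shape:
    """Shape a worker sends for a 2D feature block, even with zero rows."""
    return (num_rows, feature_dim)


# Three workers with feature dimension 16; the second worker has no rows.
feature_dim = 16
rows_per_worker = [4, 0, 7]

# Behaviour described as the bug: the empty worker sends shape (0, 0).
old_shapes = [(n, feature_dim) if n > 0 else (0, 0) for n in rows_per_worker]

# Behaviour described as the fix: the empty worker sends (0, feature_dim).
new_shapes = [shape_for_send(n, feature_dim) for n in rows_per_worker]

print(trailing_dims_agree(old_shapes))  # False
print(trailing_dims_agree(new_shapes))  # True
```

With real tensors, the same idea would presumably amount to allocating the empty tensor with an explicit second dimension instead of letting it default to zero.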
Ping @Rhett-Ying for review
Checklist
Changes