aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications.
Apache License 2.0

Separate endpoints for recv communicators from same source endpoint #438

Closed rauteric closed 1 day ago

rauteric commented 3 weeks ago

This patch changes the plugin's endpoint creation behavior as follows:

  1. All send communicators will use the same endpoint.
  2. When creating an endpoint for a receive communicator, first check whether any existing recv endpoint is not yet connected to the source endpoint, and reuse it if so. Create a new recv endpoint only if every existing endpoint is already connected to that source endpoint.
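
The selection rule above can be modeled in a few lines. This is only an illustrative Python sketch, not the plugin's code: `EndpointPool`, `RecvEndpoint`, and the string source identifiers are hypothetical names made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RecvEndpoint:
    """Hypothetical stand-in for a receive endpoint."""
    ep_id: int
    # Source endpoints this recv endpoint is already connected to.
    connected_sources: set = field(default_factory=set)


class EndpointPool:
    """Toy model of the endpoint creation rule described in this PR."""

    def __init__(self):
        self._next_id = 0
        self.send_endpoint = None
        self.recv_endpoints = []

    def _new_id(self):
        ep_id = self._next_id
        self._next_id += 1
        return ep_id

    def get_send_endpoint(self):
        # Rule 1: every send communicator shares one endpoint.
        if self.send_endpoint is None:
            self.send_endpoint = self._new_id()
        return self.send_endpoint

    def get_recv_endpoint(self, source):
        # Rule 2: reuse the first recv endpoint that is not yet
        # connected to this source endpoint.
        for ep in self.recv_endpoints:
            if source not in ep.connected_sources:
                ep.connected_sources.add(source)
                return ep
        # Every existing endpoint already talks to this source,
        # so a new recv endpoint is needed.
        ep = RecvEndpoint(ep_id=self._new_id(), connected_sources={source})
        self.recv_endpoints.append(ep)
        return ep
```

In this model, two receive communicators from the same source land on different endpoints, while a communicator from a different source reuses the first endpoint.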

The PR also includes some supporting refactoring commits.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

AmedeoSapio commented 3 weeks ago

bot:aws:retest

rauteric commented 3 weeks ago

Force-push to 98bf2b1 is a rebase on master.

rauteric commented 3 weeks ago

Force-push to 8850320 addresses (my own) feedback

rauteric commented 3 weeks ago

(Note: the CI failure is spurious. Will try again on next revision.)

rauteric commented 3 weeks ago

Resolved ✅ Somehow I majorly butchered performance in yesterday's update to this PR, when ENDPOINT_PER_COMM=1. Need to sort that out.

Edit: yet another lesson to set NCCL_NET="AWS Libfabric" for my testing. The plugin was failing during init, and NCCL was falling through to the sockets provider without my realizing it.
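
One way to make that setting hard to forget is to build the test environment in a small helper. A minimal sketch, assuming the test is launched from Python; the `nccl_env` helper is hypothetical, while `NCCL_NET` and `NCCL_DEBUG` are standard NCCL environment variables.

```python
import os


def nccl_env(base=None):
    """Return a copy of the environment that pins NCCL to the plugin."""
    env = dict(os.environ if base is None else base)
    # Force the network by name so NCCL does not silently fall back
    # to its built-in sockets transport.
    env["NCCL_NET"] = "AWS Libfabric"
    # INFO-level logging reports which network NCCL selected,
    # unless the caller already chose a debug level.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env
```

The result can be passed as `env=nccl_env()` to `subprocess.run` when starting the benchmark binary.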

rauteric commented 3 weeks ago

Moved rail index assignment logic to platform hook and added AWS/EFA implementation.

rauteric commented 1 week ago

Force-push to 73fab34 addressed feedback.

rauteric commented 1 week ago

Force-push to 205a026 is a rebase on master.

rauteric commented 1 week ago

Rebased on master (fix for CodeChecker workflow).

rauteric commented 6 days ago

bot:aws:retest

sunkuamzn commented 2 days ago

bot:aws:retest

a-szegel commented 1 day ago

bot:aws:retest