Closed rauteric closed 1 day ago
bot:aws:retest
Force-push to 98bf2b1 is a rebase on master.
Force-push to 8850320 addresses (my own) feedback
(Note: the CI failure is spurious. Will try again on next revision.)
Resolved ✅
Somehow I majorly butchered performance in yesterday's update to this PR, when ENDPOINT_PER_COMM=1. Need to sort that out.
Edit: yet another lesson to use NCCL_NET="AWS Libfabric" for my testing. I was failing during init, falling through to the sockets provider, and not realizing it.
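For anyone reproducing the numbers, a minimal sketch of the pinning described above. `NCCL_NET` and `NCCL_DEBUG` are standard NCCL environment variables, and the network name is the one quoted in this thread. The `require_pinned_net` guard is a hypothetical helper for a benchmark wrapper script, not something in this PR.

```shell
# Pin NCCL to the OFI plugin so that a failed plugin init aborts the run
# instead of silently falling back to the sockets transport.
export NCCL_NET="AWS Libfabric"
# Optional: INFO-level logs report which network NCCL selected.
export NCCL_DEBUG=INFO

# Hypothetical guard (not part of this PR): refuse to benchmark unpinned.
require_pinned_net() {
    if [ "${NCCL_NET:-}" != "AWS Libfabric" ]; then
        echo "NCCL_NET is not pinned to the OFI plugin" >&2
        return 1
    fi
}

require_pinned_net && echo "NCCL_NET pinned: ${NCCL_NET}"
```

With the variable set, NCCL should fail at init if it cannot load the named network, which makes a broken plugin build show up as an error rather than as a performance regression.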
Moved the rail index assignment logic to a platform hook and added an AWS/EFA implementation.
Force-push to 73fab34 addressed feedback.
Force-push to 205a026 is a rebase on master.
Rebased on master (fix for CodeChecker workflow)
bot:aws:retest
bot:aws:retest
bot:aws:retest
This patch changes the plugin's endpoint creation behavior as follows:
Also has some supporting refactoring commits.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.