aws/aws-ofi-nccl
A plugin that lets EC2 developers use libfabric as the network provider when running NCCL applications.
Apache License 2.0
147 stars
56 forks
issues
torch.distributed.DistBackendError: NCCL error
#715
Chevolier
opened
15 hours ago
0
fix: Fallback to internal tuner on NCCL-2.21.5 for PAT
#714
arunkarthik-akkart
closed
1 day ago
2
platform-aws: bump libfabric requirement to 1.22
#713
aws-nslick
closed
1 day ago
3
release: v1.13.x aws
#712
aws-nslick
closed
1 day ago
0
tree: cleanup "gdr_support" variable
#711
aws-nslick
opened
1 day ago
0
core: Leave endpoint created during init
#710
bwbarrett
closed
1 day ago
2
lttng: tracepoints for eager and ctrl messages
#709
AmedeoSapio
opened
3 days ago
0
fix(cuda): delete broken using directives
#708
aws-nslick
opened
4 days ago
0
mr: add comment clarifying fi_mr_attr/ckey punning
#707
aws-nslick
opened
4 days ago
1
fix: ep release in endpoint per comm
#706
AmedeoSapio
closed
6 days ago
0
rdma: Set FI_MORE when posting receive buffers
#705
bwbarrett
closed
5 days ago
1
feat: Region-based tuner support for P5en
#704
arunkarthik-akkart
closed
1 day ago
3
reenable dmabuf by default
#703
aws-nslick
closed
5 days ago
0
rdma: Set LOW_LATENCY traffic class for control
#702
bwbarrett
closed
5 days ago
3
.ci/aws: Improve CI Speed
#701
a-szegel
closed
1 week ago
3
rdma: fix cq usage on different domains
#700
maxtmann
closed
5 days ago
1
wip: packaging scripts
#699
aws-nslick
opened
2 weeks ago
0
[v1.12.x-aws] .ci/aws: Backport a bunch of CI changes to include region-less Jenkinsfile definitions
#698
a-szegel
closed
1 week ago
3
[v1.11.x-aws] .ci/Jenkins: Backport a bunch of CI changes to include region-less Jenkinsfile definitions
#697
a-szegel
closed
1 week ago
1
[v1.10.x-aws] .ci/aws: Backport a bunch of CI changes to include region-less Jenkinsfile definitions
#696
a-szegel
closed
1 week ago
4
defaults: make dmabuf opt-in
#695
aws-nslick
closed
1 week ago
0
.ci/Jenkins: General Cleanup and Remove Region/CI From CI
#694
a-szegel
closed
2 weeks ago
1
Add platform data settings for TRN2
#693
hunnorth
closed
2 weeks ago
0
tuner: add model base tuner and refactor for co-exist
#692
taeilum00
closed
1 week ago
0
[v1.11.x] .ci/aws: Switch CI to persistent clusters with containers
#691
sunkuamzn
closed
2 weeks ago
1
[v1.12.x] .ci/aws: Switch CI to persistent clusters with containers
#690
sunkuamzn
closed
2 weeks ago
1
[Feature request] Topo file for g6e.48xlarge
#689
sean-smith
closed
2 weeks ago
2
cuda: build flag for dynamically or statically linking cudart
#688
aws-nslick
closed
2 weeks ago
0
Switch CI to persistent clusters with containers
#687
sunkuamzn
closed
2 weeks ago
2
aws: Override libfabric link_attr for certain platforms
#686
rajachan
closed
2 weeks ago
0
MR: Enforce page-aligned buffer registration for iovec and add corresponding test case
#685
mozarhua
closed
2 days ago
0
Reduce repetitive INFO printing
#684
bwbarrett
closed
2 weeks ago
0
Add option to abort() on error
#683
bwbarrett
closed
3 weeks ago
0
.ci/aws: Move p5 capacity to CGK and other changes
#682
sunkuamzn
closed
2 weeks ago
3
.ci/aws: Move p5 capacity to CGK and other changes
#681
sunkuamzn
closed
2 weeks ago
2
.ci/aws: Move p5 capacity to CGK
#680
sunkuamzn
closed
3 weeks ago
0
Fix device sorting on aws platforms
#679
bwbarrett
closed
3 weeks ago
4
prepare v1.12.1-aws
#678
aws-nslick
closed
3 weeks ago
1
prepare v1.11.1-aws
#677
aws-nslick
closed
3 weeks ago
5
Revert vf rail sorting patches
#676
bwbarrett
closed
3 weeks ago
1
Revert vf rail sorting patches
#675
bwbarrett
closed
3 weeks ago
1
Revert "platform-aws: Add EFA-specific rail sorting on VF index"
#674
bwbarrett
closed
3 weeks ago
1
rdma: add option to round robin the ctrl msg, and use shared CQs for control and data endpoints
#673
AmedeoSapio
closed
3 weeks ago
1
Add p5en platform_data and update default latency for undefined platforms
#672
rajachan
closed
4 weeks ago
0
Test CI
#671
a-szegel
closed
4 weeks ago
0
Cleanups from adding a domain interface
#670
bwbarrett
closed
4 weeks ago
2
fix: Change multiplexer scheduler to use two rails instead of three
#669
arunkarthik-akkart
closed
4 weeks ago
2
rdma: add option to round robin the ctrl msg
#668
AmedeoSapio
closed
4 weeks ago
1
Fix a number of duplicate definition names
#667
bwbarrett
closed
3 weeks ago
4
Simplify locking and enable FI_THREAD_DOMAIN
#666
bwbarrett
closed
1 month ago
1