aws / aws-ofi-nccl

This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.
Apache License 2.0

Support Red Hat Enterprise Linux 9+ #290

Open tmh97 opened 8 months ago

tmh97 commented 8 months ago

I am wondering if you all have plans for testing the plugin with RHEL 9, or if you will eventually claim support for RHEL 9.

I've done some preliminary testing with the OPX libfabric provider on a RHEL 9 system with NVIDIA A40 GPUs and have had some success running the unit tests, functional tests, and NVIDIA's NCCL tests.

bwbarrett commented 8 months ago

Good question, to which I don't have an answer. So far, we've been ultra conservative in our list of supported distributions, limiting it to those that have nightly regression testing inside AWS. This is because "supported" is one of those words with a very specific legal meaning. On the other hand, as people outside of AWS start using the plugin in their environments, that makes less sense, and we need to figure out how to split the two concepts up (probably by just having an "AWS Supported" list that's a subset, or a "known works with" list, or something). Open to feedback here.

tmh97 commented 8 months ago

@bwbarrett I think having an "AWS Supported" list and a "known works with" list is a great idea! After I do some more testing with RHEL 9 and older RHEL 8 versions, I can open a PR that closes this issue and introduces a "known works with" section into the README.

Also, I am wondering if it would be possible to add the OPX libfabric provider as a "known supported libfabric provider" at some point in the future (off-topic for this issue). Maybe at some point we could provide performance/testing results and showcase the OPX provider as an example of how non-EFA libfabric providers can benefit from adopting the work you have done. I saw some of your older YouTube talks where you were calling for this sort of community involvement. Let me know what you think! We at Cornelis very much appreciate the work you folks have done here :)

bwbarrett commented 8 months ago

Selfishly, I'd like to tie "known works with" to some regular testing, but we'd definitely love more contributions :).

tmh97 commented 8 months ago

Gotcha. Would you mind clarifying a bit?

I understand the part about more contributions, and we will definitely keep at that!