GoogleCloudPlatform / ai-infra-cluster-provisioning

Apache License 2.0

Add NCCL workload for A3 mega and update README guides. #376

Closed by samcmho 3 months ago

anthonyhan2 commented 3 months ago

I can't guarantee things won't break if we just change every fastrak to tcpxo, so I would highly recommend that @samcmho test it out first. The pkill command probably shouldn't change anytime soon.

samcmho commented 3 months ago

Tested. It runs fine on A3 Mega.