OpenFabrics / fsdp_setup

Setup scripts for use with the FSDP cluster
GNU Lesser General Public License v2.1

Need to run fsdp_setup/rdma_setup.sh on builder-00 #63

Closed. dledford closed this issue 2 years ago

dledford commented 2 years ago

@lylavoie @JSpewock While a lot of things are manually configured on builder-00, it still needs to run the rdma-setup.sh script in order to get the proper host names in /etc/hosts for the RDMA fabrics; the script will also clean up all the interface definitions. At the moment, the proper interfaces are not defined for the dhcp server to work properly. In particular, both IB and RoCE require the PKey numbers to match the defined PKey subnets, and at the moment they do not, which is why dhcp isn't working on opa0.8022 and why we are getting the wrong response from dhcp on ib0.8002, for example. Once this is run, you will be able to use the ib(), en(), and opa() commands to see the status of all of the RDMA connections at a glance. These are bash functions defined in the .bashrc file that rdma-setup.sh installs in root's home directory; if you have other things in that file you want saved, you should merge the files after running rdma-setup.sh.
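To illustrate the PKey matching described above, here is a minimal Python sketch that pulls the PKey out of an interface name such as ib0.8002 or opa0.8022 and compares it with an expected value. It assumes the suffix after the dot is the PKey written as four hexadecimal digits, which is inferred from the interface names quoted in this thread; the function names `split_pkey_interface` and `pkey_matches` are hypothetical and are not part of rdma-setup.sh.

```python
import re
from typing import Optional, Tuple

# Matches names such as "ib0.8002" or "opa0.8022": a parent device, a dot,
# and a four-digit hexadecimal PKey. This naming pattern is an assumption
# inferred from the interface names quoted in the thread.
_CHILD_IF = re.compile(r"^(?P<parent>[a-z]+\d+)\.(?P<pkey>[0-9a-fA-F]{4})$")


def split_pkey_interface(name: str) -> Optional[Tuple[str, int]]:
    """Return (parent, pkey) for a PKey child interface, or None otherwise."""
    match = _CHILD_IF.match(name)
    if match is None:
        return None
    return match.group("parent"), int(match.group("pkey"), 16)


def pkey_matches(name: str, expected_pkey: int) -> bool:
    """True if the interface name carries exactly the expected PKey."""
    parsed = split_pkey_interface(name)
    return parsed is not None and parsed[1] == expected_pkey
```

A check like this could be run over the interface names on builder-00 to spot a definition whose PKey does not line up with the subnet it is meant to serve.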

dledford commented 2 years ago

Once you run rdma-setup.sh on builder-00, you will also need to modify the dhcp server base configuration, as I enabled an extra dhcp subnet. The opa0 subnet is 172.31.20, opa0.8022 is 172.31.22, and opa0.8024 is 172.31.24.
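As a rough sanity check for the subnets listed above, the following Python sketch records the three OPA subnets and tests whether a leased address belongs to the subnet of the interface it was handed out on. The /24 prefix length is an assumption (the comment gives only the first three octets), and the names `OPA_SUBNETS` and `address_fits_interface` are hypothetical.

```python
import ipaddress

# Subnets as quoted in the comment above. The /24 prefix length is an
# assumption; the thread only states the first three octets.
OPA_SUBNETS = {
    "opa0": ipaddress.ip_network("172.31.20.0/24"),
    "opa0.8022": ipaddress.ip_network("172.31.22.0/24"),
    "opa0.8024": ipaddress.ip_network("172.31.24.0/24"),
}


def address_fits_interface(interface: str, address: str) -> bool:
    """True if the address lies in the subnet recorded for the interface."""
    network = OPA_SUBNETS.get(interface)
    if network is None:
        # Interface is not one of the OPA interfaces listed above.
        return False
    return ipaddress.ip_address(address) in network
```

For example, `address_fits_interface("opa0.8022", "172.31.24.5")` would flag a response that came from the wrong subnet, which is the kind of mismatch described earlier in this issue.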

lylavoie commented 2 years ago

@dledford Right now, the figure in the main docs doesn't show opa0.8022 and the other PKey interfaces. Can you make sure that figure is accurate for the networks the cluster will run? That is what I worked from when setting up the fabrics and IP networks. Ideally, if you can keep a log of the differences while you make the updates in this issue, that will also help a bunch.

dledford commented 2 years ago

@lylavoie Sure. I've downloaded the current version and updated it to be correct. There were some differences between the last version of the graphic I made and what was in the repo, so I'm assuming there were some miscommunications that resulted in things getting skewed. In any case, all the right interfaces/PKeys are now listed. As mentioned, if you run the rdma-setup.sh script on builder-00, all the proper interfaces will be created automatically.

lylavoie commented 2 years ago

@dledford I've run this, but I think we really need to clean some of this up if it is going to be run on builder multiple times. For example, it wipes out the lights-out interface configuration (which I put back) and it disables firewalls (bad practice). Similarly, it is still adding DNS search domains that are specific to Red Hat (i.e. lab.bos.redhat.com).

Can you confirm everything looks as you would like, so we can close this issue?

lylavoie commented 2 years ago

This has been done; the script has been run on builder.