Azure / azhpc-images

Azure HPC/AI VM Images
MIT License

add rdma_rename monitoring service #281

Closed shivanispatel closed 11 months ago

shivanispatel commented 11 months ago

Created a monitoring service that runs every 60 seconds and checks whether the rdma names are correct. If they are not, it enables and restarts the 'azure_persistent_rdma_naming' service.
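A minimal sketch of the check-and-restart loop described above, written in Python for illustration only (the PR's actual monitor may be implemented differently). It assumes device names can be read from `/sys/class/infiniband` and that an uncorrected name looks like `mlx5_<number>`; the function names `list_rdma_names`, `names_are_correct`, `restart_rename_service`, and `monitor` are hypothetical.

```python
import os
import re
import subprocess
import time

# Assumption: RDMA device names are read from sysfs; the real monitor may differ.
SYSFS_IB_DIR = "/sys/class/infiniband"
# Service name as written in the description above.
RENAME_SERVICE = "azure_persistent_rdma_naming"
# Assumption: an uncorrected device keeps a name such as mlx5_0.
DEFAULT_NAME = re.compile(r"^mlx5_\d+$")


def list_rdma_names(sysfs_dir=SYSFS_IB_DIR):
    """Return the RDMA device names currently visible, or [] if there are none."""
    try:
        return sorted(os.listdir(sysfs_dir))
    except FileNotFoundError:
        return []


def names_are_correct(names):
    """True when no device still carries a default mlx5_<number> name."""
    return not any(DEFAULT_NAME.match(n) for n in names)


def restart_rename_service(run=subprocess.run):
    """Enable and restart the rename service through systemctl."""
    run(["systemctl", "enable", RENAME_SERVICE], check=False)
    run(["systemctl", "restart", RENAME_SERVICE], check=False)


def monitor(interval=60, iterations=None, list_names=list_rdma_names,
            fix=restart_rename_service, sleep=time.sleep):
    """Check the names every `interval` seconds; iterations=None loops forever."""
    count = 0
    while iterations is None or count < iterations:
        if not names_are_correct(list_names()):
            fix()
        sleep(interval)
        count += 1
```

The name source, the fix action, and the sleep function are passed in as parameters so the loop can be exercised without touching systemd or real devices.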

For completeness, the testing process is included below:

test method 1 : mess up names and check that monitor catches the change + restarts azure_persistent_rdma_rename

  1. made sure both azure_persistent_rdma_rename service and the azure_persistent_rdma_rename_monitor service were running (I modified the azure_persistent_rdma_rename_monitor service so that every 3 minutes it restarts the rename service)
  2. stopped and disabled azure_persistent_rdma_rename service and checked that it was inactive
  3. ran my mess_up_rdma_names service (created for testing purposes) to change the rdma names
  4. confirmed that the rdma names are 'mlx5_#' (contain no "an" and no "ib")
  5. waited for 3 minutes to give the monitor time to run
  6. confirmed that azure_persistent_rdma_rename had restarted and the rdma names are now correct (contain "an" and "ib")
  7. changed the monitor to run every 1 minute instead of every 3 (it was set to 3 only to allow time to mess up the rdma names during testing)
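The "stopped and disabled" and "had restarted" checks in the steps above could be scripted roughly as follows. This is a sketch, not the procedure actually used in the PR: it relies on the standard `systemctl is-active` and `systemctl is-enabled` queries, and the helper names `service_state` and `is_stopped_and_disabled` are hypothetical.

```python
import subprocess


def service_state(unit, run=subprocess.run):
    """Return the (is-active, is-enabled) strings systemctl reports for `unit`."""
    def query(verb):
        # systemctl exits non-zero for inactive or disabled units, so check=False.
        result = run(["systemctl", verb, unit],
                     capture_output=True, text=True, check=False)
        return result.stdout.strip()
    return query("is-active"), query("is-enabled")


def is_stopped_and_disabled(unit, run=subprocess.run):
    """True when the unit is not active and is reported as disabled."""
    active, enabled = service_state(unit, run)
    return active != "active" and enabled == "disabled"
```

For example, `is_stopped_and_disabled("azure_persistent_rdma_rename")` would be expected to return `True` after step 2, and `service_state` can be called again after step 6 to see whether the monitor has re-enabled the unit.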

test method 2 : when the VM reboots the monitor should run azure_persistent_rdma_rename

  1. ensured that azure_persistent_rdma_rename service and azure_persistent_rdma_rename_monitor service were both running
  2. stopped and disabled azure_persistent_rdma_rename service
  3. ran mess_up_rdma_names service (service created for testing purposes to mess up rdma names)
  4. confirmed that the rdma names are 'mlx5_#' (contain no "an" and no "ib")
  5. ran 'sudo reboot' to reboot the VM
  6. confirmed that azure_persistent_rdma_rename service ran again
  7. confirmed that the rdma names are now correct

test method 3 : turn accelerated networking on/off to recreate the effect of adding and removing a VF device

  1. Confirm that azure_persistent_rdma_rename and azure_persistent_rdma_rename_monitor are both running
  2. Disable and stop azure_persistent_rdma_rename and confirm
  3. Mess up the rdma names (by running mess_up_rdma_names test service) and confirm that they are now incorrect
  4. Disable accelerated networking
  5. Confirm that azure_persistent_rdma_rename_monitor is still running and that, within 60 seconds (the time limit we've set), it enables and starts azure_persistent_rdma_rename
  6. Confirm that the rdma names are now correct
  7. Repeated the above steps, this time enabling accelerated networking, and confirmed that the monitor caught the name changes and fixed them
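The "within 60 seconds" confirmation in test method 3 amounts to polling a condition against a deadline. A small illustrative helper is sketched below; `wait_until` is a hypothetical name, and the 5-second poll interval is an assumption rather than anything taken from the PR.

```python
import time


def wait_until(predicate, timeout=60.0, poll=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(poll)
```

The predicate would be whatever check confirms that the names are correct again, for example a function that lists the rdma device names and verifies that none is still of the form 'mlx5_#'.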