NVIDIA / mig-parted

MIG Partition Editor for NVIDIA GPUs

Startup order #11

gfrankliu opened this issue 2 years ago

gfrankliu commented 2 years ago

Based on https://github.com/NVIDIA/mig-parted/blob/main/deployments/systemd/nvidia-mig-manager.service#L19, nvidia-mig-manager.service starts before nvidia-persistenced.service. This causes a problem: nvidia-persistenced.service is responsible for loading the nvidia kernel modules on a server, so it needs to start first; otherwise nvidia-mig-manager.service won't be able to configure MIG, because the nvidia drivers aren't loaded yet.
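
For anyone following along, the ordering edge in question can be checked on a live system with standard systemctl tooling; a minimal look (nothing here is specific to this repo):

  # Show the ordering directives in the installed unit file; the line
  # linked above is the Before= entry naming nvidia-persistenced.service.
  systemctl cat nvidia-mig-manager.service | grep -E '^(Before|After|Wants)='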

klueska commented 2 years ago

Typically, kernel modules (including the nvidia module) are loaded by systemd-modules-load.service. That service is part of sysinit.target, which nvidia-mig-manager.service depends on.
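
On a stock setup you can confirm this yourself (standard paths; adjust if your distribution packages the config elsewhere):

  # Check whether the nvidia module is listed for boot-time loading,
  # and whether the loader service ran cleanly.
  grep -rs nvidia /etc/modules-load.d/ /usr/lib/modules-load.d/
  systemctl status systemd-modules-load.service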

gfrankliu commented 2 years ago

The NVIDIA GPU driver (kernel module) is only loaded when in use and unloaded when not in use. To solve that, the persistence daemon is used. Details can be found at https://download.nvidia.com/XFree86/Linux-x86_64/396.51/README/nvidia-persistenced.html

We need to make sure that daemon is started first on a server, so that the GPU kernel module is loaded even when the GPU is not in use.
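
On our servers the daemon is started via the unit the driver packages ship; assuming it is installed under its usual name, enabling it looks like:

  # Enable and start the persistence daemon at boot.
  sudo systemctl enable --now nvidia-persistenced.service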

klueska commented 2 years ago

The nvidia kernel module is most definitely not loaded and unloaded across each use. Typically it is loaded at system boot and then remains loaded until the system is shut down.

If it is not loaded at system boot then running one of the nvidia utilities (e.g. nvidia-smi or nvidia-persistenced) will load the kernel module before it runs. It will not unload it once it is done though. It will remain loaded until the system shuts down or a user explicitly unloads it (via rmmod for example).
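
You can observe this yourself with the standard module tooling:

  # Is the module loaded right now?
  lsmod | grep '^nvidia '
  # Load it explicitly (the nvidia utilities trigger the same thing).
  sudo modprobe nvidia
  # Unload it again; this only succeeds while nothing is using the GPU.
  sudo rmmod nvidia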

What the page you linked to describes is keeping the GPU in persistence mode or not. This has nothing to do with whether the module is loaded, but rather with whether the (always loaded) module keeps GPU state alive across operations.

Without the persistenced service (or the old persistence mode being enabled), the driver tears down GPU state after each operation, making it very slow to respond. With persistenced, that state is kept alive and the driver is much more responsive.
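
For completeness, the legacy way to flip this without the daemon is the persistence-mode flag of nvidia-smi (superseded by nvidia-persistenced, but still available):

  # Enable legacy persistence mode on all GPUs.
  sudo nvidia-smi -pm 1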

In any case, it seems that your system is not loading the module during sysinit and instead relying on the persistenced service to do it for you.

I would recommend adding the nvidia module to the set of preloaded modules as is commonly done on other systems (e.g. the Nvidia DGX systems that this nvidia-mig-manager.service was built for and is tested on).
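
Concretely, that is a one-line fragment for systemd-modules-load.service to pick up (the file name is arbitrary; the module name is what matters):

  # Load the nvidia driver at every boot.
  echo nvidia | sudo tee /etc/modules-load.d/nvidia.conf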

gfrankliu commented 2 years ago

You are right that the modules are indeed autoloaded, but there seems to be a slight delay, and nvidia-mig-manager.service sometimes misses it since it is a oneshot service. By the time I log in to check, I do see the modules loaded, and if I manually run nvidia-mig-manager.service again, it works fine. Maybe we could have nvidia-mig-manager.service wait and retry.
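
Something like this drop-in would approximate that on our side (a hypothetical override, not something mig-parted ships; the grep pattern and timeout are my own guesses):

  # Make the oneshot unit wait for the nvidia module before it runs.
  sudo mkdir -p /etc/systemd/system/nvidia-mig-manager.service.d
  cat <<'EOF' | sudo tee /etc/systemd/system/nvidia-mig-manager.service.d/wait-for-module.conf
  [Service]
  ExecStartPre=/bin/sh -c 'until lsmod | grep -q "^nvidia "; do sleep 1; done'
  TimeoutStartSec=120
  EOF
  sudo systemctl daemon-reload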

gfrankliu commented 2 years ago

Would it cause any potential issues if we were to move nvidia-persistenced.service from "Before" to "After"?

Unlike nvidia-mig-manager.service (a oneshot), persistenced is a daemon, so it can catch up even if it starts before mig-manager.
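
i.e. an override along these lines (just illustrating the question; note that an empty Before= clears the packaged list, so any remaining orderings would have to be restated):

  # Hypothetical drop-in expressing the proposed swap.
  sudo systemctl edit nvidia-mig-manager.service
  # contents of the override:
  #   [Unit]
  #   Before=
  #   After=nvidia-persistenced.service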

klueska commented 2 years ago

In general, the nvidia-mig-manager.service needs to come online and run to completion before all nvidia services other than nvidia-persistenced, nvidia-fabricmanager and nv_peer_mem. On my system this includes:

  nvidia-mlnx-config.service
  nvsm-api-gateway.service
  nvsm-mqtt.service
  nvsm-notifier.service
  nvsm-plugin-environment.service
  nvsm-plugin-gpumonitor.service
  nvsm-plugin-memory.service
  nvsm-plugin-monitor-environment.service
  nvsm-plugin-monitor-network.service
  nvsm-plugin-monitor-pcie.service
  nvsm-plugin-monitor-storage.service
  nvsm-plugin-monitor-system.service
  nvsm-plugin-network.service
  nvsm-plugin-pcie.service
  nvsm-plugin-processor.service
  nvsm-plugin-storage.service
  nvsm-selwatcher.service
  nvsm.service

The reason is that these services become clients of the GPU, preventing nvidia-mig-manager from resetting any GPUs when MIG mode changes are necessary.
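
(To see the full effective ordering on a given box, you can dump the computed ordering sets directly:)

  # Print the ordering edges systemd computed for the unit.
  systemctl show -p Before -p After nvidia-mig-manager.service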

Unfortunately, the only dependency these services have in the systemd dependency graph is on sysinit.target (which is why we explicitly say that nvidia-mig-manager has to run Before= that target completes, so we can be sure that nvidia-mig-manager is finished before any of these services start).

Likewise, the nvidia-persistenced service also has a dependency on sysinit.target, so it can't be moved before the nvidia-mig-manager service as long as we need nvidia-mig-manager to run before sysinit.target.

The right way to do this would be to have all of the nvidia services that can become clients of the GPU depend on an intermediate target that, in turn, depends on sysinit.target. Unfortunately, that is not how things are set up at the moment, so this is the best we can do.
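
As a sketch of that idea (nvidia.target is a hypothetical name here, not something the driver packages actually ship):

  # Hypothetical intermediate target between sysinit.target and the
  # GPU-client services.
  cat <<'EOF' | sudo tee /etc/systemd/system/nvidia.target
  [Unit]
  Description=Point after which NVIDIA GPU client services may start
  Requires=sysinit.target
  After=sysinit.target
  EOF
  # Each GPU-client service would then add:   After=nvidia.target
  # and nvidia-mig-manager.service would add: Before=nvidia.target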