CFSworks / nvml_fix

A workaround for an annoying bug in nVidia's NVML library. Allows nvidia-smi to work once more!

Document process for figuring out offsets for a new driver #10

Closed danielkza closed 7 years ago

danielkza commented 7 years ago

I would like to try to get this working with newer driver versions, but I have no idea how the offsets for past driver versions were determined. I would be willing to follow the process myself, but it isn't documented anywhere.

CFSworks commented 7 years ago

The process of trying to understand how the NVML library decides whether or not a card is supported is more of an art than a science. I'm not sure if Nvidia even still caches that value in the device struct, so there may not be "offsets" anymore. I haven't personally updated this library in a long time because my card (TITAN) was whitelisted in later versions of NVML.

The easiest way to go about doing this would be to run your libnvidia-ml.so through a disassembler, look at a library entry point that only works on certain cards (such as nvmlDeviceGetUtilizationRates), and determine the control flow that will cause it to return NVML_ERROR_NOT_SUPPORTED. The driver EULA does forbid disassembling under its reverse-engineering clause, so doing this would put you in violation of the license agreement, for what it's worth.

If you don't want to violate the EULA and/or don't want this to be a manual process, you could perhaps automate it by fuzzing: write a small C program that initializes NVML, gets a device, modifies some offsets in the device struct as specified on the command line, then attempts a call and reports its success in the exit code. Then write a script that keeps running the C utility with different offsets until something succeeds. I can't guarantee that this will succeed: since Nvidia seems adamant that the NVML library shouldn't work correctly on some of their products, they may have changed it to check the card type every time a call is made, in which case there is no cached value to find. I also can't guarantee that this is safe: it might cause NVML to misbehave and write to some hardware register that bricks your GPU.
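
A minimal sketch of the core of such a probe utility, under some loud assumptions: `probe_api`, `poke_u32`, `probe_offset` and the `fake_*` stand-ins are hypothetical names, not part of nvml_fix or NVML. The callbacks are meant to be thin wrappers around real NVML calls (`nvmlInit`, `nvmlDeviceGetHandleByIndex`, a gated call such as `nvmlDeviceGetUtilizationRates`, and `nvmlShutdown`). The fake "library" only exists so the harness can be dry-run without touching a GPU; it pretends the "supported" flag lives at byte offset 8, which says nothing about where (or whether) real NVML keeps one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Exit codes a driver script can tell apart. */
enum { PROBE_OK = 0, PROBE_REJECTED = 1, PROBE_SETUP_FAILED = 2 };

/* Hypothetical table of thin wrappers around the NVML calls being exercised.
 * Each int-returning callback yields 0 on success (NVML_SUCCESS is 0). */
struct probe_api {
    int  (*init)(void);                /* e.g. wraps nvmlInit() */
    int  (*get_device)(void **device); /* e.g. wraps nvmlDeviceGetHandleByIndex(0, ...) */
    int  (*gated_call)(void *device);  /* e.g. wraps nvmlDeviceGetUtilizationRates() */
    void (*shutdown)(void);            /* e.g. wraps nvmlShutdown() */
};

/* Overwrite 4 bytes at byte `offset` inside the object `base` points to. */
static void poke_u32(void *base, size_t offset, unsigned int value)
{
    assert(base != NULL);
    memcpy((char *)base + offset, &value, sizeof value);
}

/* Try one (offset, value) pair and report whether the gated call then works. */
static int probe_offset(const struct probe_api *api, size_t offset,
                        unsigned int value)
{
    void *device = NULL;
    int result;

    if (api->init() != 0)
        return PROBE_SETUP_FAILED;
    if (api->get_device(&device) != 0 || device == NULL) {
        api->shutdown();
        return PROBE_SETUP_FAILED;
    }

    /* With a real library this write can corrupt its internal state. */
    poke_u32(device, offset, value);
    result = (api->gated_call(device) == 0) ? PROBE_OK : PROBE_REJECTED;

    api->shutdown();
    return result;
}

/* Stand-in "library" for dry runs: a 16-byte device struct whose made-up
 * "supported" flag is the unsigned int at byte offset 8. */
static unsigned int fake_device[4];

static int fake_init(void)
{
    memset(fake_device, 0, sizeof fake_device);
    return 0;
}

static int fake_get_device(void **device)
{
    *device = fake_device;
    return 0;
}

static int fake_gated_call(void *device)
{
    /* 3 is the numeric value of NVML_ERROR_NOT_SUPPORTED. */
    return ((unsigned int *)device)[2] == 1 ? 0 : 3;
}

static void fake_shutdown(void) { }

static const struct probe_api fake_api = {
    fake_init, fake_get_device, fake_gated_call, fake_shutdown
};
```

A `main` would parse the offset and value from `argv` and return the result of `probe_offset`, so the outer script only has to look at the exit code. Running each attempt in a fresh process matters here: a bad write may well crash the probe, and the script can treat a crash as just another failed offset.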

If Nvidia did change it and remove the "device supported" flag from the device struct, binary modifications may be necessary to get it to work. Redistributing modified binaries is against the EULA, so this repository would probably get removed from GitHub via DMCA takedown request by Nvidia if I went that route. The nvml_fix wrapper may instead have to mprotect NVML's "is supported" routine to be writable, replace a few instructions, then mprotect back to executable. An in-memory binary modification is not redistribution, so this doesn't violate Nvidia's EULA.

The last idea I can come up with would be to take an older version of libnvidia-ml.so that still works with nvml_fix, and try to get it to work with a newer kernel driver. I haven't attempted this, so I don't know how difficult that would be to accomplish, but if it worked that might be the direction this repository should go. :)

danielkza commented 7 years ago

@CFSworks Thank you very much for your directions. I'll evaluate your suggested approaches and see if I can get one of them to work, and post a PR if I do.