CFSworks / nvml_fix

A workaround for an annoying bug in nVidia's NVML library. Allows nvidia-smi to work once more!

Would Really Love A Update For Newer Drivers #9

Closed ghost closed 5 years ago

ghost commented 8 years ago

Well I guess I'm too late to the party... :(

Sure would be nice to have this updated for the latest 367.44 drivers.

Thanks

victorhcm commented 7 years ago

I'm also interested, but I have no clue whether this would work for driver versions higher than 331. From what I've seen, I think a newer version of nvml.h would be required, but I'm not sure where to find it.

It is being used in nvidia-docker, is it this same version?

flx42 points out that it should be in /usr/local/include/nvml.h, but I can't find it on my server.

victorhcm commented 7 years ago

@CFSworks can you give me any pointers on nvmlDevice_t? Is there any documentation describing what each index represents? I'm working on a patch for 352.39 and later.

void fix_unsupported_bug(nvmlDevice_t device)
{
    unsigned int *fix = (unsigned int *)device;
#if defined(NVML_PATCH_319) || defined(NVML_PATCH_325)
# ifdef __i386__
    fix[201] = 1;
    fix[202] = 1;
# else
// ...

mstnb commented 7 years ago

@victorhcm I also just found this, but I'm trying to build it for the 378.13 driver. I have the same problem with the indexes, but the new nvml.h came with CUDA (on my Arch Linux installation) and is located at /opt/cuda/include/nvml.h.

Would love to get this to work!

victorhcm commented 7 years ago

@linux-addict you're using CUDA 8.0, right? Do you want to get in touch so we can work this out?

mstnb commented 7 years ago

@victorhcm Yes I am. Sure, I currently don't have much time, but we can definitely try.

mstnb commented 7 years ago

@victorhcm Where would you like to get in touch? I think I made some progress.

victorhcm commented 7 years ago

Send me an email, it is victo<...>.

mstnb commented 7 years ago

@victorhcm Have you received my e-mail?

yesme commented 7 years ago

@victorhcm @linux-addict Any luck here? It sounds to me the key is to find the declaration for nvmlDevice_st. Once we know its structure, we know what bit to set...
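
To illustrate why the declaration matters, here is a minimal sketch that assumes a made-up layout. The real nvmlDevice_st is not published, so `mock_device_st`, its field names, and its sizes are all hypothetical. It only shows how a known field offset would translate into the kind of word index the existing patch writes to.

```c
#include <assert.h>
#include <stddef.h>

/* HYPOTHETICAL layout: the real nvmlDevice_st is not published.
   Field names and sizes are invented purely for illustration. */
struct mock_device_st {
    unsigned int reserved[4];
    unsigned int is_supported;      /* pretend this is the flag we want */
    unsigned int more_reserved[3];
};

/* The handle is just a pointer to the struct, mirroring how nvml.h
   exposes nvmlDevice_t as a pointer to an incomplete type. */
typedef struct mock_device_st *mock_device_t;

/* Convert a field's byte offset into an index into an unsigned int array. */
static size_t word_index(size_t byte_offset)
{
    assert(byte_offset % sizeof(unsigned int) == 0);
    return byte_offset / sizeof(unsigned int);
}

/* Set the flag through a cast, in the same style as the existing patch. */
static void set_supported_flag(mock_device_t device)
{
    unsigned int *fix = (unsigned int *)device;
    fix[word_index(offsetof(struct mock_device_st, is_supported))] = 1;
}
```

With a real declaration in hand, the magic numbers in the patch could be replaced by `offsetof` expressions like the one above; without it, the index has to be found some other way.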

mstnb commented 7 years ago

@yesme I couldn't work on it for the last few weeks, but as far as I remember, it is indeed. The problem is that this declaration is nowhere to be found... First we have to get it working as normal with the shim in between; then we could find the bit by trial and error or simply with debugging. The problem is that I can't get it to work completely (nvidia-smi crashes), but as soon as this is done we should be able to find the correct bit. Do you want to join us?
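
The "shim in between" idea can be sketched roughly as follows. This is a mock, not the project's code: the types, the `MOCK_FLAG_INDEX` value, and `mock_real_get_handle` are all hypothetical stand-ins, and in an actual shim the `real` function pointer would presumably be resolved from the genuine library (for example with dlsym) rather than passed in by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the NVML types; a real shim would include nvml.h instead. */
typedef struct mock_device_st { unsigned int words[8]; } *mock_device_t;
typedef int mock_return_t;          /* 0 plays the role of a success code */

/* Signature of the wrapped call (hypothetical, modelled on a handle lookup). */
typedef mock_return_t (*get_handle_fn)(unsigned int index, mock_device_t *device);

/* HYPOTHETICAL word index, not a real driver offset. */
enum { MOCK_FLAG_INDEX = 5 };

/* Forward the call, then patch the handle only if the real call succeeded. */
static mock_return_t shim_get_handle(get_handle_fn real, unsigned int index,
                                     mock_device_t *device)
{
    assert(real != NULL && device != NULL);
    mock_return_t ret = real(index, device);
    if (ret == 0 && *device != NULL) {
        unsigned int *fix = (unsigned int *)*device;
        fix[MOCK_FLAG_INDEX] = 1;
    }
    return ret;
}

/* Mock "real" library function used to exercise the shim. */
static struct mock_device_st mock_storage;

static mock_return_t mock_real_get_handle(unsigned int index, mock_device_t *device)
{
    if (index != 0)
        return 1;                   /* any nonzero value means failure here */
    *device = &mock_storage;
    return 0;
}
```

Calling `shim_get_handle(mock_real_get_handle, 0, &dev)` returns the mock handle with the flag word set, while a failing lookup is passed through untouched. Getting this pass-through path stable comes first, because the offset search only makes sense once the unpatched shim behaves like the original library.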

yesme commented 7 years ago

@linux-addict I worked on it over the past few days but had no luck. I will move on anyway. Nvidia sucks.

imriss commented 7 years ago

How can someone join the effort?

ghost commented 7 years ago

I posted this a year ago, and see it dug up again, and I now can't remember what this is for...

LOL

So is this still some issue in the latest drivers?

375.66 runs great here....

@yesme LOL, Nvidia is nice in Linux, used it for many years without problems...

pszi1ard commented 7 years ago

To those wanting an update: what features do you expect if you happen to be able to hack nvml to set application clocks? FWIW on Maxwell there was a noticeable effect, but on Pascal I can't see any on the TITAN cards (where it's supported). I don't think it's worth going down this path, as any nvml behavior will require the driver and firmware of GeForce cards to allow/support the desired feature.

I'd strongly recommend pushing NVIDIA to give back the freedom to users to use their GPUs in a more flexible manner, e.g. allow fixed SM/mem clock speeds, allow setting fixed fan speed (or fan profile) -- i.e. allow disabling the gamer optimization. For this I strongly encourage everyone to:

cdarken commented 6 years ago

@pszi1ard best move is to stop buying nvidia crap. I know I will, because it's obvious they barely care.

ghost commented 6 years ago

@cdarken I've been using Nvidia in Linux for 15 years and I've never had a problem ever, so I'm either very lucky, or Nvidia's not crap, take your choice... LOL

Truth is, Nvidia isn't crap, and the support is still a lot better than AMD GPU in Linux...

I just bought a EVGA 1060 GTX I'm running, and it's doing great!

I won't argue that AMD has a nice open source path they are taking, and it would be great to see Nvidia do the same, but to say crap, well, that's not really fair to say, because if you think Nvidia is crap, I would assume you haven't used ATI/AMD GPUs in Linux very long...

You want to game in Linux, then Nvidia is still the way to go...

Cheers

JamAndCheese commented 6 years ago

I have a couple of old cards that return "N/A" on nvidia-smi.

I have been tracing my way through using gdb and I am gaining an understanding of where things are and how they are called. I still have nothing to go on editing memory locations like @CFSworks did.

@linux-addict I would like to join anybody trying to make progress here.

rschwieterman commented 6 years ago

Would also like to be involved with any efforts.

pszi1ard commented 6 years ago

> @pszi1ard best move is to stop buying nvidia crap. I know I will, because it's obvious they barely care.

@cdarken That's everyone's choice to make, but frankly for many workloads no processor can beat NVIDIA's GPUs (especially high-end GP102 and GV100).

> @cdarken I've been using Nvidia in Linux for 15 years and I've never had a problem ever, so I'm either very lucky, or Nvidia's not crap, take your choice... LOL

A single data-point is indeed very representative ;)

> Truth is, Nvidia isn't crap, and the support is still a lot better than AMD GPU in Linux...

FTFY: Truth is, for some use-cases Nvidia isn't crap, for others it is an utter pile of horse-shit and for many use-cases the support is a lot better than AMD GPU on Linux, but for others it's a pathetic mess. ;)

cdarken commented 6 years ago

@pszi1ard I recently switched back to Radeon, and the support they offer for Linux is miles above what it was a few years ago. I was surprised to see that you can monitor temperatures, voltages, and power draw, and that you can overclock, control fan speeds and so on. And in the next versions of the kernel we'll get Wattman support.

pszi1ard commented 6 years ago

@cdarken Sure, enjoying a smooth desktop and some lightweight gaming mostly works on AMD, or so I've heard (admittedly I've only used Intel and NVIDIA for display for the last 5-7 years). Try to do compute and you'll see that even when AMD GPUs should in theory be competitive (at least in perf, or rarely in perf/W), the software stack is a major PITA and is rarely free of major hurdles.

mstnb commented 6 years ago

@JamAndCheese I'm very busy with my studies now... I will tell you when I find some time to begin working on it again.

JamAndCheese commented 6 years ago

@linux-addict No problem. If you want to throw whatever you did my way, I can start adding to it.

tofurky commented 6 years ago

edit: this commit was merged into the main repo today via https://github.com/CFSworks/nvml_fix/pull/12

for those on 390.x on x86_64, please try https://github.com/tofurky/nvml_fix/tree/nvidia-390

similar code has been working well for me for a couple months now. initially i'd broken backwards compatibility with some brutally hacked in code, but refactored it tonight into something that's more suitable for submitting a PR to this repo.

also, if anyone is able, i'd appreciate if someone could test whether old versions are still working in my nvidia-390 branch (i.e. 331.x, 325.x, 319.x).

i've got some WIP stuff to brute-force the offsets for new versions of the nvidia driver. my goal is to add a make target e.g. "make brute-force" which tests all permutations for the bytes to flip in nvmlDevice_t device. a rough-draft version of it is how i figured out the new 390.x offsets.
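
for illustration only, a brute-force search of this kind might look roughly like the sketch below. it is not the WIP code mentioned above: `MOCK_WORDS`, `mock_probe`, and the index 11 are hypothetical, and the probe stands in for whatever NVML query currently comes back unsupported. it also tries one word at a time, whereas the existing patch sets two adjacent words, so a real search would likely have to try combinations as well.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_WORDS 16   /* HYPOTHETICAL search window, in unsigned ints */

/* Probe callback: returns nonzero when the previously failing query works. */
typedef int (*probe_fn)(const unsigned int *words);

/* Try each word in turn: set it to 1, probe, then restore the old value.
   Returns the first index that makes the probe pass, or -1 if none does. */
static int brute_force_offset(unsigned int *words, size_t count, probe_fn probe)
{
    assert(words != NULL && probe != NULL);
    for (size_t i = 0; i < count; i++) {
        unsigned int saved = words[i];
        words[i] = 1;
        int ok = probe(words);
        words[i] = saved;           /* leave the handle as we found it */
        if (ok)
            return (int)i;
    }
    return -1;
}

/* Mock probe: pretends word 11 is the flag the library checks. */
static int mock_probe(const unsigned int *words)
{
    return words[11] == 1;
}
```

restoring each word before moving on keeps a wrong guess from corrupting the handle for later probes, which matters because a bad write into the real structure can crash the caller.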

tofurky commented 5 years ago

closing as newer drivers are now supported.