foucault / nvfancontrol

NVidia dynamic fan control for Linux and Windows
GNU General Public License v3.0

Just playing around with the new RTX code #24

Closed · Gordin closed this 3 years ago

Gordin commented 4 years ago

I've been playing around with the new API stuff (mainly trying to figure out how Rust works...). I don't expect a merge, I'm just making the PR in case you want to grab something from what I did.

For now, I've just combined the set_fanspeed and set_ctrl_type methods into a single set_fancontrol method that sets both with one API call using the new API. So far everything works on my RTX 2070.
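Roughly, the merged call looks like the sketch below. This is only illustrative: the struct and field names (NvGpuFanCoolersControl, mode, level, FAN_MODE_*) are placeholders for whatever the real layout turns out to be, not confirmed NVAPI definitions.

    // Hypothetical sketch: one call sets both the control mode and the speed,
    // replacing the old API's separate set_ctrl_type + set_fanspeed calls.
    fn set_fancontrol(&self, gpu: u32, speed: u32, manual: bool) -> Result<(), String> {
        let mut ctrl = NvGpuFanCoolersControl::new();
        ctrl.coolers[0].mode = if manual { FAN_MODE_MANUAL } else { FAN_MODE_AUTO };
        ctrl.coolers[0].level = speed; // duty cycle in percent
        match unsafe {
            NvAPI_GPU_SetClientFanCoolersControl(self.handles[gpu as usize], &mut ctrl)
        } {
            0 => Ok(()),
            i => Err(format!("NvAPI_GPU_SetClientFanCoolersControl() failed; error {}", i))
        }
    }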

I've also tried to look further into how Afterburner uses the API. This is what a call to GetClientFanCoolersInfo gets me on my RTX 2070. I think the "2C 06 01 00" corresponds to the version field in your NvGpuFanCoolersInfo struct. Before the call everything other than the version is 00.

Looks like Afterburner uses _reserved1 (from your struct) as another check that the call succeeded: if _reserved1 isn't 1 it just errors out. Also, I think there is a field missing in your struct. The "1C 0C 00 00" => 3100 seems to be the maximum RPM of the fan; at least the max RPM of my GPU fan happens to also be 3100...

Also, if the 3 bytes before that mark the available coolers, it would suggest that I have 2 different fans to control. That might be true, since Gigabyte advertises my card as having 3 fans with alternating spin directions, so maybe I can control the 2 outer fans independently of the middle one. I'll try to play around with that.

foucault commented 4 years ago

Merging the two functions into one is good, but my main concern is the use of compile flags on Windows. We probably need a boolean array (of length NVAPI_MAX_PHYSICAL_GPUS) in the NvidiaControl struct to identify, at runtime, which of the enumerated GPUs support the new API. I was thinking of running NvAPI_GPU_GetClientFanCoolersStatus on library initialisation to populate that vector. It's a little chatty, but it's only done once (well, unless you hotplug GPUs!!)
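Something along these lines, as a minimal sketch; it assumes the NvAPI_GPU_GetClientFanCoolersStatus binding and NvGpuFanCoolersStatus struct from this branch, and the variable names are illustrative:

    // Probe every enumerated GPU once at initialisation; a GPU that answers
    // the new call gets flagged as new-API capable.
    let mut has_new_api = [false; NVAPI_MAX_PHYSICAL_GPUS];
    for gpu in 0..(gpu_count as usize) {
        let mut status = NvGpuFanCoolersStatus::new();
        let ret = unsafe {
            NvAPI_GPU_GetClientFanCoolersStatus(handles[gpu], &mut status)
        };
        has_new_api[gpu] = ret == 0;
    }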

> Also, I think there is a field missing in your struct. The "1C 0C 00 00" => 3100 seems to be the maximum RPM of the fan; at least the max RPM of my GPU fan happens to also be 3100...

If that's the case then what's the max_rpm field in the NvFanCoolerInfo struct? What does it report for you?

> Looks like Afterburner uses _reserved1 (from your struct) as another check that the call succeeded: if _reserved1 isn't 1 it just errors out.

Have you checked what _reserved1 shows in your case? It might help us identify its connection to support (or not) of the new API.

> Also, if the 3 bytes before that mark the available coolers, it would suggest that I have 2 different fans to control. That might be true, since Gigabyte advertises my card as having 3 fans with alternating spin directions, so maybe I can control the 2 outer fans independently of the middle one. I'll try to play around with that.

This is possible, but the fans should still have been enumerated if that were the case (COOLER-0, COOLER-1, etc.). I think that's the point of the coolers and count fields in NvGpuFanCoolersStatus and NvGpuFanCoolersInfo. Again, I can't check because my GPU has only got a single fan :-( .

Additionally, I think there might be a bug in the new code: NvAPI_GPU_GetClientFanCoolersInfo is never used. It should have been used in fn gpu_coolers(). What happens if you replace lines 761--775 with the following code?

        #[cfg(feature="rtx")] {
            let mut info = NvGpuFanCoolersInfo::new();
            match unsafe {
                // Query the new RTX-era info entry point directly
                NvAPI_GPU_GetClientFanCoolersInfo(self.handles[gpu as usize],
                    &mut info)
            } {
                0 => {
                    // Success: enumerate one id per reported cooler
                    let count = info.count as u32;
                    Ok(Cow::Owned(
                        (0..count).collect::<Vec<u32>>()))
                },
                i => Err(format!("NvAPI_GPU_GetClientFanCoolersInfo() failed; error {}", i))
            }
        }

The above might explain the erroneous enumeration of coolers if you do indeed have > 1 independent coolers.

Sorry I know it's a lot of things but I'm really dependent on you and your GPU for debugging :-D. Again thanks so much!

Gordin commented 4 years ago

> Merging the two functions into one is good, but my main concern is the use of compile flags on Windows. We probably need a boolean array (of length NVAPI_MAX_PHYSICAL_GPUS) in the NvidiaControl struct to identify, at runtime, which of the enumerated GPUs support the new API. I was thinking of running NvAPI_GPU_GetClientFanCoolersStatus on library initialisation to populate that vector. It's a little chatty, but it's only done once (well, unless you hotplug GPUs!!)

Yeah, I figured you were only using the compile flag because we don't know how to check for the new API yet, and there's no code that would switch between APIs anyway. I'm not at the stage of rewriting everything yet; I'd never used Rust before looking at this project...

> Additionally, I think there might be a bug in the new code: NvAPI_GPU_GetClientFanCoolersInfo is never used. It should have been used in fn gpu_coolers(). What happens if you replace lines 761--775 with the following code?

        #[cfg(feature="rtx")] {
            let mut info = NvGpuFanCoolersInfo::new();
            match unsafe {
                // Query the new RTX-era info entry point directly
                NvAPI_GPU_GetClientFanCoolersInfo(self.handles[gpu as usize],
                    &mut info)
            } {
                0 => {
                    // Success: enumerate one id per reported cooler
                    let count = info.count as u32;
                    Ok(Cow::Owned(
                        (0..count).collect::<Vec<u32>>()))
                },
                i => Err(format!("NvAPI_GPU_GetClientFanCoolersInfo() failed; error {}", i))
            }
        }

> The above might explain the erroneous enumeration of coolers if you do indeed have > 1 independent coolers.

I'll try that. I know the Info call isn't actually used in the code yet; it was just the first thing I looked at in the debugger.

> Sorry I know it's a lot of things but I'm really dependent on you and your GPU for debugging :-D. Again thanks so much!

No problem, I'm just learning new stuff ^^ I was just saving my work and committed what I have; I'll look at the contents of the struct when I have time again, probably today/tomorrow depending on your timezone.

Gordin commented 4 years ago

Update: I got the call to the new API for cooler info working. It seems the version that the new API calls get passed is calculated differently from just the size of the struct. Afterburner does this:

memset(&X, 0, 0x62c);                           // zero a 0x62c-byte struct
X = 0x1062c;                                    // first dword = the version value
ClientFanCoolers_GetInfoQuery(gpu_handle, &X);

So the version value would be 0x62c + 0x10000. The 0x10000 is consistent with the other new calls, so it's probably just there to mark this as the new API. I've split this up into version and new_api fields as u16s, and it works if you initialise version with 0x62c and new_api with 1. I just have no idea how the 0x62c is calculated; every calculation I've tried with the size of the struct doesn't get me to 0x62c. I also tried to fill the struct up until the size (in bytes) matches, but everything after the max RPM field (which works and gets me 3100) doesn't get touched by nvapi. I'll just leave the value hardcoded and extract the values for the other new API calls from Afterburner.
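For reference, that split decomposes neatly on a little-endian machine, which also matches the "2C 06 01 00" bytes I saw earlier. A minimal sketch (the struct name is made up):

    // The version dword split into the two u16s described above; in memory
    // (little-endian) this lays out as 2C 06 01 00.
    #[repr(C)]
    struct NvVersionHeader {
        version: u16, // 0x062C
        new_api: u16, // 1 -- the 0x10000 part shared by the new calls
    }

    // ...or as the single u32 that Afterburner writes:
    const COOLERS_INFO_VER: u32 = 0x62C | (1 << 16); // = 0x0001_062C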

foucault commented 4 years ago

I'm a bit concerned; the max_rpm in NvGpuFanCoolersInfo just doesn't make sense, since the same field exists per cooler in NvFanCoolerInfo, unless it's some sort of global maximum RPM (but we cannot know, since there is no documentation :-| ). What's the value of the max_rpm field in NvFanCoolerInfo? Is it consistent with Afterburner?

Gordin commented 4 years ago

Hm, I just noticed I actually removed NvFanCoolerInfo from the struct to fit the max_rpm into the call, but I probably should have aligned NvFanCoolerInfo in the struct so that it lines up with the max_rpm value... I'll try that; it probably makes more sense. That would mean I only have 1 cooler though, since 3100 only appears once.

Gordin commented 4 years ago

Yeah, I was just being stupid... I tried to fit the rpm field into the structs without looking at what was in NvFanCoolerInfo... The calculation for the version field also works like in the old API. This is what I get from a call to NvAPI_GPU_GetClientFanCoolersInfo (with 28 more empty cooler fields):

Gordin commented 4 years ago

I also looked at NvAPI_GPU_GetClientFanCoolersControl and NvAPI_GPU_GetClientFanCoolersStatus. Everything in the structs looks like you would expect, and none of the _reserved fields have any values in them. The only maybe-weird thing: the _reserved1 from the Control structs also stays 0. In the Info call I renamed it to call_success, because Afterburner quits if it isn't 1, but I guess for Control it's different. I'm going to look in Afterburner again to see what they use it for.

Gordin commented 4 years ago

Hm, looks like Afterburner just doesn't check the reserved value from the GetControl call. GetInfo in Afterburner bails out if the return code of the call isn't 0 or if the second u32 isn't 1, but GetControl only checks the return code.

But so far the usage of the new API seems to work. I guess I'll try to dig around in Afterburner and find out how they actually decide which API to use. They have an IsSupported function that just returns the value of a variable; maybe I can figure out how that is set.

Gordin commented 4 years ago

Just posting to remember where I left off next time... I can definitely say that MSI uses some part of the result of a ClientFanCoolers_GetInfo call to determine whether the card can use the new API. So far I've traced it back to here (my naming for the variables; api_or_fan_count is the reserved field that I renamed to call_success, and the "Array" is probably not an array, but I didn't have a better name yet). Basically, if the NVAPI call succeeded and the first byte after the version is 1 (call_success), then it grabs a value from some struct, ORs it with 0x40, and that's what Afterburner later uses to see whether the new API is supported. It's a bit field and is also used to determine PS20_IsSupported, VFCurve_IsSupported and VFCurve_IsAdjustedClock. No idea if that's important for nvfancontrol.
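In rough pseudocode the flow looks like this; every name here is my own labelling (or hypothetical), not a real Afterburner symbol:

    // Hedged reconstruction of the support check described above.
    const NEW_FAN_API_BIT: u32 = 0x40;

    // After ClientFanCoolers_GetInfo: set the bit only if the call returned 0
    // and the call_success field is 1.
    fn record_support(ret_code: i32, call_success: u32, gpu_features: &mut u32) {
        if ret_code == 0 && call_success == 1 {
            *gpu_features |= NEW_FAN_API_BIT;
        }
    }

    // What IsSupported presumably does later: test that bit in the bit field.
    fn is_supported(gpu_features: u32) -> bool {
        gpu_features & NEW_FAN_API_BIT != 0
    }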

Next time I'll try to find out whether the value in (uint)*gpuFeaturesArray comes from another NVAPI call. All I know is that the memory location changes every run...

Gordin commented 4 years ago

Small update: I dug around a bit more. Looks like there is a point in Afterburner where they just try to get the count of GPUs with both the old and the new API and go from there. It's a bit weird though: in all the other calls they do a memset first, and the size of the memory is the version for the call, but in this call they don't do a memset, they just pass pointers to a new variable. I still have to figure out how to call that correctly in nvfancontrol. Because there is no memset, I also haven't found out what the call returns, because I don't know where to look yet...

I also looked in nvapi.dll itself and actually found a complete index of all the undocumented functions. Their list has 1885 functions in total; very few seem to be placeholders, so most should be callable. There don't seem to be any useful debug symbols left in the DLL, so understanding what the functions do would take some time... I've added them to the Query enum in my last commit. Next I'm going to try to extract the version field and memory size they need, so I can play around with some of them.

foucault commented 4 years ago

I must admit I'm a bit lost here! Too much information! Joking aside, the functions themselves are not necessarily the problem but rather the actual signatures (and the structures that go along with them). As this has expanded a bit beyond the initial scope, I'll probably go for the solution that I initially proposed, just to close the existing issue with RTX cards. You can keep probing if you want, so that we can create a more permanent solution by uncovering more of the NVAPI. I'm a bit busy at the moment but I'll get to it once I have some more free time.

Gordin commented 4 years ago

Another update... I've added the full list of undocumented functions in NvAPI, along with their query codes and call signatures, to the repo, and also the Ghidra script I wrote to generate it. Maybe this doesn't really belong here, but I want to keep everything in one place for now... Also, the signatures of the functions aren't 100% accurate; some parameters that are marked as uint actually contain addresses.

I also managed to call the nvapi function that didn't use memset. Turns out I reimplemented NvAPI_EnumPhysicalGPUs from the official API by accident, so now we can call that directly with the query code without going through the official wrapper... That doesn't really help on its own, but it'll help me figure out what Afterburner uses. Apparently they chose not to use the wrapper from the official docs and call everything with a query code instead. I guess they have some documentation that we don't >_>
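For the record, calling an entry point by query code goes through the exported nvapi_QueryInterface function, which hands back a raw function pointer. A sketch, assuming the 0xE5AC921F ID for NvAPI_EnumPhysicalGPUs that circulates in community headers (treat the ID and signature as assumptions):

    use std::mem::transmute;
    use std::os::raw::{c_uint, c_void};

    const NVAPI_MAX_PHYSICAL_GPUS: usize = 64;
    type NvPhysicalGpuHandle = *const c_void;
    type EnumPhysicalGpusFn =
        unsafe extern "C" fn(*mut NvPhysicalGpuHandle, *mut u32) -> i32;

    #[link(name = "nvapi64")]
    extern "C" {
        // The only documented export of nvapi64.dll; everything else is
        // reached through it by query code.
        fn nvapi_QueryInterface(id: c_uint) -> *const c_void;
    }

    unsafe fn enum_gpus_by_query_code() -> Result<u32, String> {
        let ptr = nvapi_QueryInterface(0xE5AC921F); // assumed EnumPhysicalGPUs id
        if ptr.is_null() {
            return Err("query code not recognised".into());
        }
        let func: EnumPhysicalGpusFn = transmute(ptr);
        let mut handles = [std::ptr::null(); NVAPI_MAX_PHYSICAL_GPUS];
        let mut count: u32 = 0;
        match func(handles.as_mut_ptr(), &mut count) {
            0 => Ok(count),
            i => Err(format!("EnumPhysicalGPUs failed; error {}", i)),
        }
    }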

Gordin commented 4 years ago

Oh, I forgot: if you have any builds that you want me to test on older and newer cards, I can do that now. I did this https://twitter.com/g0rdin/status/1267506774699118592 so now I have an RTX 2070 Super and a GTX 970 in my system, and nvfancontrol picks both up correctly. I haven't really tested anything beyond the enumeration yet though. I'll do some testing in the next week to see what parts of the RTX code work on the GTX.

Gordin commented 4 years ago

Here is a comparison of an RTX 2070 Super and a GTX 970: https://imgur.com/a/5Z6raXK Looks like the RTX supports only the new API and the GTX only the old one. I haven't yet found any call that works on both cards and indicates which API to use, but I guess doing one of the calls and deciding the API based on the return value should work for now.

Gordin commented 4 years ago

I'm playing around with the code right now to get both cards working at the same time: basically just adding a bool array that controls whether to use the new API, and having 2 implementations for some calls. So far I'm getting at least the cooler enumeration for both cards.

INFO - NVIDIA driver version: 446.14
INFO - NVIDIA graphics adapter #0: GeForce RTX 2070 SUPER
INFO -   GPU #0 coolers: COOLER-0
INFO - NVIDIA graphics adapter #1: GeForce GTX 970
INFO -   GPU #1 coolers: COOLER-0

I was looking at NvidiaControl and NVFanManager to see where best to make the decision about the API, but it looks like the code isn't really set up to handle multiple different cards right now (or different settings for multiples of the same card, for that matter). Seems like in order to do this right there should be one NvidiaControl or NVFanManager per GPU, and some settings might need to be moved around. Or am I missing something here?

Gordin commented 4 years ago

Got it working with both cards now (by running two instances for now) :) https://imgur.com/7sRwnIu

Going to update the pull request a bit now.

Gordin commented 4 years ago

Also, I intentionally separated the RTX calls from the old API to make it easier to have two completely different implementations of NvidiaControl. There's probably a better way to do this with traits somehow, but nvfancontrol is still my first experience with Rust, so I'm not sure how exactly.
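For what it's worth, the trait route could look something like the sketch below; all type and method names are hypothetical, not actual nvfancontrol items, and the bodies are stubs standing in for the respective NVAPI calls:

    // One trait, two implementations; pick per GPU after probing which API answers.
    trait FanApi {
        fn coolers(&self, gpu: u32) -> Result<Vec<u32>, String>;
        fn set_fancontrol(&mut self, gpu: u32, percent: u32) -> Result<(), String>;
    }

    struct LegacyApi; // would wrap the pre-RTX calls
    struct RtxApi;    // would wrap the ClientFanCoolers* calls

    impl FanApi for LegacyApi {
        fn coolers(&self, _gpu: u32) -> Result<Vec<u32>, String> {
            todo!("old API cooler enumeration")
        }
        fn set_fancontrol(&mut self, _gpu: u32, _percent: u32) -> Result<(), String> {
            todo!("old API fan control")
        }
    }

    impl FanApi for RtxApi {
        fn coolers(&self, _gpu: u32) -> Result<Vec<u32>, String> {
            todo!("NvAPI_GPU_GetClientFanCoolersInfo")
        }
        fn set_fancontrol(&mut self, _gpu: u32, _percent: u32) -> Result<(), String> {
            todo!("new API fan control")
        }
    }

    fn api_for(new_api: bool) -> Box<dyn FanApi> {
        if new_api { Box::new(RtxApi) } else { Box::new(LegacyApi) }
    }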

foucault commented 4 years ago

That's good progress; good detective work. We need to clean up that code and make a unified solution for both RTX and pre-RTX cards. Seems straightforward from now on.

Gordin commented 4 years ago

> We need to clean up that code and make a unified solution for both RTX and pre-RTX cards.

I'm guessing you mean refactoring by "make a unified solution for both RTX and pre-RTX cards", but just to be clear: this branch already works with pre-RTX and RTX cards and chooses the correct API automatically; it's just the cleanup that's missing.

foucault commented 4 years ago

OK, so I finally found some time to test this with a couple of RTX cards (sorry that took a while!), and it seems to be working properly. More testing is probably required, but there is only so much we can do at this point! There are a few things left to do before properly merging into master.

(a) Clean up windows.rs to remove functions that are not directly relevant to the functionality of nvfancontrol. We definitely don't need that long list of FuncXXXXs.

(b) We probably don't need the debugging script in the main repository, though maybe we can keep a list of the available functions for future reference. On the other hand, if you believe it's useful to have around, it might be good to clean it up and document it properly so that it's easier for everyone to follow. Maybe move it to a util folder or something similar, along with the json file.

(c) Finally, it might be good to squash the commits into a single one, or at least into conceptually separate commits, so that the history looks a bit cleaner.

Again, thanks for your effort; it's been thorough and very detailed. It really improves the project :-D