Saancreed / wine-nvml

NVIDIA Management Library wrapper for Wine
GNU Lesser General Public License v2.1

NVML crashing when running GPU CapsViewer #13

Closed SveSop closed 1 year ago

SveSop commented 1 year ago

Possibly older versions too, but I have not tested much. Running winehq-staging-8.10 for Ubuntu 22.04 from the WineHQ repo works fine, but winehq-devel-8.10 and, as I tried just now, TKG-staging-8.10 both seem to crash for some reason unknown to me.

I am not really claiming it is nvml, but I found it strange that it did not run on the TKG-staging source, since that is supposed to use the wine-staging source with the TKG/Proton++ patches you set up (fsync, fshack and such). Running GPU Caps Viewer without NVML works fine, although without the hardware info for clocks and whatnot that nvapi picks up from nvml.

I ran it like this:

WINEDEBUG=-all,+nvml wine ./GPU_Caps_Viewer.exe > caps.log 2>&1

caps_staging.log is winehq-staging-8.10 from the WineHQ Ubuntu repo.
caps_tkg.log is wine-tkg-staging from https://github.com/Frogging-Family/wine-tkg-git/tree/master/wine-tkg-git

caps_staging.log caps_tkg.log

Ideas? Wine did not seem to provide much of interest, and as usual a +relay log with wine ends up pretty huge and full of unrelated cra**.

Saancreed commented 1 year ago

Thanks for the report, I'll take a look either later today (not very likely though) or tomorrow.

Saancreed commented 1 year ago

Huh, this is weird. The log contains both nvmlInit_v2 and calls made by dxvk-nvapi but there is also another nvmlInit_v2 as if GPU Caps Viewer itself tried to use NVML directly, together with some functions that dxvk-nvapi doesn't use. This is unexpected because the log suggests that this is a 32-bit application and on Windows NVML is available only for 64-bit applications. How does it even work on Windows then? :thinking:
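One way to double-check the bitness of an executable or DLL, rather than inferring it from the log, is to read the Machine field of its PE header. The following is only a minimal sketch: the helper name `pe_machine` and the two-entry lookup table are made up for illustration, and only the two machine types relevant to this thread are named.

```python
import struct

# Only the two machine types relevant here; others are returned as hex.
MACHINES = {0x014C: "x86 (32-bit)", 0x8664: "x86-64 (64-bit)"}


def pe_machine(data):
    """Return a description of the Machine field of a PE image held in bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE image")
    # e_lfanew: offset of the PE header, stored at 0x3C in the DOS header
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 2-byte Machine field directly follows the 4-byte signature
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINES.get(machine, hex(machine))
```

To use it, read the first few kilobytes of a file such as gpushark.exe in binary mode and pass the bytes to `pe_machine`.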

SveSop commented 1 year ago

It seems to sometimes crash when running GPU Caps Viewer and using "More GPU Info" (which starts gpushark.exe) even without nvml usage... So that would be nvapi crashing, I guess.

jp7677 commented 1 year ago

The crash is inside gpushark. I still had a very old version here, 0.22.0 from GPU Caps Viewer 1.51 that works just fine, but gpushark from GPU Caps Viewer 1.60 (presumably 0.29.4) indeed crashes.

At a very first glance the tails of the nvapi logs show nothing that sticks out:

0.22:

...
NvAPI_GPU_GetAllClockFrequencies: No implementation
NvAPI_GPU_GetPstates20: No implementation
NvAPI_GPU_GetPstates20: No implementation
NvAPI_QueryInterface (0x655dcd32): Unknown function ID
NvAPI_QueryInterface (0x65b1c5f5): Unknown function ID
NvAPI_GPU_GetPstates20: No implementation
(continues to call `NvAPI_GPU_GetPstates20`)...

0.29:

...
NvAPI_GPU_GetAllClockFrequencies: No implementation
NvAPI_GPU_GetPstates20: No implementation
NvAPI_GPU_GetPstates20: No implementation
wine: Unhandled page fault on write access to 00000000 at address 004089D8 (thread 0124), starting debugger...
...
Backtrace:
=>0 0x004089d8 in gpushark (+0x89d8) (0x02e8ff30)
  1 0x7b62a1a0 in kernel32 (+0x2a1a0) (0x02e8ff48)
  2 0x7bc5d247 in ntdll (+0x5d247) (0x02e8ff5c)
  3 0x7bc5daa8 in ntdll (+0x5daa8) (0x02e8ffec)

SveSop commented 1 year ago

I think I had a somewhat stubbed implementation of GetPstates20 someplace.. I'll see what comes up when I pop that in, maybe.. I also recently installed a newer version of GPU Caps Viewer, so that is probably the reason I started noticing this as well...

EDIT: Somewhat interesting that i am able to run gpushark directly with Ubuntu version of winehq-staging-8.10, but not GPU Caps Viewer 😏

SveSop commented 1 year ago

> The crash is inside gpushark. I still had a very old version here, 0.22.0 from GPU Caps Viewer 1.51 that works just fine, but gpushark from GPU Caps Viewer 1.60 (presumably 0.29.4) indeed crashes.

GPU Caps Viewer 1.56.0 seems to work fine, as does gpushark 0.27.0, using wine-staging-8.10 with nvml.. So I guess some new call was put in for a newer version or something. 1.57.0 and newer crash.. The changelog for 1.57.x reads, among other things: Updated for NVAPI R525....

Saancreed commented 1 year ago

How am I supposed to download this thing? I found a URL on the site which leads to https://geeks3d.com/downloads/2023p/GPU_Caps_Viewer_1.60.0.0.zip but that just redirects me back to the front page.

SveSop commented 1 year ago

I was able to download the new and older versions without issues... It just takes a few seconds for the "download button" thingy to pop up on the page, and it should not redirect back. I use Firefox.. dunno if that could be something (and I use the uBlock ad blocker, which may squelch some of such annoyances).

Saancreed commented 1 year ago

Okay, I just had to use a browser with stock configuration and MS Edge did the job. Seems to work on wine-tkg-staging-protonified-8.7.r3 at least: image

I'll try upgrading my system to Wine 8.10 soon but there was some breakage around 32-bit applications so I decided to wait a bit, maybe it's a good time to revisit that idea.

SveSop commented 1 year ago

Hmm.. so it might be a wine-8.10 issue then? I just tested GE-Proton-8.8, and GPU Caps Viewer 1.60 + gpushark worked there as well. Wonder what happened.

SveSop commented 1 year ago

For some unknown, interesting reason, it seems wine-8.10 loads some 32-bit stuff from syswow64, whereas proton-8 only loaded system32 (64-bit) DLLs? If you run WINEDEBUG=-all,+loaddll wine ./gpushark.exe, I can't see proton-8 loading anything from syswow64, while 8.10 loads (amongst other things) winedbg.exe from syswow64, which to ME indicates actually loading the 32-bit debugger for some weird reason?

Strange, no? Might actually just be some fubar on the wine side of things to "finish the PE conversion", and something goes awry there.. I think 8.11 is coming this weekend, so we might just chill a couple of days and see if something pops up with 8.11 in that regard.. Nice comparison tho, as I did not really think of testing older versions of wine.

SveSop commented 1 year ago

Just tested wine-devel-8.11 for Ubuntu 22.04, and there was no change..

I use dxvk + nvapi (from my nvidia-libs repo) WITHOUT using nvml - GPU Caps Viewer 1.60 will start, although not showing temps and whatnot that need nvml. Running gpushark will cause a crash. If I enable nvml by copying it into the binary structure of wine-devel and running wineboot -u, it is the other way around! gpushark will fire up and show temps and whatnot and seemingly work fine, however GPU Caps Viewer will crash 🤔

I have no clue what's up with that tbh.. It seems clear that it is some change to the way 32-bit libs get loaded on newer wine versions, but why it would "switch" like this I dunno. The apps use somewhat different calls ofc, and it also depends on whether nvml is present or not, so it could be something horribly trivial like a TRACE line of sorts (easily crashing with those pesky casts if done strangely, I guess).

Anyway.. no immediate change with 8.11, so investigation goes forward i guess 😄

Saancreed commented 1 year ago

> I use dxvk + nvapi (from my nvidia-libs repo) WITHOUT using nvml - GPU Caps Viewer 1.60 will start, although not showing temps and whatnot that need nvml. Running gpushark will cause a crash. If I enable nvml by copying it into the binary structure of wine-devel and running wineboot -u, it is the other way around! gpushark will fire up and show temps and whatnot and seemingly work fine, however GPU Caps Viewer will crash

I rebuilt wine-tkg based on staging-8.11 and I'm seeing the same thing over here :weary:

But when I export WINEDLLOVERRIDES=nvml= so NVML is disabled, the result is the same as it was for me on 8.7: Caps Viewer works but gpushark crashes, so it's possible that gpushark just doesn't handle the case of missing NVML correctly. However, Caps Viewer crashing and only when NVML is enabled on newer Wine versions is concerning, I'll try to take a look at that soon. I'm just still wondering how this works on Windows if there's no 32-bit NVML over there and both GPU Caps Viewer and gpushark are 32-bit applications.

Saancreed commented 1 year ago

I have no idea how to reproduce this on plain or staging flavors of Wine 8.11, even with DXVK, nvapi and nvml in the prefix all I see is GPU n: Wine Adapter for both my GPUs and vendor/device IDs for both of them are just zeroed:

image

From what I can tell, GPU Caps Viewer and gpushark don't even attempt to load dxgi, nvapi or nvml, they all just bail on "unknown vendor". So far, only protonified builds of wine-tkg seem to correctly report vendor ID, causing nvml to be loaded and crashing one of the apps later on.

Actually, never mind that, I just failed to install nvapi correctly, now it crashes on plain Wine as well.

Saancreed commented 1 year ago

I've bisected this to https://gitlab.winehq.org/wine/wine/-/commit/354a8bb1f4a65bdec052606f2799db9e2907b5b1, reverting that commit allows the application to run with NVML even on wine-tkg-staging-protonified 8.11, but I'm still not sure if this isn't an application bug. I suppose I should report this to WineHQ.

SveSop commented 1 year ago

Hmm.. some sort of 32-bit heap issue you think?

I suppose the "shared 64-bit" crud that wine is moving towards, which lets 32-bit code run just as well on 64-bit arch, has something to do with it? (Explanation sux.. hehe).. But from my very limited understanding, is that not what is going to happen, and what IS happening on a 64-bit windows installation these days? No huge need for 32-bit libraries, since 32-bit programs will use 64-bit libs/dlls just as well?

So.. a 32-bit app loads the 64-bit dll and uses that even under 32-bit calls or someshit - at least that is the gist of what I thought would be the deal with the "new" "32on64" mode that is somewhat experimental in wine atm? That IS what windows is using nowadays I think, and you do not need 32-bit libs anymore, and whatever is there on a live install is just old backwards compatibility things....

Would be interesting to experiment a bit with this eventually.. See what happens with the nvapi/nvml/nvcuda stuff if wine is compiled as this "32on64" method.

https://github.com/Frogging-Family/wine-tkg-git/blob/master/wine-tkg-git/wine-tkg-profiles/advanced-customization.cfg#L67-L69

Saancreed commented 1 year ago

> Hmm.. some sort of 32-bit heap issue you think?

Perhaps? This looks like the program expected heaps to be larger and now they aren't anymore so it accesses memory past the boundary of what it owns. But, eh, it's just a speculation, I don't know enough about memory management stuff at the OS level to actually claim anything of value here.

> Would be interesting to experiment a bit with this eventually.. See what happens with the nvapi/nvml/nvcuda stuff if wine is compiled as this "32on64" method.

The new and shiny experimental wow64 mode probably won't work with nvml and/or nvcuda because they always load a native Linux library of matching bitness and make no attempt to expose a wow64 interface for a 32-bit PE library to use.

Anyway, I reported my findings here: https://bugs.winehq.org/show_bug.cgi?id=55140

SveSop commented 1 year ago

Reverted this commit and it seems to work fine with wine-tkg-staging-8.11, IF you use nvml ... If one does not, it crashes gpushark.exe like before. Could it be some error in whatever library newer versions of GPU Caps Viewer use, perhaps?

Version 1.60.0.0 - 2023.05.25

  • added support of NVIDIA GeForce RTX 4060 Ti.
  • added support of AMD Radeon RX 7600.
  ! minor changes in the creation of the 3D window for the demos.
  ! updated: GPU Shark 0.29.4.0.
  ! updated: ZoomGPU 1.37.4 (GPU monitoring library).
  ! updated: GeeXLab libs version 0.52.0.

ZoomGPU ? Wonder if one can update/downgrade this.. https://www.geeks3d.com/20091218/zoomgpu-sdk-1-3-0/

Saancreed commented 1 year ago

> If one does not, it crashes gpushark.exe like before. Could it be some error in whatever library newer versions of GPU Caps Viewer use, perhaps?

I don't know, and tbh I don't care enough about it to want to figure it out. One could say that something not working without NVML is most likely not a wine-nvml bug 🙃

Also, I just remembered…

> For some unknown, interesting reason, it seems wine-8.10 loads some 32-bit stuff from syswow64, whereas proton-8 only loaded system32 (64-bit) DLLs? If you run WINEDEBUG=-all,+loaddll wine ./gpushark.exe, I can't see proton-8 loading anything from syswow64, while 8.10 loads (amongst other things) winedbg.exe from syswow64, which to ME indicates actually loading the 32-bit debugger for some weird reason?

I think that was just a red herring. There could be some changes in how loaddll reports what is being loaded (because, after all, system32 is virtualized for 32-bit apps, so to them it looks as if it contained 32-bit libraries), so this might not be a very reliable method of determining this. Something like lsof, which shows you UNIX paths for open files, would probably be better, if you can catch the application before it dies.
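A rough sketch of that idea, assuming a Linux host: besides lsof, the UNIX path of every file mapped into a process can be read from /proc/<pid>/maps. The helper name `mapped_libraries` is made up for illustration, and whether every PE module shows up as a file-backed mapping under a given Wine build is an assumption here, not something verified.

```python
import os


def mapped_libraries(maps_text, suffixes=(".dll", ".so")):
    """Return the unique file paths from /proc/<pid>/maps text that look like libraries."""
    paths = []
    for line in maps_text.splitlines():
        # Line format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        name = os.path.basename(path).lower()
        # Match "foo.dll", "foo.so" and versioned names such as "foo.so.1"
        if any(name.endswith(s) or (s + ".") in name for s in suffixes):
            if path not in paths:
                paths.append(path)
    return paths
```

While the application is still alive, something like `mapped_libraries(open(f"/proc/{pid}/maps").read())` would then show whether a DLL came from a 32-bit or a 64-bit directory on the host filesystem.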

SveSop commented 1 year ago

> Also, I just remembered…
>
> > For some unknown, interesting reason, it seems wine-8.10 loads some 32-bit stuff from syswow64, whereas proton-8 only loaded system32 (64-bit) DLLs? If you run WINEDEBUG=-all,+loaddll wine ./gpushark.exe, I can't see proton-8 loading anything from syswow64, while 8.10 loads (amongst other things) winedbg.exe from syswow64, which to ME indicates actually loading the 32-bit debugger for some weird reason?
>
> I think that was just a red herring. There could be some changes in how loaddll reports what is being loaded (because, after all, system32 is virtualized for 32-bit apps, so to them it looks as if it contained 32-bit libraries), so this might not be a very reliable method of determining this. Something like lsof, which shows you UNIX paths for open files, would probably be better, if you can catch the application before it dies.

That is probably true 👍 Yeah, I'll look into that, and I will see if I can compile one of my cuda tests as 32-bit and see what windows 10 actually does behind the scenes when it comes to loading .dll's when running in "32-bit mode".

SveSop commented 1 year ago

Hmpf.. After actually running a few 32-bit nvapi/cuda apps on my windows 10 box, c:\windows\sysWOW64 seems to contain the 32-bit .dll's from nvidia. Not entirely sure if i dreamt that i could not find them, or if they are somewhere in the driver repo that windows has (i think) in the c:\windows\WinSxS folder and it gets copied over once you actually use a 32-bit app perhaps? I WAS so sure i did not find it before, but then again... i had not done any 32-bit tests on my windows install either.

Maybe it is just some deal where it wont get copied over until you actually need it or sumtin? Or.. it could be the heat these days that plays tricks with my mind 😏

No c:\windows\sysWOW64\nvml.dll tho, but that you already knew.. GPU Caps Viewer and its like would not use that in windows anyway i would think, since nvapi gets all the needed information directly internally.

SveSop commented 1 year ago

Did some debugging on a couple of the CUDA samples in Visual Studio, and looked at which .dll's were loaded.

Running the 64-bit version of SimpleD3D12 showed lines like this:
'simpleD3D12.exe' (Win32): Loaded 'C:\Windows\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_675be35f1ba2315e\nvcuda64.dll'.

Running the 32-bit version of SimpleD3D10 showed this:
'simpleD3D10.exe' (Win32): Loaded 'C:\Windows\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_675be35f1ba2315e\nvcuda32.dll'.

So i guess when running 32/64 bit nvidia apps under windows, it grabs whatever dll it needs from the DriverStore folder (kinda was on the right track there 😄) Have not gotten around to testing what happens if i compile something using nvml tho...

Saancreed commented 1 year ago

You could also try running GPU Caps Viewer or gpushark on Windows and looking at the libraries they load using something like procexp.

SveSop commented 1 year ago

> You could also try running GPU Caps Viewer or gpushark on Windows and looking at the libraries they load using something like procexp.

Thanks for the tip... i was not aware that thingy even was around anymore.. Nice 😄

I can see sysWOW64\nvapi.dll and nvcuda.dll loaded, as well as nvcuda32.dll from the driverstore folder... so I found something interesting (again).. In the "driverstore" folder there are also two files called nvcuda_loader32.dll and nvcuda_loader64.dll.. the nvcuda_loader32.dll is exactly the same size as the nvcuda.dll that is in the c:\windows\syswow64 folder 😮 (same for the nvcuda_loader64.dll vs the nvcuda.dll in the system32 folder...) The nvcuda32.dll and nvcuda64.dll are a lot larger than these "loader" dll's.

So.. some trickery where the windows folder contains "renamed" loader libs to load the actual .dll's somehow?

The nvidia driver also contains libxxx.so.1 libs, which I imagine are for WSL perhaps? I have not really studied this in great detail, and don't think I will. Safe to say, windows does a lot more under the hood than I care to find out, but 32-bit and 64-bit seem to work fine without issues at least.

Saancreed commented 1 year ago

> So.. some trickery where the windows folder contains "renamed" loader libs to load the actual .dll's somehow?

I have no idea how driver installations in Windows actually work, though it's not hard to imagine that files distributed by the vendor carry some additional data that tells Windows where to install them (in some well-known directory, like System32) and under what canonical name (so, nvcuda.dll, because that's what consumer applications pass to LoadLibrary). It could just be a hard link to nvcuda_loader*.dll for all I know (except ntfs3 tells me it's not).

Now, why does NV do this through this intermediate loader, I'm not sure, but if file sizes are of any importance, apparently there's no such thing on Linux:

$ du -h /windows/Windows/System32/DriverStore/FileRepository/nvamig.inf_amd64_*/nvcuda*64.dll /windows/Windows/System32/nvcuda.dll /usr/lib/libcuda.so.535.54.03
25M     /windows/Windows/System32/DriverStore/FileRepository/nvamig.inf_amd64_2e325f1dc704cd7d/nvcuda64.dll
3.2M    /windows/Windows/System32/DriverStore/FileRepository/nvamig.inf_amd64_2e325f1dc704cd7d/nvcuda_loader64.dll
3.2M    /windows/Windows/System32/nvcuda.dll
28M     /usr/lib/libcuda.so.535.54.03
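Matching sizes alone do not prove that nvcuda.dll is a renamed copy of the loader. A small sketch of how one could tell the cases apart from a Linux mount of the Windows partition follows; the function names are made up for illustration, and nothing here has been run against the actual driver files.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large DLLs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_files(a, b):
    """Classify two paths as 'same file', 'identical copy' or 'different'."""
    # Same device and inode: a hard link (also true for symlinks to one file)
    if os.path.samefile(a, b):
        return "same file"
    # Separate files with equal size and equal hash: a byte-for-byte copy
    if os.path.getsize(a) == os.path.getsize(b) and sha256_of(a) == sha256_of(b):
        return "identical copy"
    return "different"
```

Calling `compare_files` on System32/nvcuda.dll and the DriverStore copy of nvcuda_loader64.dll would distinguish a hard link from a plain copy, or show that the two merely happen to have the same size.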

> The nvidia driver also contains libxxx.so.1 libs, which I imagine are for WSL perhaps?

Yup, I believe so.

Saancreed commented 1 year ago

That said, everything I've seen so far just confirms my belief that there is no 32-bit nvml.dll on Windows, so I can only guess what the authors of GPU Caps Viewer and gpushark were thinking while implementing features relying on something that doesn't exist.

SveSop commented 1 year ago

> That said, everything I've seen so far just confirms my belief that there is no 32-bit nvml.dll on Windows, so I can only guess what the authors of GPU Caps Viewer and gpushark were thinking while implementing features relying on something that doesn't exist.

Afaik GPU Caps Viewer and gpushark do not load/call nvml at all... Everything goes through nvapi.dll. The NVIDIA version of nvapi does not load nvml either - at least not directly, to my knowledge - as all functions in nvapi are handled internally as part of the driver. The reason we need to use nvml in dxvk-nvapi to grab various hardware info is that there is no other way to "talk to the driver" like that in Linux.

Testing wine-devel without nvapi, just adding nvml to the mix, does not make GPU Caps Viewer or gpushark do anything towards nvml. I don't really have an old enough standard version of wine-staging lying around I think (one that has the old implementation of nvapi), and my old hack statically linked nvml since that was like pre-wine-nvml.

But I am fairly sure the only reason we see anything related to nvml is that dxvk-nvapi uses it, and it is not really common for windows programs to use it at all, since nvapi does "everything" in that regard.

Saancreed commented 1 year ago

Actually that's not the case!

$ strings gxl_x32.dll | grep ^nvml
nvml_get_num_gpus
nvml_update
nvml_get_gpu_device_id
nvml_get_gpu_name
nvml_get_gpu_cuda_compute_capability
nvml_get_gpu_bus_id
nvml_get_gpu_temperature_thresholds
nvml_get_gpu_core_temperature
nvml_get_gpu_max_clock_speeds
nvml_get_gpu_fan_speed
nvml_get_gpu_current_power
nvml_get_gpu_current_clock_speeds
nvml_get_gpu_power_management_limits
nvml_get_gpu_enforced_power_limit
nvml_get_gpu_clocks_throttle_reason
nvml_get_gpu_pcie_throughput
nvml_update_process_utilization
nvml_get_gpu_utilization
nvml_get_gpu_process_info
nvml_get_gpu_process_count

$ strings gxcplugins/plugin_gxc_gpumon_x32.dll | grep ^nvml
nvml.dll
nvmlInit
nvmlInit_v2
nvmlShutdown
nvmlSystemGetNVMLVersion
nvmlSystemGetProcessName
nvmlDeviceGetCount
nvmlDeviceGetCount_v2
nvmlDeviceGetHandleByIndex
nvmlDeviceGetName
nvmlDeviceGetPciInfo
nvmlSystemGetDriverVersion
nvmlDeviceGetBrand
nvmlDeviceGetSerial
nvmlDeviceGetMemoryInfo
nvmlDeviceGetCudaComputeCapability
nvmlDeviceGetComputeRunningProcesses
nvmlDeviceGetGraphicsRunningProcesses
nvmlDeviceGetClockInfo
nvmlDeviceGetMaxClockInfo
nvmlDeviceGetClock
nvmlDeviceGetMaxCustomerBoostClock
nvmlDeviceGetFanSpeed
nvmlDeviceGetTemperature
nvmlDeviceGetTemperatureThreshold
nvmlDeviceGetPowerUsage
nvmlDeviceGetTotalEnergyConsumption
nvmlDeviceGetEnforcedPowerLimit
nvmlDeviceGetPowerManagementDefaultLimit
nvmlDeviceGetPowerManagementLimitConstraints
nvmlDeviceGetPowerManagementLimit
nvmlDeviceGetPowerState
nvmlDeviceGetPcieThroughput
nvmlDeviceGetCurrentClocksThrottleReasons
nvmlDeviceGetUtilizationRates
nvmlDeviceGetProcessUtilization

I'm pretty sure no version of dxvk-nvapi makes use of nvmlDeviceGetPower* functions and yet with WINEDEBUG=+nvml you will see them being called.
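For anyone without binutils at hand, the two `strings ... | grep ^nvml` commands above can be approximated in a few lines of Python. This is only a sketch: the function name is made up, and it treats just printable ASCII as string characters, which is roughly but not exactly what GNU strings does by default.

```python
import re


def ascii_strings(data, prefix=b"", min_len=4):
    """Yield printable ASCII runs of at least min_len bytes that start with prefix."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        run = match.group()
        # Equivalent of grep ^prefix: the run itself must begin with the prefix
        if run.startswith(prefix):
            yield run.decode("ascii")
```

Reading a DLL in binary mode and calling `list(ascii_strings(data, b"nvml"))` should give a list comparable to the ones shown above.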

SveSop commented 1 year ago

https://forums.developer.nvidia.com/t/nvml-lib-for-x86-platform/58298
https://forums.developer.nvidia.com/t/is-there-a-32bit-nvml-dll/37138

So.. no nvml.dll for 32-bit windows - mostly because you do not need it. GPU Caps Viewer & plugins have been around for a long time, so it is no huge surprise that they have remnants of old calls or try to load dlls that no longer exist, I guess. Since nobody runs a 32-bit os on relevant hardware that still gets drivers anymore, old calls are not likely to cause havoc on the platform it was written for (windows), and no windows users notice any issues with this.

Enter the crazy Linux ppl using Wine and all bets are off 🤣

SveSop commented 1 year ago

Maybe not related... but it very well could be: I am not able to run any of my OptiX samples from the SDK when using wine-8xx. I was not able to run it with wine-staging-7.22 either, but 7.2 and 7.5 runs fine.

I had been using an old Lutris 7.2 version on my DAZ Studio prefix (where I also had tested the OptiX samples), and thought I would do some testing on newer wine versions, so I created a new prefix for that with wine-8.11. There are no errors or anything that I am able to spot in either the wine, nvcuda or nvoptix logs tho, it just hangs without any graphical output and without finishing - indefinitely 😢

I'll see if I can dig up and recompile some more wine sources in between 7.5 and 7.22 and see if I can narrow down the search a bit, but I must admit I do not have a very good setup atm for bisecting.

SveSop commented 1 year ago

I do wonder how GPU Caps Viewer obtains my full 64-bit VRAM value tho... Running in windows i get 8GB vram on the same hardware that dxvk-nvapi only seem to report 3072 (that i assume it gets from dxvk.. in a 32-bit call). Not sure how it is done in windows tbh 🤔

jp7677 commented 1 year ago

> I do wonder how GPU Caps Viewer obtains my full 64-bit VRAM value tho... Running in windows i get 8GB vram on the same hardware that dxvk-nvapi only seem to report 3072 (that i assume it gets from dxvk.. in a 32-bit call). Not sure how it is done in windows tbh 🤔

That I can answer :), it has to do with https://github.com/jp7677/dxvk-nvapi/blob/master/src/sysinfo/nvapi_adapter.cpp#L190 . NVAPI can report a higher number than DXGI on 32bits since in NVAPI it’s KBytes whereas DXGI reports in Bytes. But since we take the value over from DXGI to honor any DXVK memory overrides, DXVK-NVAPI is thus limited to the clamped value that DXGI reports when being compiled for 32bits.
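The unit difference is easy to check with plain arithmetic. The sketch below assumes a 32-bit process where the DXGI memory fields are 32-bit SIZE_T values and the NVAPI fields are 32-bit integers counted in KBytes; it does not model the exact value DXVK clamps to, which is not derived here.

```python
UINT32_MAX = 2**32 - 1  # largest value a 32-bit unsigned field can hold


def fits_in_32_bits(value):
    """True if value can be stored in a 32-bit unsigned integer."""
    return 0 <= value <= UINT32_MAX


vram_bytes = 8 * 1024**3           # 8 GiB expressed in bytes (DXGI's unit)
vram_kbytes = vram_bytes // 1024   # the same amount in KBytes (NVAPI's unit)

print(fits_in_32_bits(vram_bytes))   # False
print(fits_in_32_bits(vram_kbytes))  # True
print(UINT32_MAX // 1024**2)         # 4095, the ceiling in MiB for a byte count
```

So a byte count in a 32-bit field can never describe more than just under 4 GiB, while a KByte count in the same field has room for terabytes.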

SveSop commented 1 year ago

> Maybe not related... but it very well could be: I am not able to run any of my OptiX samples from the SDK when using wine-8xx. I was not able to run it with wine-staging-7.22 either, but 7.2 and 7.5 runs fine.

Did some testing with the WineHQ provided packages for Ubuntu 22.04 that is running on my testbox, and the latest wine-staging version able to run the OptiX SDK samples without hanging is - wine-staging-7.12

Not able to figure out what or why it happens - it just hangs with no apparent error. Ofc running a +relay log in wine could be done i guess, but that seemed as usual to build gigs of text and grind everything to a frigging halt 😢

Oh.. and for me, running even the old wine-staging-7.x versions makes gpushark.exe (from the GPU Caps Viewer 1.60 pack) crash on startup when I do NOT use the nvml library. If I do, things seem to work, but GPU Caps Viewer tends to crash upon exiting for some strange reason...

Now, why would gpushark.exe crash when NOT using nvml? Disable nvapi, and there is no crash, so there IS something triggering nvml to get data. Could it be something in the likes of one of those gpumon dll's being statically linked with nvml when it was compiled or somecrap?

I mean.. all the CUDA SDK stuff is set up to statically link with CudaRuntimeAPI... and I would not be overly surprised if there is some lightweight "nvml runtime api" crap for developers going on here - but the likelihood of it being 32-bit starts to slim down, I would think. Once nvapi actually loads nvml.dll it "takes precedence" or something? The whole thing is strange, I think.

I see there are some new nvml calls in the newer versions of gxl_x32.dll vs. the one from the GPU Caps Viewer 1.56 package too (did not help to replace them tho), but that would kinda indicate that they still ADD nvml calls when updating the package... for 32-bit?! Why? Why would the new version of GPU Caps Viewer attempt to make nvml calls from a 32-bit .dll when there IS NO 32-bit nvml?

I am going to lose my mind over this shit one day i tells ya!! 👿

Saancreed commented 1 year ago

> Not able to figure out what or why it happens - it just hangs with no apparent error.

Do you know which call hangs? Try building nvcuda and friends with debug symbols, attach gdb or winedbg then interrupt the process once it hangs and inspect the stack trace, maybe that will tell you something.

For what it's worth, your old Optix samples targeting ABI 55 don't hang for me with wine-tkg-staging-protonified-8.9.1, instead if I start them with no arguments they error out beginning with

ERROR: C:\Users\ssopl\source\repos\OptiX_Apps\apps\intro_driver\src\Application.cpp(2175): cuGraphicsGLRegisterBuffer(&m_cudaGraphicsResource, m_pbo, CU_GRAPHICS_REGISTER_FLAGS_NONE) failed with CUDA_ERROR_UNKNOWN (999)
ERROR: C:\Users\ssopl\source\repos\OptiX_Apps\apps\intro_driver\src\Application.cpp(2179): cuGraphicsMapResources(1, &m_cudaGraphicsResource, m_cudaStream) failed with CUDA_ERROR_INVALID_HANDLE (400)
ERROR: C:\Users\ssopl\source\repos\OptiX_Apps\apps\intro_driver\src\Application.cpp(2180): cuGraphicsResourceGetMappedPointer(reinterpret_cast<CUdeviceptr*>(&m_systemParameter.outputBuffer), &size, m_cudaGraphicsResource) failed with CUDA_ERROR_INVALID_HANDLE (400)
ERROR: C:\Users\ssopl\source\repos\OptiX_Apps\apps\intro_driver\src\Application.cpp(2181): cuGraphicsUnmapResources(1, &m_cudaGraphicsResource, m_cudaStream) failed with CUDA_ERROR_INVALID_HANDLE (400)

… and going downhill from there. Because this looks like some kind of OGL/CUDA interop, I retried with --nopbo and then they started working. I think it's probably something Wine changed about how they expose GL resources to Windows applications, perhaps due to PE conversion, and now we'd have to unwrap them before passing them to native NV driver ¯\_(ツ)_/¯

That said, with that many random issues popping up recently and OptiX 8.0 just around the corner (or at least https://developer.nvidia.com/ teasing it in the Recently Updated section, even though the link leads to the OptiX 7 page 🙃), I'm not looking forward to seeing what more interesting showstoppers await us.

> Now, why would gpushark.exe crash when NOT using nvml? Disable nvapi, and there is no crash, so there IS something triggering nvml to get data. Could it be something in the likes of one of those gpumon dll's being statically linked with nvml when it was compiled or somecrap?
>
> I mean.. all the CUDA SDK stuff is set up to statically link with CudaRuntimeAPI... and I would not be overly surprised if there is some lightweight "nvml runtime api" crap for developers going on here - but the likelihood of it being 32-bit starts to slim down, I would think.

Even if there was, it still has to gracefully handle situations where nvml.dll is not present. As far as I know, the Windows version of NVML SDK only has nvml.lib that tells the linker about functions exported by nvml.dll because apparently Windows linkers can't use .dll when building like Linux linkers use .so for linking both at build time and at run time. But if you were to link to that instead of calling LoadLibrary, it should refuse to launch on systems that don't have it. Feel free to try though.
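The runtime-loading pattern described here, where a missing library is handled instead of preventing launch, can be illustrated with Python's ctypes, which wraps LoadLibrary on Windows and dlopen on Linux. It is only an analogy for what a well-behaved C application would do; the function name is made up, and the candidate names are simply the usual NVML file names on each platform.

```python
import ctypes


def load_first_available(names):
    """Try each library name in turn; return (name, handle) or (None, None)."""
    for name in names:
        try:
            return name, ctypes.CDLL(name)
        except OSError:
            continue  # not installed, or wrong bitness for this process
    return None, None


# "nvml.dll" on Windows, "libnvidia-ml.so.1" on Linux
name, nvml = load_first_available(["nvml.dll", "libnvidia-ml.so.1"])
if nvml is None:
    print("NVML not available, continuing without hardware monitoring")
else:
    print(f"loaded {name}")
```

An application linked against the import library instead has no such fallback: the loader resolves nvml.dll before any application code runs, so a missing DLL means the process never starts.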

SveSop commented 1 year ago

There is a strange issue - at least for me - when starting Ubuntu and then running OptiX or CUDA apps directly from Wine. It just fails with something similar to "CUDA_ERROR_INVALID_HANDLE".. However, starting a native CUDA or OptiX app from the SDK samples first makes the wine ones work fine (well.. save for the discussion about versions ofc).

So, what i had to do was to fiddle with some UDEV loading for Ubuntu to actually get nvidia driver to create a device... then it "just works (tm)" when booting up. Just so it is not one of those snags you hit..

The issue with the OptiX SDK samples is that once they are compiled they HAVE to be in the directory they were compiled in, since the source hard-links loads of images and whatever. There are probably ways around that, but the explanations I found were way too convoluted for me since I am by no means a "Visual Studio guy" 😞 I do have pre-compiled samples for SDK 7.0.0, 7.3.0, 7.4.0, 7.5.0, 7.6.0 and 7.7.0. They HAVE to be placed in the C:\ProgramData\NVIDIA Corporation\OptiX SDK 7.x.x\ folder inside the wineprefix you use, and run from the subfolder \build\bin\Release to work.

You don't need all the SDKs tho; if you are interested I could zip up SDK 7.7.0? Although working, they are compiled with CUDA 11.5 instead of the 12.0 I think they should have been compiled with, because of the CUDA Runtime API issue.

Saancreed commented 1 year ago

> So, what i had to do was to fiddle with some UDEV loading for Ubuntu to actually get nvidia driver to create a device... then it "just works (tm)" when booting up. Just so it is not one of those snags you hit..

Nope, on Arch relevant udev rules are already included, both in repo packages and in TKG's nvidia-all which is the flavor I'm using myself.

> You don't need all the SDKs tho; if you are interested I could zip up SDK 7.7.0? Although working, they are compiled with CUDA 11.5 instead of the 12.0 I think they should have been compiled with, because of the CUDA Runtime API issue.

Sure, I could give it a try, but can we please move this conversation to a more appropriate repo?

SveSop commented 1 year ago

> Even if there was, it still has to gracefully handle situations where nvml.dll is not present. As far as I know, the Windows version of NVML SDK only has nvml.lib that tells the linker about functions exported by nvml.dll because apparently Windows linkers can't use .dll when building like Linux linkers use .so for linking both at build time and at run time. But if you were to link to that instead of calling LoadLibrary, it should refuse to launch on systems that don't have it. Feel free to try though.

I am not 100% up to date on how all this linker business works, but yeah, you do not link against the .dll directly the way you link against a .so on Linux. How this is done when i deal with this CUDA stuff i do not know, because i can choose to have it load cudart64_100.dll or whatever it's called WITHOUT using LoadLibrary in my sample.. or if i do not disable this "static link" stuff, it will be statically compiled in and locked down to whatever version the sample is at.

Is there such a thing that would "loadlibrary" a .dll if it's found, but use it statically linked if not? I have no clue, but that seems very odd to me tho. If those "gpumon" thingys in GPU Caps Viewer had been statically linked i would expect it to fail 100% of all times since the calls would not be working AT ALL when running under wine with absolutely NO linkage to the real Linux .so lib. So.. What is failing? Making a call to a .dll that is not loaded should not be dangerous at all? And since there is no such thing as a 32-bit version of nvml.dll on Windows, why is it not crashing left and right there?

I think there are some fishy things going on with how wine is handling something here, and it could very well be related to the previous find you had above - even when nvml.dll is NOT present. (PE libs and all new stuff)

Saancreed commented 1 year ago

> Is there such a thing that would "loadlibrary" a .dll if it's found, but use it statically linked if not?

Certainly not a full implementation, but OptiX does this in a very limited manner: it provides built-in implementations of optixGetErrorName and optixGetErrorString that are supposed to help with debugging (funnily enough) library loading issues: https://raytracing-docs.nvidia.com/optix7/api/optix__stubs_8h_source.html
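
A rough sketch of that "use the library's function if it resolved, otherwise fall back to a built-in one" idea, in plain C. Everything here (the names, the error codes, the resolver callback standing in for GetProcAddress/dlsym) is hypothetical and made up for illustration; it is not how optix_stubs.h is actually written.

```c
#include <stddef.h>

/* Hypothetical "error code -> name" function type. */
typedef const char *(*error_name_fn)(int code);

/* Generic function pointer, as a symbol lookup would hand back. */
typedef void (*generic_fn)(void);

/* Stand-in for GetProcAddress/dlsym; returns NULL if the symbol is missing. */
typedef generic_fn (*symbol_resolver)(const char *name);

/* Built-in fallback compiled into the application itself. */
const char *builtin_error_name(int code)
{
    switch (code) {
    case 0:  return "SUCCESS";
    case 1:  return "LIBRARY_NOT_FOUND";
    default: return "UNKNOWN";
    }
}

/* Pretend implementation living inside the real library. */
const char *library_error_name(int code)
{
    (void)code;
    return "FROM_LIBRARY";
}

/* Resolver that behaves as if the library was never loaded. */
generic_fn resolve_nothing(const char *name)
{
    (void)name;
    return NULL;
}

/* Resolver that behaves as if the library was loaded fine. */
generic_fn resolve_fake(const char *name)
{
    (void)name;
    return (generic_fn)library_error_name;
}

/* Prefer the library's function when it resolved, else use the built-in. */
error_name_fn pick_error_name(symbol_resolver resolve)
{
    generic_fn sym = resolve ? resolve("getErrorName") : NULL;
    return sym ? (error_name_fn)sym : builtin_error_name;
}
```

With resolve_nothing the caller still gets a usable function back, which is the whole point: error reporting keeps working even when the library itself failed to load.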

It's possible that there is some extra magic included in nvapi.lib when linking using the usual SDK on Windows but I can't tell. But because NVML library version must match the driver version to be usable at all, I don't think we are at risk of that here.

> If those "gpumon" thingys in GPU Caps Viewer had been statically linked i would expect it to fail 100% of all times since the calls would not be working AT ALL when running under wine with absolutely NO linkage to the real Linux .so lib.

If I were to guess, NVML talks to the driver using something like /dev/nvidia0 on Linux and probably something like D3DKMTQueryAdapterInfo on Windows, the latter probably being a nightmare to fully implement in Wine in the manner expected by Nvidia's own library. If some theoretical static implementation were to try that, I can see why it would explode.

I'm just hoping that gpumon doesn't go like if (lib = LoadLibrary("nvml.dll")) { /* use nvml.dll */ } else { /* use woefully undocumented and reverse engineered private Nvidia interfaces as a replacement that happens to still work on Windows */ } :sweat_smile:
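
For reference, the harmless half of that pattern (probe for an optional library at run time and carry on without it if it is missing) could look roughly like the sketch below. The function name optional_symbol_available is made up for illustration, and nothing here is taken from gpumon or GPU Caps Viewer; it only shows the LoadLibrary/GetProcAddress (or dlopen/dlsym) probing step.

```c
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <dlfcn.h>
#endif

/* Returns 1 if lib_name can be loaded and exports symbol, 0 otherwise.
 * The library is unloaded again before returning; a real caller would
 * keep the handle around for as long as it uses the function. */
int optional_symbol_available(const char *lib_name, const char *symbol)
{
    int found;
#ifdef _WIN32
    HMODULE lib = LoadLibraryA(lib_name);
    if (!lib)
        return 0;               /* library missing: degrade gracefully */
    found = GetProcAddress(lib, symbol) != NULL;
    FreeLibrary(lib);
#else
    void *lib = dlopen(lib_name, RTLD_NOW | RTLD_LOCAL);
    if (!lib)
        return 0;               /* library missing: degrade gracefully */
    found = dlsym(lib, symbol) != NULL;
    dlclose(lib);
#endif
    return found;
}
```

An application would call something like optional_symbol_available("nvml.dll", "nvmlInit_v2") once at startup and simply hide the NVML-backed features when it returns 0. On older glibc versions the non-Windows branch needs -ldl at link time.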

> So.. What is failing? Making a call to a .dll that is not loaded should not be dangerous at all?

Not any more dangerous than it is on Windows.

SveSop commented 1 year ago

> I'm just hoping that gpumon doesn't go like if (lib = LoadLibrary("nvml.dll")) { /* use nvml.dll */ } else { /* use woefully undocumented and reverse engineered private Nvidia interfaces as a replacement that happens to still work on Windows */ } 😅

🤣 Indeed... Still can't really explain WHY the devs would even consider implementing that for the 32-bit lib.. Unless this "gpumon" thingy SDK whatnot is easily available and used as 64-bit in other projects, and it's only GPU Caps Viewer that compiles and uses the 32-bit ones from the same source. Kinda like "Hey, we wanna upgrade gpumon to newest version.. give us a copy" and wham.. they get a 32-bit one since GPU Caps Viewer is 32-bit. Could be as simple as that.

SveSop commented 1 year ago

I think this is a bit more complicated than at first glance, and MAY have to do with dxvk-nvapi as well... There is a bug with > wine-staging-8.9 where if you enable "Virtual Desktop", it will not work: https://bugs.winehq.org/show_bug.cgi?id=55085

Now, the patch posted there "fixes" being able to turn it on/off again, but if i run GPU Caps Viewer in a prefix with DXVK/DXVK-NVAPI and NVML + start the Vulkan "Geomechanical" 3D demo, it will crash... however GPU Caps Viewer 1.56 will NOT for some reason.

I also have a hard time running these 3D Demos without using DXVK-NVAPI, so i guess nvapi does some sort of magic behind the scenes here. DXVK reports some issues with trying to find monitors when "Virtual Desktop" is enabled for some reason, but this seems to work fine when it's not enabled (after the revert from your earlier bug report).

So.. a NEW issue then i guess, but it could perhaps be connected? Some sort of change with GPU Caps Viewer and the updates that somehow throw DXVK-NVAPI off in reporting something, and THEN it all crumbles from there? (Default wine-staging vulkan implementation did not seem to be able to run the more advanced vulkan Demos with winevulkan alone tho...)

Sheesh.. need to bug @jp7677 about this issue too now perhaps 😏

SveSop commented 1 year ago

Just a quick FYI.. when i use my old nvapi implementation with libnvidia-ml.so statically linked, wine does not crash... But i guess that is not horribly surprising perhaps...

Saancreed commented 1 year ago

> But i guess that is not horribly surprising perhaps...

If wine-nvml is still available in the prefix and it's only the behavior of nvapi that changes then yes, I'd say it is a bit surprising. To my understanding, the application first calls nvapi and if the results it gets from nvapi calls match what it expects, it then proceeds to call nvml and only after a few nvml calls does it crash and burn. If replacing the nvapi implementation with one that does not use nvml.dll changes anything, I believe it could be due to the application getting unexpected results from nvapi calls and therefore never trying to load and call nvml.dll by itself… unless it does still actually load and call nvml.dll but this time successfully, which would be strange indeed, and would imply that dxvk-nvapi's nvml usage somehow corrupts the state.

I wonder if it would explode if we used a modified version of dxvk-nvapi that never attempted to load nvml.dll and just pretended that nvml is not available :thinking:

SveSop commented 1 year ago

I did some testing.. hopefully able to make it somewhat coherent. Tests used default wine-staging-8.13 (Distro compiled with no additional patches or reverts). All tests had DXVK installed in the prefix.

1 - Nothing added: GPU Caps Viewer and gpushark - OK (No useful nvidia data ofc)

2 - With dxvk-nvapi: GPU Caps Viewer - OK, gpushark - Crash

3 - With dxvk-nvapi + wine-nvml: GPU Caps Viewer - Crash, gpushark - OK

4 - Modified dxvk-nvapi to load nvml_123.dll instead (that does not exist ofc) + keep nvml in the prefix: GPU Caps Viewer - Crash, gpushark - Crash

5 - As test 4, just removed nvml from wine + prefix: Same result as test 2

6 - Used old nvapi implementation with no nvml in prefix: GPU Caps Viewer - OK, gpushark - OK (libnvidia-ml.so.1 is clearly used as it gets temps++)

7 - Used old nvapi implementation + adding nvml to wine/prefix: GPU Caps Viewer - Crash, gpushark - Crash

I also did a test 8 without nvapi, but WITH nvml in the prefix/wine. This behaved just like test 1, and did NOT seem to load nvml.. So, if nvapi is not there - indicates no NVIDIA adapter present - nvml is not loaded.

SveSop commented 1 year ago

Hmpf... Found something interesting! Back when i was fiddling with the old nvapi implementation, i made a "lite" variant with fewer calls, kinda something i thought could be needed for gaming and not "full on proof of concept".. And guess what? Yeah.. running THAT and starting gpushark also crashes like DXVK-NVAPI 😮

So... IF gpushark does not get data from some call(s), bad shit happens it seems. Adding nvml to the mix will make BOTH gpushark and GPU Caps Viewer crash just like before with the "full" old nvapi version.

So.. about this "heap" business... Does loading .so lib somewhere in this mix need "more heap space" somehow? (Since that revert seems to fix shit).

Saancreed commented 1 year ago

> So.. about this "heap" business... Does loading .so lib somewhere in this mix need "more heap space" somehow? (Since that revert seems to fix shit).

I have absolutely no idea why that commit breaks anything. One theory was that there is some memory corruption happening somewhere but debugging this sounds to me like a nightmare. Which probably means that until someone like pgofman takes a look at this, I'd be happy to forget that 32-bit nvml.dll ever existed.

SveSop commented 1 year ago

I changed only this, and it seemed to fix it..

diff --git a/dlls/ntdll/heap.c b/dlls/ntdll/heap.c
index 2ca93ec..51088eb 100644
--- a/dlls/ntdll/heap.c
+++ b/dlls/ntdll/heap.c
@@ -310,7 +310,7 @@ C_ASSERT( offsetof(struct heap, subheap) <= REGION_ALIGN - 1 );

 #define HEAP_MAGIC       ((DWORD)('H' | ('E'<<8) | ('A'<<16) | ('P'<<24)))

-#define HEAP_INITIAL_SIZE      0x10000
+#define HEAP_INITIAL_SIZE      0x100000
 #define HEAP_INITIAL_GROW_SIZE 0x100000
 #define HEAP_MAX_GROW_SIZE     0xfd0000

If i used HEAP_INITIAL_SIZE 0x80000 it still crashed... No point experimenting with decimals i guess...

Any thoughts on how to test for horrible effects from using this? It does seem as if the previous version of this used a MUCH larger heap tho, as it seemed to be HEAP_DEF_SIZE (0x40000 * BLOCK_ALIGN) and from what i gather BLOCK_ALIGN is 2x sizeof(void *) .. a void * being 4 bytes (for 32 bit), so 0x40000 x 8? That kinda seems A LOT more than the 0x10000 in the new patch, but i don't really know how many times it can "grow" before it's happy...
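
To put rough numbers on that comparison, here is a small sketch of the arithmetic. It assumes the figures quoted above are right (HEAP_DEF_SIZE as 0x40000 * BLOCK_ALIGN, BLOCK_ALIGN as two pointers); they were not re-checked against the Wine source, and the helper names are made up for illustration.

```c
#include <stdio.h>

/* Values as quoted in this thread, not re-verified against the Wine tree. */
#define OLD_HEAP_DEF_BLOCKS   0x40000ul   /* multiplied by BLOCK_ALIGN */
#define NEW_HEAP_INITIAL_SIZE 0x10000ul   /* value before the diff above */
#define PATCHED_INITIAL_SIZE  0x100000ul  /* value after the diff above */

/* BLOCK_ALIGN as described above: two pointers' worth of bytes. */
unsigned long block_align(unsigned long pointer_size)
{
    return 2 * pointer_size;
}

/* Old default heap size in bytes for a given pointer size. */
unsigned long old_heap_def_size(unsigned long pointer_size)
{
    return OLD_HEAP_DEF_BLOCKS * block_align(pointer_size);
}

/* Print the three sizes for a 32-bit process (4-byte pointers). */
void print_heap_sizes(void)
{
    printf("old default : %lu KiB\n", old_heap_def_size(4) / 1024);
    printf("new initial : %lu KiB\n", NEW_HEAP_INITIAL_SIZE / 1024);
    printf("patched     : %lu KiB\n", PATCHED_INITIAL_SIZE / 1024);
}
```

On those assumptions a 32-bit process goes from a 2048 KiB default to a 64 KiB initial size, and the 0x100000 from the diff lands at 1024 KiB. This only compares starting sizes; the new heap can still grow afterwards.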

Saancreed commented 1 year ago

Nope, I'm afraid this entered areas of Wine where I'm unable to help anyone, least of all myself, long ago now :sweat_smile:

SveSop commented 1 year ago

I was not aware.. but there is such a thing called "Gpushark2" available here: https://www.geeks3d.com/dl/show/704

This has both x32 and x64 versions in the archive... At least it should be the same source/functions/calls/whatever as a comparison between them?

Saancreed commented 1 year ago

Okay, so I had a moment to take a look at this and… unsurprisingly, 64-bit version works and 32-bit version crashes and burns, but I was unable to isolate any particular call that causes it:

This is just sad.

Saancreed commented 1 year ago

I have a workaround on this branch but to be honest, this is less than ideal and I don't like this.