NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Invalid pointer free #585

Open BlueGoliath opened 7 months ago

BlueGoliath commented 7 months ago

NVIDIA Open GPU Kernel Modules Version

545.29.06

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

Operating System and Version

Arch Linux

Kernel Release

6.6.7-zen1-1-zen

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

Hardware: GPU

RTX 4060

Describe the bug

When attempting to run a JavaFX/NVML-based Nvidia GPU information/monitoring/overclocking utility that I wrote, the application crashes with an invalid pointer free error. This does not happen with the proprietary driver.

Playing around I noticed that:

  1. A normal JavaFX application runs just fine.
  2. Other software based on NVML (nvidia-smi) seems to run just fine.
  3. If I replace the app contents with some blank filler content (e.g. a Button), the app will show for a few seconds and then crash with the above error.

My app is not publicly available for Linux. A build can be provided so long as it is kept confidential. I'm willing to assist in any debugging by modifying code as needed.

To Reproduce

Run the application either from IDE or via the included shell script.

Bug Incidence

Always

aritger commented 6 months ago

Thank you for the bug report. I'm not familiar enough with JavaFX development to know how, but is it possible to get a backtrace or similar at the point of the invalid pointer free?

My guess is that the crash is somewhere inside NVML, presumably perturbed by some difference in what APIs are provided to NVML by the closed kernel modules versus the open kernel modules.

But, without a reproduction or backtrace, it will be difficult to resolve. Is this version:

"If I replace the app contents with some blank filler content(e.g. a Button), the app will show for a few seconds and then crash with the above error."

something that could be shared?

If you prefer, please email linux-bugs@nvidia.com, reference this Issue Tracker, and we can work out how to share a reproduction case.

BlueGoliath commented 6 months ago

Normally a core dump would be written to the CWD, but for whatever reason that isn't happening in this case.
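
One possible explanation, offered as an assumption rather than a diagnosis: on systemd-based distributions such as Arch Linux, `kernel.core_pattern` is often set to pipe core dumps to `systemd-coredump`, in which case nothing is written to the CWD and the dump may be retrievable with `coredumpctl` instead. The Python sketch below shows one way to check where core dumps are routed on a given machine. The helper names `describe_core_pattern` and `core_dumps_enabled` are hypothetical and exist only for this illustration.

```python
import resource


def describe_core_pattern(pattern):
    """Classify a kernel.core_pattern value (hypothetical helper)."""
    pattern = pattern.strip()
    if pattern.startswith("|"):
        # A leading '|' means the kernel pipes the dump to a program.
        parts = pattern[1:].split()
        handler = parts[0] if parts else "(unknown)"
        return f"piped to {handler}"
    if pattern.startswith("/"):
        return "written to an absolute path"
    return "written relative to the crashing process's CWD"


def core_dumps_enabled():
    """Return False when the soft RLIMIT_CORE is 0 (no core files written)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft != 0


if __name__ == "__main__":
    try:
        with open("/proc/sys/kernel/core_pattern") as f:
            print("core_pattern:", describe_core_pattern(f.read()))
    except OSError as exc:
        print("could not read core_pattern:", exc)
    print("RLIMIT_CORE allows core files:", core_dumps_enabled())
```

If the pattern turns out to be piped, the limit reported here applies to the shell running the script, not necessarily to the JVM process, so it is worth checking from the same shell script that launches the application.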

After disabling multi-threading (worker threads = 1), I modified the update code to print each attribute name and its return value, which prints:

Driver Version : NVML_SUCCESS
CUDA Version : NVML_SUCCESS
NVML Version : NVML_SUCCESS
Name : NVML_SUCCESS
Brand : NVML_SUCCESS
PCIe Meta : NVML_SUCCESS
UUID : NVML_SUCCESS
VBIOS Version : NVML_SUCCESS
Graphics Clock Customer Max : NVML_ERROR_NOT_SUPPORTED
SM Clock Customer Max : NVML_ERROR_NOT_SUPPORTED
Memory Clock Customer Max : NVML_ERROR_NOT_SUPPORTED
Video Clock Customer Max : NVML_ERROR_NOT_SUPPORTED
Performance Levels : NVML_SUCCESS
Memory Bus Width : NVML_SUCCESS
Slowdown Temperature : NVML_SUCCESS
Shutdown Temperature : NVML_SUCCESS
Die Temperature Max : NVML_SUCCESS
Memory Temperature Max : NVML_ERROR_NOT_SUPPORTED
Fan Target Min/Max Meta : NVML_SUCCESS
Power Limit Default : NVML_SUCCESS
Power Limit Meta : NVML_SUCCESS
Temperature Limit Min : NVML_SUCCESS
Temperature Limit Max : NVML_SUCCESS
Power Limit Default : NVML_SUCCESS
Power Limit Min : NVML_SUCCESS
Power Limit Max : NVML_SUCCESS
Power Limit Default : NVML_ERROR_NOT_SUPPORTED
Power Limit Min : NVML_ERROR_NOT_SUPPORTED
Power Limit Max : NVML_ERROR_NOT_SUPPORTED
Power Limit Default : NVML_ERROR_INVALID_ARGUMENT
Power Limit Min : NVML_ERROR_INVALID_ARGUMENT
Power Limit Max : NVML_ERROR_INVALID_ARGUMENT
Graphics Clock VF Meta : NVML_SUCCESS
Memory Clock VF Meta : NVML_SUCCESS
PCIe Link Speed Max : NVML_SUCCESS
Inforom Part Number : NVML_SUCCESS
Inforom ECC Version : NVML_ERROR_NOT_SUPPORTED
Inforom Image Version : NVML_SUCCESS
Inforom OEM Version : NVML_SUCCESS
Inforom Power Version : NVML_ERROR_NOT_SUPPORTED
Inforom Checksum : NVML_SUCCESS
GSP Version : NVML_SUCCESS
GSP Enabled Status : NVML_SUCCESS
Accounting Mode Buffer Size : NVML_SUCCESS
Compute Capability Meta : NVML_SUCCESS
ECC Mode Default : NVML_ERROR_NOT_SUPPORTED
CUDA Cores : NVML_SUCCESS
IRQ Number : NVML_SUCCESS
Utilization Meta : NVML_SUCCESS
Video Encoder Utilization Sampling Period : NVML_SUCCESS
Video Decoder Utilization Meta : NVML_SUCCESS
OFA Engine Utilization Sampling Period : NVML_SUCCESS
JPG Engine Utilization Sampling Period : NVML_SUCCESS
Graphics Clock : NVML_SUCCESS
SM Clock : NVML_SUCCESS
Memory Clock : NVML_SUCCESS
Video Clock : NVML_SUCCESS
Clock Max : NVML_SUCCESS
Clock Max : NVML_SUCCESS
Clock Max : NVML_SUCCESS
Clock Max : NVML_SUCCESS
Graphics Application Clock Default : NVML_ERROR_NOT_SUPPORTED
Memory Application Clock Default : NVML_ERROR_NOT_SUPPORTED
Graphics Application Clock Target : NVML_ERROR_NOT_SUPPORTED
Memory Application Clock Target : NVML_ERROR_NOT_SUPPORTED
Graphics Clock VF Offset : NVML_SUCCESS
Memory Clock VF Offset : NVML_SUCCESS
Memory Meta V2 : NVML_SUCCESS
BAR1 Memory Meta : NVML_SUCCESS
Video Encoder Meta : NVML_SUCCESS
Video Encoder H264 Capacity : NVML_SUCCESS
Video Encoder HEVC Capacity : NVML_SUCCESS
Video Encoder AV1 Capacity : NVML_SUCCESS
Performance State : NVML_SUCCESS
Dynamic PState Info Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance State Min/Max Clocks Meta : NVML_SUCCESS
Performance Limit Meta : NVML_SUCCESS
Power Performance Policy Violation Time : NVML_SUCCESS
Thermal Performance Policy Violation Time : NVML_SUCCESS
Sync Boost Performance Policy Violation Time : NVML_SUCCESS
Board Limit Performance Policy Violation Time : NVML_SUCCESS
Low Utilization Performance Policy Violation Time : NVML_SUCCESS
Reliability Performance Policy Violation Time : NVML_SUCCESS
App Clocks Performance Policy Violation Time : NVML_SUCCESS
Base Clocks Performance Policy Violation Time : NVML_SUCCESS
Power Source : NVML_SUCCESS
Power Draw : NVML_ERROR_NOT_SUPPORTED
Power Draw Average : NVML_ERROR_NOT_SUPPORTED
Power Draw Total : NVML_ERROR_NOT_SUPPORTED
Power Limit : NVML_SUCCESS
Power Limit Requested : NVML_SUCCESS
Power Draw : NVML_ERROR_NOT_SUPPORTED
Power Draw Average : NVML_ERROR_NOT_SUPPORTED
Power Draw Total : NVML_ERROR_NOT_SUPPORTED
Power Limit : NVML_ERROR_NOT_SUPPORTED
Power Limit Requested : NVML_ERROR_NOT_SUPPORTED
Power Draw : NVML_ERROR_NOT_SUPPORTED
Power Draw Average : NVML_ERROR_NOT_SUPPORTED
Power Draw Total : NVML_ERROR_NOT_SUPPORTED
Power Limit : NVML_ERROR_INVALID_ARGUMENT
Power Limit Requested : NVML_ERROR_INVALID_ARGUMENT
Power Draw Total : NVML_ERROR_NOT_SUPPORTED
Die Temperature : NVML_SUCCESS
Acoustic Threshold : NVML_SUCCESS
Memory Temperature : NVML_ERROR_NOT_SUPPORTED
Thermal Settings Meta : NVML_SUCCESS
Speed Target : NVML_SUCCESS
Speed Target : NVML_SUCCESS
Speed : NVML_SUCCESS
Speed : NVML_SUCCESS
Control Policy : NVML_SUCCESS
Control Policy : NVML_SUCCESS
FBC Stats Meta : NVML_SUCCESS
FBC Sessions : NVML_SUCCESS
PCIe Replay Counter : NVML_SUCCESS
PCIe Replay Rollover Counter : NVML_SUCCESS
PCIe L0 To Recovery Counter : NVML_SUCCESS
PCIe Correctable Errors Counter : NVML_SUCCESS
PCIe NAKS Received Counter : NVML_SUCCESS
PCIe Receiver Error Counter : NVML_SUCCESS
PCIe Bad TLP Counter : NVML_SUCCESS
PCIe NAKS Sent Counter : NVML_SUCCESS
PCIe Bad DLLP Counter : NVML_SUCCESS
PCIe Non-Fatal Error Counter : NVML_SUCCESS
PCIe Fatal Error Counter : NVML_SUCCESS
PCIe Unsupported Req Counter : NVML_SUCCESS
PCIe LCRC Error Counter : NVML_SUCCESS
PCIe Lane Error Counter : NVML_SUCCESS
Display Active : NVML_SUCCESS
Display Mode : NVML_SUCCESS
Persistence Mode : NVML_SUCCESS
Accounting Mode : NVML_SUCCESS
Accounting Stats : NVML_SUCCESS
Adaptive Clock : NVML_SUCCESS
Compute Mode : NVML_SUCCESS
GPU Operation Mode : NVML_ERROR_NOT_SUPPORTED
Driver Model : NVML_ERROR_NOT_SUPPORTED
ECC Mode : NVML_ERROR_NOT_SUPPORTED
GPU Processes : NVML_SUCCESS
Encoder Sessions : NVML_SUCCESS

(Note: there are duplicate attributes because my abstraction layer separates contextual information in some cases.)

This looks normal. Encoder sessions is the last thing that gets updated. I'm not sure what else to check.
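
For comparing a log like the one above between the open and proprietary kernel modules, a small tally of return values per status can make differences easier to spot. This is a minimal Python sketch, assuming the `Name : STATUS` line format shown above. The function name `summarize` and the shortened `SAMPLE` text are illustrative only and are not part of the application.

```python
from collections import Counter, defaultdict


def summarize(log_text):
    """Count NVML return values and group attribute names by status."""
    counts = Counter()
    by_status = defaultdict(list)
    for line in log_text.splitlines():
        # Split on the last " : " so attribute names stay intact.
        name, sep, status = line.rpartition(" : ")
        if not sep:
            continue  # skip blank or non-matching lines
        name, status = name.strip(), status.strip()
        counts[status] += 1
        if name not in by_status[status]:
            by_status[status].append(name)
    return counts, by_status


# A shortened excerpt of the log above, used only as sample input.
SAMPLE = """\
Driver Version : NVML_SUCCESS
Graphics Clock Customer Max : NVML_ERROR_NOT_SUPPORTED
Power Limit Default : NVML_SUCCESS
Power Limit Default : NVML_ERROR_NOT_SUPPORTED
Power Limit Default : NVML_ERROR_INVALID_ARGUMENT
Encoder Sessions : NVML_SUCCESS
"""

if __name__ == "__main__":
    counts, by_status = summarize(SAMPLE)
    for status, n in counts.most_common():
        print(f"{status}: {n}")
        for name in by_status[status]:
            print(f"  {name}")
```

Running the same tally on output captured under both drivers and diffing the results would show which attributes, if any, return a different status under the open kernel modules.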

BlueGoliath commented 4 months ago

This still isn't fixed.

BlueGoliath commented 2 months ago

A version has been given.