NVIDIA / open-gpu-kernel-modules

NVIDIA Linux open GPU kernel module source

Fix potential race condition in _rmapiRmControl #656

Open Cwndmiao opened 3 months ago

Cwndmiao commented 3 months ago

diff --git a/src/nvidia/src/kernel/rmapi/control.c b/src/nvidia/src/kernel/rmapi/control.c
index 0ed2e1e7..140386ba 100644
--- a/src/nvidia/src/kernel/rmapi/control.c
+++ b/src/nvidia/src/kernel/rmapi/control.c
@@ -427,13 +427,6 @@ _rmapiRmControl(NvHandle hClient, NvHandle hObject, NvU32 cmd, NvP64 pUserParams
 }
 }

The serverutilGetClientUnderLock() call in _rmapiRmControl() is useless, and it may cause a race condition if another thread is creating or destroying an RmClient at the same time.


mtijanic commented 3 months ago

Thanks for pointing this out! While not entirely useless (it validates hClient), that call is quite unfortunate, both for its race condition/crash potential and because it performs a somewhat expensive lookup only to discard the result.

I'm not sure simply removing it is the right way to go, as then apps can poke around the state quite a bit without allocating the client (e.g. you could get all the cached data).

So while this path definitely needs to be refactored, I'm gonna have to dwell on this for a bit and get back to you in a few days. Thanks again!
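A minimal sketch of the trade-off described above, using a toy client table rather than the driver's actual data structures. Every name here (`Handle`, `Status`, `client_lookup`, `control_with_validation`, `control_without_validation`) is hypothetical and chosen for illustration only; the real `serverutilGetClientUnderLock()` and `_rmapiRmControl()` are not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the driver's real types or status codes. */
typedef uint32_t Handle;
typedef enum { STATUS_OK = 0, STATUS_INVALID_CLIENT = 1 } Status;

typedef struct {
    Handle handle;
    int    in_use;
} Client;

#define MAX_CLIENTS 4
static Client g_clients[MAX_CLIENTS];

/* Register a client in a given slot of the toy table. */
static void client_alloc(size_t slot, Handle h)
{
    assert(slot < MAX_CLIENTS);
    g_clients[slot].handle = h;
    g_clients[slot].in_use = 1;
}

/* Linear lookup that returns a pointer into the table. If the caller
 * does not hold a lock, another thread could free the slot while the
 * pointer is still live, which is the kind of race the PR is about. */
static Client *client_lookup(Handle h)
{
    for (size_t i = 0; i < MAX_CLIENTS; i++) {
        if (g_clients[i].in_use && g_clients[i].handle == h)
            return &g_clients[i];
    }
    return NULL;
}

/* Pattern under discussion: a full lookup whose result is used only
 * as a yes/no answer and then discarded. */
static Status control_with_validation(Handle hClient)
{
    if (client_lookup(hClient) == NULL)
        return STATUS_INVALID_CLIENT;
    return STATUS_OK; /* the control would be dispatched here */
}

/* Plain removal of the lookup: nothing rejects an unallocated handle. */
static Status control_without_validation(Handle hClient)
{
    (void)hClient;
    return STATUS_OK;
}
```

In this sketch, a handle that was never allocated is rejected by `control_with_validation` but accepted by `control_without_validation`, which mirrors the concern that removing the call lets apps reach further into the control path without allocating a client.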

CLAassistant commented 3 months ago

CLA assistant check
All committers have signed the CLA.

mtijanic commented 3 months ago

Hey, I think we'll merge this change as-is, and then maybe handle the rest as a separate thing, or not at all. I'll start the process of applying it internally, but it will likely only show up in the r565.xx release. This PR will have to stay open until then. Sorry for the slow and arcane process.

And thanks again for the PR, this is really appreciated! If you come across anything else, please let us know, as a PR or a bug report.

Cwndmiao commented 3 months ago

Thanks for your timely reply :)

mtijanic commented 2 months ago

Unfortunately, we had to revert it from 565.xx because it consistently breaks certain Windows tests. I filed internal bug 4749826 to root-cause that, and we'll re-apply the change once the issue is understood.

(Apparently something in Windows usermode depends on this behavior. Possibly it sends a control that is invalid in multiple ways, e.g. a bad hClient and bad parameters, and expects a specific error status. With this change, if a call has multiple problems, we might fail for another reason before returning NV_ERR_INVALID_CLIENT.)
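The ordering effect can be sketched as follows. This is an illustration under assumptions, not the driver's code: the `Call` struct, the `STATUS_*` values, and both functions are hypothetical, and the actual Windows caller and the actual second check involved are unknown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes, named for illustration only. */
typedef enum {
    STATUS_OK = 0,
    STATUS_INVALID_CLIENT,
    STATUS_INVALID_ARGUMENT
} Status;

/* Hypothetical description of one control call. */
typedef struct {
    int         client_valid; /* nonzero if hClient refers to a real client */
    const void *params;       /* NULL models "bad parameters" */
} Call;

/* Before the change: the client check runs first, so a call that is
 * wrong in several ways reports the client error. */
static Status control_client_first(const Call *c)
{
    if (!c->client_valid)
        return STATUS_INVALID_CLIENT;
    if (c->params == NULL)
        return STATUS_INVALID_ARGUMENT;
    return STATUS_OK;
}

/* After the change: with the early client check gone, the same call
 * fails on whichever later check it reaches first. */
static Status control_params_only(const Call *c)
{
    if (c->params == NULL)
        return STATUS_INVALID_ARGUMENT;
    return STATUS_OK;
}

/* Small self-check showing the two paths disagree on a doubly bad call. */
static void demo_error_ordering(void)
{
    Call bad = { 0, NULL };
    assert(control_client_first(&bad) == STATUS_INVALID_CLIENT);
    assert(control_params_only(&bad) == STATUS_INVALID_ARGUMENT);
}
```

A caller that compares the returned status against one specific value would see its behavior change between the two versions even though the call was invalid in both.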