Fix race condition around subprocess inference state tidyup

PeterJCLaw commented 3 months ago

Summary

There was a race condition due to the combination of Python's object ids being re-usable and Jedi persisting such ids beyond the real lifeteime of some objects. This could lead to the subprocess' view of the lifetime of InferenceState contexts getting out of step with that in the parent process and resulting in errors when removing them. It is also possible that this could result in erroneous results being reported, however this was not directly observed.

The race was specifically:

InferenceState A created, gets id 1
InferenceStateSubprocess A' created, uses InferenceState A which it stores as a weakref and an id
InferenceStateSubprocess A' is used, the sub-process learns about an InferenceState with id 1
InferenceState A goes away, InferenceStateSubprocess A' is not yet garbage collected
InferenceState B created, gets id 1
InferenceStateSubprocess B' created, uses InferenceState B which it stores as a weakref and an id
InferenceStateSubprocess B' is used, the sub-process re-uses its entry for an InferenceState with id 1

At this point the order of operations between the two InferenceStateSubprocess instances going away is immaterial -- both will trigger a removal of a state with id 1. As long as B' doesn't try to use the sub-process again after the first removal has happened then the second removal will fail.

This change resolves the race condition by coupling the context in the subprocess to the corresponding manager class instance in the parent process, rather than to the consumer InferenceState.

See inline comments for further details.

Code review

Suggest review by commit. There's some small refactors in the early commits. The actual fix is the last commit on the branch.

Testing

I'm open to ideas on how to add a test for this.

Currently this is relying on Python 3.13 tests reproducing the issue fairly easily, though not entirely consistently.

Links

Currently builds on https://github.com/davidhalter/jedi/pull/2003 (as it was discovered there), though it doesn't really need to be part of the upgrade process.

PeterJCLaw commented 3 months ago

Thanks :D

For clarity: I'm keeping this on hold as I've started seeing test failures on my python3.13 branch having rebased on top of this branch (and the changes here now rebased on master). I'm not sure of the cause just yet as the failures are new and didn't show up in https://github.com/PeterJCLaw/jedi/pull/3 despite that PR ought to having been functionally the same code. The only difference between the two I think is my linting PR #2006, which I hadn't expected to change anything. I'll hopefully have some time this week to pin down the issue.

PeterJCLaw commented 3 months ago

For clarity: I'm keeping this on hold as I've started seeing test failures on my python3.13 branch having rebased on top of this branch (and the changes here now rebased on master). I'm not sure of the cause just yet as the failures are new and didn't show up in PeterJCLaw#3 despite that PR ought to having been functionally the same code.

I believe these are down to some changes in Python 3.13.0-beta.3 (where previously CI was using beta.2), and I'm happy that that's not related to the changes in this PR.

davidhalter commented 3 months ago

That makes sense. The pull request itself feels very simple and I am not surprised it's not your fault :)

davidhalter / jedi