Closed · Alexander-N closed this issue 4 years ago
When developing PyO3 on Windows with Python 3.8, the test wobbles for me occasionally too.
This is a real head-scratcher. I would expect possible "flakiness" for the hypothesis-based tests based on rare edge cases, but `test_datetime_check` should be entirely deterministic (and not dependent on platform, either). That test just constructs a few `datetime` subclasses and calls the "check" functions on them, pretty dead simple.
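To make concrete why such a test should be deterministic, here is a minimal Python sketch of the same idea (this is illustrative, not PyO3's actual test code; `datetime_check` and `SubDatetime` are hypothetical names): build a `datetime` subclass and run an `isinstance`-style check on it.

```python
import datetime

class SubDatetime(datetime.datetime):
    pass

def datetime_check(obj):
    # Mirrors a PyDateTime_Check-style test: subclasses count as datetimes.
    return isinstance(obj, datetime.datetime)

d = SubDatetime(2020, 1, 1)
assert datetime_check(d)                            # subclass instance passes
assert not datetime_check(datetime.date(2020, 1, 1))  # a plain date does not
```

Nothing here depends on timing, threads, or platform, which is what makes intermittent failures so surprising.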
Maybe we should start running the tests with `RUST_BACKTRACE=1`, so we can get a bit more info in hard-to-repro cases like this?
My immediate thoughts are some sort of thread-safety issue, or possibly another test setting the Python exception state and returning without checking it (which would then raise an exception in whatever the next thing to run is that checks the exception state).
One possibility is that this is bleeding over from some issue with these out-of-bounds tests, but that shouldn't be Windows-specific.
I had this fail on me on CI earlier today. I think it was in a Linux environment?!
> I had this fail on me on CI earlier today. I think it was in a Linux environment?!

I don't think so - maybe WSL or something, but a lot of the paths use `\` as a separator.
I can't really make heads or tails of that backtrace. I guess what's happening is that something in `_get_subclasses` is breaking, but `PyResult` is `Result<..., ()>`, so `?` doesn't seem to propagate anything about the inner error (not even the line it's on :().
I assume it's some sort of thread-safety issue. Maybe `py.import("datetime")` isn't as thread-safe as we thought? That's the only thing in there that would obviously mutate any global variable.
Anyone with a Windows machine want to try and make this slightly more reproducible? It might help to use a `Barrier` in all the different test functions to try and more readily induce race conditions.
We might be able to make this less flaky by locking around `_get_subclasses` or around whatever subset of the test has the race condition, but it's a bit hard to confidently make a change like that without being able to reproduce the bug. Plus it would be good to know if the root cause is something to be addressed in PyO3 or something to warn people about in user code.
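The `Barrier` idea can be sketched in plain Python (names here are illustrative, not from the PyO3 test suite): every thread blocks at the barrier until all of them are ready, so the racy section runs with maximal concurrency instead of depending on thread start-up timing.

```python
import threading

N_THREADS = 4
barrier = threading.Barrier(N_THREADS)
results = []
lock = threading.Lock()

def racy_test(i):
    barrier.wait()   # all threads reach this point before any proceeds
    with lock:       # stand-in for the section suspected of racing
        results.append(i)

threads = [threading.Thread(target=racy_test, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(results) == [0, 1, 2, 3]
```

The same pattern in the Rust tests (e.g. with `std::sync::Barrier`) would make the interleaving much more repeatable.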
> but `PyResult` is `Result<..., ()>`

`PyResult` is `Result<..., PyErr>`, but the test has a `map_err(|e| e.print(py))` which is printing and throwing away the `PyErr`.
Frustratingly, I can't see the result of `e.print()` anywhere in that output - maybe Python wrote to its output buffer but did not flush?!
> Anyone with a Windows machine want to try and make this slightly more reproducible?

I'm tempted to run the test suite in a loop on my machine this morning and see how long it takes to get a failure 😅
I ran it for a bit and added some debug printing too.
There seems to be a race going on where multiple tests are making progress at the same time. This seems very odd given that all of the tests are supposedly acquiring the GIL... ?
```
running 4 tests
{'timedelta': <class 'datetime.timedelta'>}
{'date': <class 'datetime.date'>}
{'datetime': <class 'datetime.datetime'>}
{'time': <class 'datetime.time'>}
{'timedelta': <class 'datetime.timedelta'>, 'Subklass': <class '__main__.Subklass'>}
{'date': <class 'datetime.date'>, 'Subklass': <class '__main__.Subklass'>}
test test_delta_check ... {'datetime': <class 'datetime.datetime'>, 'Subklass': <class '__main__.Subklass'>}
ok
{'time': <class 'datetime.time'>, 'Subklass': <class '__main__.Subklass'>}
test test_date_check ... FAILED
test test_datetime_check ... FAILED
test test_time_check ... FAILED
```
Gotta head to work now, but I might pick this up again at a later date...
...might be related to #288?
> There seems to be a race going on where multiple tests are making progress at the same time. This seems very odd given that all of the tests are supposedly acquiring the GIL... ?
This is an easy mistake to make (I made it when first building the `datetime` support), but when you create a `GILGuard` object with `acquire_gil`, you don't actually hold the GIL for the lifetime of the object - what is guaranteed is that whenever your Rust code is executing you hold the GIL. Python and C functions are free to release the GIL, and in fact they will do so any time they, among other things, do an I/O operation like attempting to import a module.
This is important to keep in mind because it can lead to deadlocks if you attempt to layer on additional locking mechanisms - see this comment about how we avoid deadlocks in datetime.
With regard to "multiple tests making progress at the same time", I think that's perfectly fine, but it seems like maybe the issue is that in `_get_subclasses` we're overwriting what `Subklass` means (I'm not sure what namespace independent calls to `py.run` use, but it looks like maybe it's `__main__`)?
If that's the problem, I'm kinda surprised we haven't seen this problem before now, and that it's only happening on Windows machines. If that turns out to be the issue, it should be pretty simple to either give these things unique names or put them in a unique namespace.
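The suspected collision can be demonstrated with plain `exec`, which behaves like `py.run` with an explicit namespace (this is a sketch of the hypothesis, not PyO3's code): two snippets that both define `Subklass` in one shared namespace clobber each other, while separate namespaces stay independent.

```python
import datetime

# Shared namespace: the second definition silently replaces the first.
shared = {"datetime": datetime}
exec("class Subklass(datetime.date): pass", shared)
first = shared["Subklass"]
exec("class Subklass(datetime.time): pass", shared)
assert shared["Subklass"] is not first  # first definition was overwritten

# Per-test namespaces: no interference between the two definitions.
ns1, ns2 = {"datetime": datetime}, {"datetime": datetime}
exec("class Subklass(datetime.date): pass", ns1)
exec("class Subklass(datetime.time): pass", ns2)
assert ns1["Subklass"] is not ns2["Subklass"]
```

If the tests all run against `__main__`, the shared-namespace case is exactly the situation they would be in.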
> but it seems like maybe the issue is that in `_get_subclasses` we're overwriting what `Subklass` means (I'm not sure what namespace independent calls to `py.run` use, but it looks like maybe it's `__main__`)? If that's the problem, I'm kinda surprised we haven't seen this problem before now, and that it's only happening on Windows machines. If that turns out to be the issue, it should be pretty simple to either give these things unique names or put them in a unique namespace.
I tried to make the names unique for each test: https://github.com/Alexander-N/pyo3/pull/3/files but the test still fails https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=420934780.
Strange. Also, I can't see the traceback being printed out in the exception message - might need to run the tests with the `PYTHONUNBUFFERED=1` environment variable?
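The buffering hypothesis is easy to demonstrate: when stdout is a pipe (as on CI), it is block-buffered, so output written just before a hard crash can be lost unless the interpreter runs unbuffered (`-u`, equivalent to `PYTHONUNBUFFERED=1`). A small self-contained check:

```python
import subprocess
import sys

# Child prints, then exits hard via os._exit, which skips buffer flushing.
child = "print('traceback would go here'); import os; os._exit(1)"

buffered = subprocess.run([sys.executable, "-c", child],
                          capture_output=True, text=True)
unbuffered = subprocess.run([sys.executable, "-u", "-c", child],
                            capture_output=True, text=True)

assert buffered.stdout == ""              # output stuck in the block buffer
assert "traceback" in unbuffered.stdout   # -u writes through immediately
```

A segfaulting test process is in exactly the `os._exit`-like situation: whatever `e.print()` wrote to the buffer never reaches the log.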
Not sure if this is related to buffering, sometimes I get a NameError like:
```
NameError: name 'SubSubklassdatetime' is not defined
```
https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=421093043
And sometimes not: https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=421092950
Ah yeah, probably not buffering. It's `test_date_check` which failed in the one without a traceback, and that test doesn't have an `e.print()` statement in it.
I wonder, could it be some kind of ownership issue where these things have a zero reference count and are getting GC'd out from underneath us? I don't see how that could be possible though...
> I wonder, could it be some kind of ownership issue where these things have a zero reference count and are getting GC'd out from underneath us? I don't see how that could be possible though...

I think that's exactly what's happening. In my branch the error is also reproducible on Linux, and it disappears if I comment out the releasing of owned objects when the `GILGuard` is dropped:
```rust
pub unsafe fn drain(&mut self, _py: Python, owned: usize, borrowed: usize, pointers: bool) {
    // Release owned objects (call decref)
    // while owned < self.owned.len() {
    //     let last = self.owned.pop_back().unwrap();
    //     ffi::Py_DECREF(last.as_ptr());
    // }
    // Release borrowed objects (don't call decref)
    self.borrowed.truncate(borrowed);
    if pointers {
        self.release_pointers();
    }
    self.obj.clear();
}
```
Depending on the timing of the thread exits, I either get a NameError or SIGSEGVs like these:
```
running 8 tests
[New Thread 0x7ffff76e7700 (LWP 3538)]
[New Thread 0x7fffef4e6700 (LWP 3539)]
[New Thread 0x7ffff6f22700 (LWP 3540)]
[New Thread 0x7ffff6ce1700 (LWP 3541)]
[New Thread 0x7ffff6ae0700 (LWP 3542)]
[New Thread 0x7ffff68df700 (LWP 3543)]
[New Thread 0x7ffff66de700 (LWP 3544)]
[New Thread 0x7ffff64dd700 (LWP 3545)]
subclass ok
test test_datetime_utc ... ok
[Thread 0x7ffff6f22700 (LWP 3540) exited]
test test_time_check ... FAILED
[Thread 0x7ffff64dd700 (LWP 3545) exited]
subclass ok
subsubclass ok
instance subclass ok
test test_pytime_out_of_bounds ... ok
[Thread 0x7ffff66de700 (LWP 3544) exited]
test test_date_check ... FAILED
[Thread 0x7ffff76e7700 (LWP 3538) exited]
subclass ok
test test_pydate_out_of_bounds ... ok
subclass ok
[Thread 0x7ffff6ae0700 (LWP 3542) exited]
subsubclass ok
Thread 5 "test_delta_chec" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6ce1700 (LWP 3541)]
_int_free (av=0x7fffe4000020, p=0x7fffe4003ec0, have_lock=<optimized out>) at malloc.c:4341
4341	malloc.c: No such file or directory.
```
```
running 8 tests
[New Thread 0x7ffff76e7700 (LWP 3577)]
[New Thread 0x7fffeffff700 (LWP 3578)]
[New Thread 0x7ffff6f22700 (LWP 3579)]
[New Thread 0x7ffff6ce1700 (LWP 3580)]
[New Thread 0x7ffff6ae0700 (LWP 3581)]
[New Thread 0x7ffff68df700 (LWP 3582)]
[New Thread 0x7ffff66de700 (LWP 3583)]
[New Thread 0x7ffff64dd700 (LWP 3584)]
test test_pytime_out_of_bounds ... ok
[Thread 0x7ffff66de700 (LWP 3583) exited]
test test_datetime_utc ... ok
[Thread 0x7ffff6f22700 (LWP 3579) exited]
test test_pydate_out_of_bounds ... ok
[Thread 0x7ffff6ae0700 (LWP 3581) exited]
test test_pydatetime_out_of_bounds ... [Thread 0x7ffff68df700 (LWP 3582) exited]
ok
subclass ok
subsubclass ok
instance subclass ok
instance subsubclass ok
test test_datetime_check ... ok
[Thread 0x7fffeffff700 (LWP 3578) exited]
subclass ok
subclass ok
subclass ok
subsubclass ok
subsubclass ok
subsubclass ok
instance subclass ok
instance subclass ok
instance subclass ok
instance subsubclass ok
instance subsubclass ok
test test_date_check ... ok
[Thread 0x7ffff76e7700 (LWP 3577) exited]
Thread 9 "test_time_check" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff64dd700 (LWP 3584)]
0x000055555556eab9 in pyo3::ffi::object::Py_TYPE (ob=0x0) at /home/a/projects/pyo3/src/ffi/object.rs:91
91	    (*ob).ob_type
```
Oh crikey. I guess what must be happening is that when the first test finishes and drops its acquired `GILGuard`, the entire release pool gets drained even though other tests / threads are still reliant on the owned pointers in the pool.
A fix could potentially be to make the release pools thread-local? Or to have some kind of reference counting? If we eliminate the release pool as part of #679 then that might fix this too.
The error seems to occur only if Python modules are imported. I think what happens is something along these lines: the imports release the GIL and the wrong object is popped from the release pool.
For example this reproduces it fairly consistently for me:
```rust
#[test]
fn test_release_race1() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    println!("make ob1");
    let ob1 = py.eval("object()", None, None).unwrap();
    py.import("datetime").unwrap();
    py_run!(py, ob1, "print('ob1' + str(ob1))");
}

#[test]
fn test_release_race2() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    println!("make ob2");
    let ob2 = py.eval("object()", None, None).unwrap();
    py.import("datetime").unwrap();
    py_run!(py, ob2, "print('ob2' + str(ob2))");
}
```
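The suspected interleaving in those two tests can be simulated in plain Python (no PyO3 involved; the names mirror the Rust tests but the pool and `drain` are simplified stand-ins): both tests push their object into one shared release pool while the GIL bounces between threads during the import, and the first guard's drop then frees everything in the pool.

```python
pool = []    # stand-in for the shared release pool
freed = []

def make_object(name):
    pool.append(name)   # creating ob1/ob2 registers it in the pool
    return name

def drain():
    while pool:
        freed.append(pool.pop())   # stand-in for Py_DECREF on each entry

ob1 = make_object("ob1")   # test_release_race1 runs up to its import
ob2 = make_object("ob2")   # GIL released by the import; race2 runs and pushes ob2
drain()                    # race1's GILGuard drops and drains the whole pool
assert freed == ["ob2", "ob1"]   # both freed, but race2 still needs ob2 alive
```

Accessing `ob2` after this drain is the use-after-free that shows up as a SIGSEGV in `Py_TYPE` or in `_int_free`.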
After reading https://github.com/PyO3/pyo3/issues/720#issuecomment-578217918 again, I'll try to make the idea clearer:
Does this sound plausible?
This sounds exactly like what I'm thinking must be going on. Presumably it would be possible to verify this ordering by adding a few more `println!` statements to your example.
Though I think:

> test_release_race1 finishes and with dropping of its GILGuard ob2 instead of ob1 is popped from the release pool

what actually happens is that both `ob2` and `ob1` are being freed.
> Though I think:
>
> > test_release_race1 finishes and with dropping of its GILGuard ob2 instead of ob1 is popped from the release pool
>
> what actually happens is that both `ob2` and `ob1` are being freed.
I thought that if there was one owned object in the release pool when the second test acquires its `GILGuard`, then `owned == 1`, and the `while owned < self.owned.len()` condition is in place to ensure that there is still one owned object left after draining the release pool:
```rust
while owned < self.owned.len() {
    let last = self.owned.pop_back().unwrap();
    ffi::Py_DECREF(last.as_ptr());
}
```
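That watermark logic can be simulated directly (a sketch, with a plain list standing in for the pool and strings for the objects): `owned` records the pool length when a guard is created, and the drain only pops entries past that watermark, so earlier entries owned by other guards should survive.

```python
def drain(pool, owned):
    released = []
    while owned < len(pool):
        released.append(pool.pop())   # stand-in for Py_DECREF
    return released

pool = ["ob1"]            # ob1 was pushed before the second guard existed
owned = len(pool)         # second guard records watermark = 1
pool.append("ob2")        # second test pushes ob2
assert drain(pool, owned) == ["ob2"]   # only ob2 is released
assert pool == ["ob1"]                 # ob1 survives this drain
```

So if the watermark were recorded and applied correctly, ob1 should be safe - which suggests the actual bug is in how `owned` is captured across the GIL release, not in the loop itself.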
Ah, sorry, yes, you're right! I should have read that code closer 😄
@Alexander-N @davidhewitt thank you for useful discussions!
> I think what happens is something along the lines of: The imports release the GIL and the wrong object is popped from the release pool.
OMG, that sounds like a serious problem to me :thinking:
Agreed! This seems like another reason to consider replacing the release pool with a better solution. (Along with #679)
This is quite a long thread now and it's sorta misleadingly titled since "Flaky datetime test" is just the proximate symptom, not the root problem to be solved. Might I suggest the following course of action:
1. Open a new, accurately-titled issue for the root cause (the release pool problem), summarizing what has been learned in this thread.
2. Put a lock around the `_get_subclasses()` function, so that the `datetime` test is no longer flaky, with a `// TODO` comment to remove it when the problem is fixed.

That way people won't get hit with erroneous failures, and if we lose momentum on solving this issue, there will be a cleaner "entry point" for new contributors rather than reading through a long thread ostensibly about a flaky test.
Good work everyone!
Good point, I opened #756.
I believe a workaround was introduced in #867, so I will close this for now and leave #756 open to track the fundamental issue.
The test `test_datetime_check` fails from time to time with an error like:

It could be that this occurs only on Windows with Python 3.6: https://github.com/PyO3/pyo3/pull/719/checks?check_run_id=384307686 and https://github.com/PyO3/pyo3/pull/651#issuecomment-546640959.

@pganssle Do you have any idea what could cause this?