PyO3 / pyo3

Rust bindings for the Python interpreter
https://pyo3.rs
Apache License 2.0
12.17k stars · 751 forks

Flaky datetime test #720

Closed: Alexander-N closed this issue 4 years ago

Alexander-N commented 4 years ago

The test test_datetime_check fails from time to time with an error like

---- test_datetime_check stdout ----
thread 'test_datetime_check' panicked at 'called `Result::unwrap()` on an `Err` value: PyErr { type: Py(0x60d15300, PhantomData) }', src\libcore\result.rs:1192:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.

It could be that this occurs only on Windows with Python 3.6: https://github.com/PyO3/pyo3/pull/719/checks?check_run_id=384307686 and https://github.com/PyO3/pyo3/pull/651#issuecomment-546640959.

@pganssle Do you have any idea what could cause this?

davidhewitt commented 4 years ago

When developing PyO3 on Windows with Python 3.8, the test wobbles for me occasionally too.

pganssle commented 4 years ago

This is a real head-scratcher. I would expect possible "flakiness" in the Hypothesis-based tests due to rare edge cases, but test_datetime_check should be entirely deterministic (and not dependent on platform, either). That test just constructs a few datetime subclasses and calls the "check" functions on them - pretty dead simple.

Maybe we should start running the tests with RUST_BACKTRACE=1, so we can get a bit more info in hard-to-repro cases like this?

My immediate thoughts are some sort of thread-safety issue, or possibly another test setting the Python exception state and returning without checking it (which would then raise an exception in whatever the next thing to run is that checks the exception state).

One possibility is that this is bleeding over from some issue with these out-of-bounds tests, but that shouldn't be Windows-specific.

davidhewitt commented 4 years ago

I had this fail on me on CI earlier today. I think it was in a Linux environment?!

https://github.com/davidhewitt/pyo3/commit/43df3299abc3cfdf006914c44d5b7f3d0a987f6c/checks?check_suite_id=414584878

pganssle commented 4 years ago

I had this fail on me on CI earlier today. I think it was in a Linux environment?!

I don't think so - maybe WSL or something, but a lot of the paths use \ as a separator.

I can't really make heads or tails of that backtrace. I guess what's happening is that something in _get_subclasses is breaking, but PyResult is Result<..., ()>, so ? doesn't seem to propagate anything about the inner error (not even the line it's on :().

I assume it's some sort of thread-safety issue. Maybe py.import("datetime")? isn't as thread-safe as we thought? That's the only thing in there that would obviously mutate any global variable.

Anyone with a Windows machine want to try and make this slightly more reproducible? It might help to use a Barrier in all the different test functions to try and more readily induce race conditions.

We might be able to make this less flaky by locking around _get_subclasses or around whatever subset of the test has the race condition, but it's a bit hard to confidently make a change like that without being able to reproduce the bug. Plus it would be good to know if the root cause is something to be addressed in PyO3 or something to warn people about in user code or something.

davidhewitt commented 4 years ago

but PyResult is Result<..., ()>

PyResult is Result<..., PyErr>, but the test has a map_err(|e| e.print(py)) which is printing and throwing away the PyErr.

Frustratingly I can't see anywhere in that output the result of e.print() - maybe Python wrote to its output buffer but did not flush?!

davidhewitt commented 4 years ago

Anyone with a Windows machine want to try and make this slightly more reproducible?

I'm tempted to run the test suite in a loop on my machine this morning and see how long it takes to get a failure 😅

davidhewitt commented 4 years ago

I ran it for a bit and added some debug printing too.

There seems to be a race going on where multiple tests are making progress at the same time. This seems very odd given that all of the tests are supposedly acquiring the GIL... ?

running 4 tests
{'timedelta': <class 'datetime.timedelta'>}
{'date': <class 'datetime.date'>}
{'datetime': <class 'datetime.datetime'>}
{'time': <class 'datetime.time'>}
{'timedelta': <class 'datetime.timedelta'>, 'Subklass': <class '__main__.Subklass'>}
{'date': <class 'datetime.date'>, 'Subklass': <class '__main__.Subklass'>}
test test_delta_check ... {'datetime': <class 'datetime.datetime'>, 'Subklass': <class '__main__.Subklass'>}
ok
{'time': <class 'datetime.time'>, 'Subklass': <class '__main__.Subklass'>}
test test_date_check ... FAILED
test test_datetime_check ... FAILED
test test_time_check ... FAILED

Gotta head to work now, but I might pick this up again at a later date...

davidhewitt commented 4 years ago

...might be related to #288?

pganssle commented 4 years ago

There seems to be a race going on where multiple tests are making progress at the same time. This seems very odd given that all of the tests are supposedly acquiring the GIL... ?

This is an easy mistake to make (I made it when first building the datetime support), but when you create a GILGuard object with acquire_gil, you don't actually hold the GIL for the lifetime of the object - what is guaranteed is that whenever your Rust code is executing, you hold the GIL. Python and C functions are free to release the GIL, and in fact they will do so any time they, among other things, perform an I/O operation like attempting to import a module.

This is important to keep in mind because it can lead to deadlocks if you attempt to layer on additional locking mechanisms - see this comment about how we avoid deadlocks in datetime.

With regard to "multiple tests making progress at the same time", I think that's perfectly fine, but it seems like maybe the issue is that in _get_subclasses we're overwriting what Subklass means (I'm not sure what namespace independent calls to py.run use, but it looks like maybe it's __main__)?

If that's the problem, I'm kinda surprised we haven't seen this problem before now, and that it's only happening on Windows machines. If that turns out to be the issue, it should be pretty simple to either give these things unique names or put them in a unique namespace.

Alexander-N commented 4 years ago

but it seems like maybe the issue is that in _get_subclasses we're overwriting what Subklass means (I'm not sure what namespace independent calls to py.run use, but it looks like maybe it's __main__)?

If that's the problem, I'm kinda surprised we haven't seen this problem before now, and that it's only happening on Windows machines. If that turns out to be the issue, it should be pretty simple to either give these things unique names or put them in a unique namespace.

I tried to make the names unique for each test: https://github.com/Alexander-N/pyo3/pull/3/files but the test still fails https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=420934780.

davidhewitt commented 4 years ago

I tried to make the names unique for each test: https://github.com/Alexander-N/pyo3/pull/3/files but the test still fails https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=420934780.

Strange. Also, in the exception message I can't see the traceback being printed out - might need to run the tests with the PYTHONUNBUFFERED=1 environment variable?

Alexander-N commented 4 years ago

Not sure if this is related to buffering, sometimes I get a NameError like:

NameError: name 'SubSubklassdatetime' is not defined

https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=421093043

And sometimes not: https://github.com/Alexander-N/pyo3/pull/3/checks?check_run_id=421092950

davidhewitt commented 4 years ago

Ah yeah, probably not buffering. It was test_date_check that failed in the run without a traceback, and that test doesn't have an e.print() statement in it.

I wonder, could it be some kind of ownership issue where these things have a zero reference count and are getting GC'd out from underneath us? I don't see how that could be possible though...

Alexander-N commented 4 years ago

I wonder, could it be some kind of ownership issue where these things have a zero reference count and are getting GC'd out from underneath us? I don't see how that could be possible though...

I think that's exactly what's happening. In my branch the error is also reproducible on Linux, and it disappears if I comment out the releasing of owned objects when the GILGuard is dropped:

    pub unsafe fn drain(&mut self, _py: Python, owned: usize, borrowed: usize, pointers: bool) {
        // Release owned objects (call decref)
        // while owned < self.owned.len() {
        //     let last = self.owned.pop_back().unwrap();
        //     ffi::Py_DECREF(last.as_ptr());
        // }
        // Release borrowed objects (don't call decref)
        self.borrowed.truncate(borrowed);

        if pointers {
            self.release_pointers();
        }

        self.obj.clear();
    }

Depending on the timing of the thread exits I either get a NameError or SIGSEGVs like these:

running 8 tests
[New Thread 0x7ffff76e7700 (LWP 3538)]
[New Thread 0x7fffef4e6700 (LWP 3539)]
[New Thread 0x7ffff6f22700 (LWP 3540)]
[New Thread 0x7ffff6ce1700 (LWP 3541)]
[New Thread 0x7ffff6ae0700 (LWP 3542)]
[New Thread 0x7ffff68df700 (LWP 3543)]
[New Thread 0x7ffff66de700 (LWP 3544)]
[New Thread 0x7ffff64dd700 (LWP 3545)]
subclass ok
test test_datetime_utc ... ok
[Thread 0x7ffff6f22700 (LWP 3540) exited]
test test_time_check ... FAILED
[Thread 0x7ffff64dd700 (LWP 3545) exited]
subclass ok
subsubclass ok
instance subclass ok
test test_pytime_out_of_bounds ... ok
[Thread 0x7ffff66de700 (LWP 3544) exited]
test test_date_check ... FAILED
[Thread 0x7ffff76e7700 (LWP 3538) exited]
subclass ok
test test_pydate_out_of_bounds ... ok
subclass ok
[Thread 0x7ffff6ae0700 (LWP 3542) exited]
subsubclass ok

Thread 5 "test_delta_chec" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6ce1700 (LWP 3541)]
_int_free (av=0x7fffe4000020, p=0x7fffe4003ec0, have_lock=<optimized out>) at malloc.c:4341
4341    malloc.c: No such file or directory.

running 8 tests
[New Thread 0x7ffff76e7700 (LWP 3577)]
[New Thread 0x7fffeffff700 (LWP 3578)]
[New Thread 0x7ffff6f22700 (LWP 3579)]
[New Thread 0x7ffff6ce1700 (LWP 3580)]
[New Thread 0x7ffff6ae0700 (LWP 3581)]
[New Thread 0x7ffff68df700 (LWP 3582)]
[New Thread 0x7ffff66de700 (LWP 3583)]
[New Thread 0x7ffff64dd700 (LWP 3584)]
test test_pytime_out_of_bounds ... ok
[Thread 0x7ffff66de700 (LWP 3583) exited]
test test_datetime_utc ... ok
[Thread 0x7ffff6f22700 (LWP 3579) exited]
test test_pydate_out_of_bounds ... ok
[Thread 0x7ffff6ae0700 (LWP 3581) exited]
test test_pydatetime_out_of_bounds ... [Thread 0x7ffff68df700 (LWP 3582) exited]
ok
subclass ok
subsubclass ok
instance subclass ok
instance subsubclass ok
test test_datetime_check ... ok
[Thread 0x7fffeffff700 (LWP 3578) exited]
subclass ok
subclass ok
subclass ok
subsubclass ok
subsubclass ok
subsubclass ok
instance subclass ok
instance subclass ok
instance subclass ok
instance subsubclass ok
instance subsubclass ok
test test_date_check ... ok
[Thread 0x7ffff76e7700 (LWP 3577) exited]

Thread 9 "test_time_check" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff64dd700 (LWP 3584)]
0x000055555556eab9 in pyo3::ffi::object::Py_TYPE (ob=0x0) at /home/a/projects/pyo3/src/ffi/object.rs:91
91      (*ob).ob_type

davidhewitt commented 4 years ago

Oh crikey. I guess what must be happening is that when the first test finishes and drops its acquired GILGuard, the entire release pool gets drained even though other tests / threads are still reliant on the owned pointers in the pool.

A fix could potentially be to make the release pools thread-local? Or to have some kind of reference counting? If we eliminate the release pool as part of #679 then that might fix this too.

Alexander-N commented 4 years ago

The error seems to occur only if Python modules are imported. I think what happens is something along the lines of: the imports release the GIL and the wrong object is popped from the release pool.

For example this reproduces it fairly consistently for me:

#[test]
fn test_release_race1() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    println!("make ob1");
    let ob1 = py.eval("object()", None, None).unwrap();
    py.import("datetime").unwrap();
    py_run!(py, ob1, "print('ob1' + str(ob1))");
}

#[test]
fn test_release_race2() {
    let gil = Python::acquire_gil();
    let py = gil.python();
    println!("make ob2");
    let ob2 = py.eval("object()", None, None).unwrap();
    py.import("datetime").unwrap();
    py_run!(py, ob2, "print('ob2' + str(ob2))");
}
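
The suspected interleaving can be replayed deterministically in a std-only toy, with the GIL elided and the shared release pool modeled as a Vec behind a Mutex (names are illustrative, not PyO3 APIs):

```rust
use std::sync::Mutex;

// One process-wide release pool shared by every "test", mirroring the
// shared-pool design under suspicion here.
static POOL: Mutex<Vec<&'static str>> = Mutex::new(Vec::new());

fn main() {
    // test_release_race1 starts: it creates ob1, which lands in the pool.
    POOL.lock().unwrap().push("ob1");
    // py.import("datetime") releases the GIL, so test_release_race2 gets
    // to run: it creates ob2, which lands in the SAME shared pool.
    POOL.lock().unwrap().push("ob2");
    // test_release_race1 now finishes and drops its GILGuard, draining the
    // whole pool - including ob2, which test_release_race2 still needs.
    let drained: Vec<_> = POOL.lock().unwrap().drain(..).collect();
    assert_eq!(drained, ["ob1", "ob2"]); // both objects freed by one guard
    println!("drained: {:?}", drained);
}
```
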
Alexander-N commented 4 years ago

After reading https://github.com/PyO3/pyo3/issues/720#issuecomment-578217918 again, I'll try to make the idea clearer:

Does this sound plausible?

davidhewitt commented 4 years ago

This sounds exactly like what I'm thinking must be going on. Presumably it would be possible to verify this ordering by adding a few more println! statements to your example.

davidhewitt commented 4 years ago

Though I think:

test_release_race1 finishes, and with the dropping of its GILGuard, ob2 instead of ob1 is popped from the release pool

What actually happens is that both ob2 and ob1 are being freed.

Alexander-N commented 4 years ago

Though I think:

test_release_race1 finishes, and with the dropping of its GILGuard, ob2 instead of ob1 is popped from the release pool

What actually happens is that both ob2 and ob1 are being freed.

I thought that if there was one owned object in the release pool when the second test acquires its GILGuard, then owned == 1, and the while owned < self.owned.len() condition is in place to ensure that there is still one owned object after draining the release pool.

        while owned < self.owned.len() {
            let last = self.owned.pop_back().unwrap();
            ffi::Py_DECREF(last.as_ptr());
        }
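
That invariant can be checked against a toy version of drain over a plain Vec (illustrative, not the real PyO3 types):

```rust
fn drain(pool: &mut Vec<u64>, owned: usize) {
    // Pop (and conceptually decref) entries until only the first `owned`
    // objects - those registered before this guard was created - remain.
    while owned < pool.len() {
        let _last = pool.pop().unwrap(); // ffi::Py_DECREF would go here
    }
}

fn main() {
    // ob1 is already in the pool when the second test's guard is created,
    // so that guard records owned == 1.
    let mut pool: Vec<u64> = vec![1]; // ob1
    let owned_watermark = pool.len(); // == 1
    pool.push(2); // ob2, created under the second guard
    drain(&mut pool, owned_watermark);
    // Draining with owned == 1 leaves ob1 alone: only ob2 is released.
    assert_eq!(pool, [1u64]);
    println!("pool after drain: {:?}", pool);
}
```
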
davidhewitt commented 4 years ago

Ah, sorry, yes, you're right! I should have read that code closer 😄

kngwyu commented 4 years ago

@Alexander-N @davidhewitt thank you for useful discussions!

I think what happens is something along the lines of: The imports release the GIL and the wrong object is popped from the release pool.

OMG, it sounds like a serious problem to me :thinking:

davidhewitt commented 4 years ago

Agreed! This seems like another reason to consider replacing the release pool with a better solution. (Along with #679)

pganssle commented 4 years ago

This is quite a long thread now and it's sorta misleadingly titled since "Flaky datetime test" is just the proximate symptom, not the root problem to be solved. Might I suggest the following course of action:

  1. Temporarily add a lock around the contents of the _get_subclasses() function, so that the datetime test is no longer flaky, with a // TODO comment to remove it when the problem is fixed.
  2. Spawn a new issue with a succinct summary of the results of this thread and with the MWE that consistently reproduces the problem.

That way people won't get hit with erroneous failures, and if we lose momentum on solving this issue, there will be a cleaner "entry point" for new contributors rather than reading through a long thread ostensibly about a flaky test.

Good work everyone!

Alexander-N commented 4 years ago

Good point, I opened #756.

davidhewitt commented 4 years ago

I believe a workaround was introduced in #867, so I will close this for now and leave #756 open to track the fundamental issue.