PyO3 / pyo3

Rust bindings for the Python interpreter
https://pyo3.rs
Apache License 2.0

Reproducible crash with pyo3 and tensorflow #2611

Open dbr opened 2 years ago

dbr commented 2 years ago

Bug Description

We encountered an issue that sounds very similar to #2032 and #1623, but "luckily" for us the crash happens quite reliably!

It doesn't seem to be specific to tensorflow, as some pytorch code crashes in a similar way.

Steps to Reproduce

Create a simple project and a virtual env with tensorflow (pandas is also included as it was in the original Python snippet, but I don't think it is relevant)

$ cargo new --bin pyo3_tf_crash
$ cd pyo3_tf_crash/
$ python -m venv _venv
$ source _venv/bin/activate
$ pip install tensorflow pandas
$ cat Cargo.toml
[package]
name = "pyo3_tf_crash"
version = "0.1.0"
edition = "2021"

[dependencies]
pyo3 = {version = "0.17.1", features = ["auto-initialize"]}

And finally, src/main.rs

use pyo3::prelude::*;

static CODE: &'static str = r#"import pandas
d = [
    {'status': 'a', 'title': 'title 1'},
    {'status': 'b', 'title': 'title 2'},
]

data = pandas.DataFrame(d)
sentiment_label = data.status.factorize()

maxlen = 30

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dense, Dropout, SpatialDropout1D
from tensorflow.keras.layers import Embedding

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(data.title.values)

encoded_docs = tokenizer.texts_to_sequences(data.title.values)
padded_sequence = pad_sequences(encoded_docs, maxlen=maxlen)

vocab_size = len(tokenizer.word_index) + 1

embedding_vector_length = 32
model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_length, input_length=maxlen))
model.add(SpatialDropout1D(0.25))
model.add(LSTM(50, dropout=0.5, recurrent_dropout=0.5))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy'])

history = model.fit(padded_sequence, sentiment_label[0], validation_split=0.2, epochs=2, batch_size=32)
"#;

fn start_py_thread() {
    std::thread::spawn(move || loop {
        use pyo3::prelude::*;
        Python::with_gil(|py| {
            py.run(CODE, None, None).unwrap();
        });
    });
}

fn main() {
    // Acquire then drop GIL
    let gil = Python::acquire_gil();
    std::mem::drop(gil);

    // Start thread doing Python things
    start_py_thread();

    loop {
        println!("Waiting");
        std::thread::sleep(std::time::Duration::from_secs(1));
    }
}

Run the program. When the model fitting runs for the second time, it will crash (usually with a segfault, but I think it may occasionally abort or deadlock instead).

$ cargo run
   Compiling pyo3_tf_crash v0.1.0 (/home/dbr/Desktop/pyo3_tf_crash)
warning: use of deprecated associated function `pyo3::Python::<'py>::acquire_gil`: prefer Python::with_gil
  --> src/main.rs:56:23
   |
56 |     let gil = Python::acquire_gil();
   |                       ^^^^^^^^^^^
   |
   = note: `#[warn(deprecated)]` on by default

warning: `pyo3_tf_crash` (bin "pyo3_tf_crash") generated 1 warning
    Finished dev [unoptimized + debuginfo] target(s) in 0.77s
     Running `target/debug/pyo3_tf_crash`
Waiting
Waiting
Waiting
Waiting
Waiting
[...]
Waiting
Epoch 1/2
Waiting
Waiting
Waiting
Waiting
Waiting
Waiting
1/1 [==============================] - ETA: 0s - loss: 0.6860 - accuracy: 1.0000Waiting
1/1 [==============================] - 8s 8s/step - loss: 0.6860 - accuracy: 1.0000 - val_loss: 0.7145 - val_accuracy: 0.0000e+00
Epoch 2/2
1/1 [==============================] - 0s 101ms/step - loss: 0.6703 - accuracy: 1.0000 - val_loss: 0.7226 - val_accuracy: 0.0000e+00
Waiting
Epoch 1/2
Segmentation fault

Backtrace

#0  __GI___pthread_mutex_lock (mutex=0x1b0) at ../nptl/pthread_mutex_lock.c:67
#1  0x00007ffff7bc3fac in ?? () from /lib/x86_64-linux-gnu/libpython3.9.so.1.0
#2  0x00007ffff7bc443e in PyEval_AcquireThread () from /lib/x86_64-linux-gnu/libpython3.9.so.1.0
#3  0x00007fffa3328195 in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /home/dbr/Desktop/pyo3_tf_crash/_venv/lib/python3.9/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#4  0x00007fffa3345b5c in pybind11::cpp_function::initialize<pybind11_init__pywrap_tf_session(pybind11::module_&)::{lambda(TF_Operation*, char const*)#25}, pybind11::object, TF_Operation*, char const*, pybind11::name, pybind11::scope, pybind11::sibling>(pybind11_init__pywrap_tf_session(pybind11::module_&)::{lambda(TF_Operation*, char const*)#25}&&, pybind11::object (*)(TF_Operation*, char const*), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) () from /home/dbr/Desktop/pyo3_tf_crash/_venv/lib/python3.9/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#5  0x00007fffa33468fc in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /home/dbr/Desktop/pyo3_tf_crash/_venv/lib/python3.9/site-packages/tensorflow/python/client/_pywrap_tf_session.so
#6  0x00007ffff7b2dc93 in ?? () from /lib/x86_64-linux-gnu/libpython3.9.so.1.0
#7  0x00007ffff7ae7a50 in _PyObject_MakeTpCall () from /lib/x86_64-linux-gnu/libpython3.9.so.1.0
#8  0x00007ffff7a992f2 in _PyEval_EvalFrameDefault () from /lib/x86_64-linux-gnu/libpython3.9.so.1.0
[...]

Your operating system and version

Linux/Debian 11 Bullseye

Your Python version (python --version)

Python 3.9.2

Your Rust version (rustc --version)

rustc 1.62.0

Your PyO3 version

0.16.4 and 0.17

How did you install python? Did you use a virtualenv?

Using system Python via apt, with a virtualenv as per the repro steps

Additional Info

I haven't tested the trimmed-down reproduction example on Windows, but the code in its original application crashed on both Linux and Windows

dbr commented 2 years ago

Oh, I just noticed that Python::acquire_gil(); is deprecated in 0.17, but changing it to Python::with_gil(|_py|{}); crashes identically

davidhewitt commented 2 years ago

Thank you for sharing this - I'll try to have a play with this in the next week or two and see if I can deduce anything.

davidhewitt commented 1 year ago

Apologies that it's taken me a long time to get around to testing this. I've just done so on both Windows and Linux with the latest tensorflow and PyO3 main, and I don't get a segfault.

Instead I get a deadlock at this line:

model.add(Embedding(vocab_size, embedding_vector_length, input_length=maxlen))

Just copy-pasting that script twice doesn't deadlock, so I suspect there's some interaction with py.run which is triggering this. We've had some other recent reports of issues with py.run (#2891, #2927). I think this one will need some debugging, I'll try and take another look soon.

acertain commented 1 year ago

I was having a problem that looks like this. It was fixed by importing pytorch on the main Rust thread, before spawning any threads.