apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

How do I bring dependencies in my binding? #737

Open dariocurr opened 1 week ago

dariocurr commented 1 week ago

Hi guys, I'm Dario. I have been struggling with an issue and I am trying to understand it.

I am trying to create my own cross-language library on top of datafusion and datafusion-python. Let's call this library my-library.

I created a Rust workspace with two crates:

my-library has datafusion as a dependency and has just one function, which returns a datafusion::execution::context::SessionContext

my-library-python has datafusion-python as a dependency and has just one function, which wraps that datafusion::execution::context::SessionContext in a datafusion_python::context::PySessionContext

Now, when I install my-library-python in my Python env through maturin develop and try to use the SessionContext returned by the binding as follows:

ctx = my_library.get_context()
datafusion_df = ctx.sql(query)

I get the following error:

pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: PyErr { type: <class 'ModuleNotFoundError'>, value: ModuleNotFoundError("No module named 'datafusion'"), traceback: None }

My question then is: why do I have to add datafusion as a dependency of my Python package, duplicating the library? Is there a way to bring the dependency into my binding?

dariocurr commented 1 week ago

The binding is created as follows:

#[pymodule]
fn module(_: Python<'_>, main_module: &PyModule) -> PyResult<()> {
    main_module.add_function(wrap_pyfunction!(get_context, main_module)?)?;
    Ok(())
}

Michael-J-Ward commented 1 week ago

I haven't previously done what you're trying to do, so a minimal GitHub repo that reproduces the problem would be helpful.

This error comes from the Python runtime trying to import datafusion, and it isn't clear to me from your example why or where that would happen.

ModuleNotFoundError("No module named 'datafusion'")

Michael-J-Ward commented 1 week ago

I haven't done any digging to see why the code is like this, but the end result is that you will probably need datafusion as a Python dependency.

https://github.com/apache/datafusion-python/blob/45a684445e25032961a7bb44ced3ce06f5ed9e6d/src/utils.rs#L26-L37
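
One way to confirm this in the environment where maturin develop installed the binding is to check whether Python can locate the datafusion module at all. The following is a minimal sketch using only the standard library. The module name my_library is taken from the snippet earlier in this thread, and is_importable is a hypothetical helper name, not part of any library:

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate `module_name` in the current environment.

    `is_importable` is a hypothetical helper used only for this illustration.
    """
    try:
        # find_spec returns None when a top-level module cannot be found.
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Dotted names raise ImportError if a parent package is missing;
        # ValueError is raised for modules with a broken __spec__.
        return False


if __name__ == "__main__":
    # `my_library` is the binding from this thread; `datafusion` is the
    # Python package the error message says is missing.
    for name in ("datafusion", "my_library"):
        status = "found" if is_importable(name) else "missing"
        print(f"{name}: {status}")
```

If datafusion is reported as missing, installing the datafusion package into the same environment should make it importable. Whether that alone resolves the panic has not been verified here.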

Michael-J-Ward commented 1 week ago

The tokio runtime is stored on the Python heap to ensure it only gets created once, which provided performance improvements.

It could perhaps be created once per SessionContext or similar instead, but that would be a fairly large refactor.

https://github.com/apache/datafusion-python/pull/341
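
To illustrate the trade-off between the two approaches, here is a sketch in plain Python rather than the actual Rust code. Runtime, get_runtime, and Context are hypothetical stand-ins and not datafusion-python APIs. The sketch only shows the difference between one shared, lazily created object and one object per context:

```python
import threading


class Runtime:
    """Hypothetical stand-in for an object that is expensive to create."""


_runtime = None
_runtime_lock = threading.Lock()


def get_runtime() -> Runtime:
    """Return a single process-wide Runtime, creating it on first use."""
    global _runtime
    if _runtime is None:
        with _runtime_lock:
            # Re-check inside the lock so two threads cannot both create one.
            if _runtime is None:
                _runtime = Runtime()
    return _runtime


class Context:
    """Alternative design: each context owns its own Runtime."""

    def __init__(self) -> None:
        self.runtime = Runtime()


if __name__ == "__main__":
    # Shared variant: every caller gets the same object.
    print(get_runtime() is get_runtime())
    # Per-context variant: every context gets a fresh object.
    print(Context().runtime is Context().runtime)
```

The shared variant needs a well-known place to keep the object, which in this sketch is a module-level global. The per-context variant avoids that shared location at the cost of creating more runtimes.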

Michael-J-Ward commented 1 day ago

@andygrove confirmed that your use case is something that datafusion-python should support, and pointed to datafusion-ballista as an example.

@dariocurr, do you have a repo link to share, so that I can investigate further?

dariocurr commented 2 hours ago

I just created an MRE here: my-library-datafusion.

Following the instructions and running:

  1. conda env create
  2. maturin develop
  3. pytest tests

You will get:

ModuleNotFoundError("No module named 'datafusion'")


I really don't know how it should work; I am here to ask and learn from you.

Thank you for your time.