apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0
321 stars 64 forks source link

fix `lib.name` in `Cargo.toml` #694

Closed samuelcolvin closed 1 month ago

samuelcolvin commented 1 month ago

With the current settings maturin build and pip install . gave me an error when I tried to import datafusion:

Traceback (most recent call last):
  File "/Users/samuel/code/datafusion-python/demo.py", line 1, in <module>
    from datafusion import SessionContext
  File "/Users/samuel/code/datafusion-python/datafusion/__init__.py", line 28, in <module>
    from ._internal import (
ModuleNotFoundError: No module named 'datafusion._internal'
Michael-J-Ward commented 1 month ago

Note for reproducing: make sure to remove the .*so file that maturin develop places next to ./datafusion/__init__.py


@samuelcolvin, I assume this is an instance of this issue that the maturin docs warns about:

This structure is recommended to avoid a common ImportError pitfall

The docs recommend nesting the python source as python/datafusion/, and the only references I find about lib.name recommend keeping it the same.

To create a mixed Rust/Python project, add a directory with your package name (i.e. matching lib.name in your Cargo.toml) to contain the Python source:

Most importantly, my first attempt at using your fix did not appear to fix it (see below)

Do you have any thoughts on this solution vs the recommended one? (I'm still learning about maturin).


❯ maturin build
⚠  Warning: You specified maturin >=0.15, <0.16 in pyproject.toml under `build-system.requires`, but the current maturin version is 1.4.0
🍹 Building a mixed python/rust project
🔗 Found pyo3 bindings with abi3 support for Python ≥ 3.8
🐍 Not using a specific python interpreter
📡 Using build options features, locked from pyproject.toml
   Compiling pyo3-build-config v0.20.2
   Compiling pyo3-ffi v0.20.2
   Compiling pyo3 v0.20.2
   Compiling datafusion-python v37.1.0 (/home/mike/workspace/datafusion-python/flake)
   Compiling arrow v51.0.0
   Compiling datafusion-common v37.1.0
   Compiling datafusion-expr v37.1.0
   Compiling datafusion-execution v37.1.0
   Compiling datafusion-sql v37.1.0
   Compiling datafusion-physical-expr v37.1.0
   Compiling datafusion-functions v37.1.0
   Compiling datafusion-optimizer v37.1.0
   Compiling datafusion-physical-plan v37.1.0
   Compiling datafusion-functions-array v37.1.0
   Compiling datafusion v37.1.0
   Compiling datafusion-substrait v37.1.0
    Finished dev [unoptimized + debuginfo] target(s) in 1m 02s
⚠  Warning: No compatible platform tag found, using the linux tag instead. You won't be able to upload those wheels to PyPI.
📦 Built wheel for abi3 Python ≥ 3.8 to /home/mike/workspace/datafusion-python/flake/target/wheels/datafusion-37.1.0-cp38-abi3-linux_x86_64.whl

flake on  flake [$!?⇕] is 📦 v37.1.0 via 🐍 v3.11.9 (venv) via 🦀 v1.77.1 via ❄  impure (nix-shell-env) on ☁  (us-east-1) took 1m27s 
❯ pip install .
Processing /home/mike/workspace/datafusion-python/flake
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: pyarrow>=11.0.0 in /nix/store/llp7z339ix0lm7nlrndn1yvdmhskcsyk-rust-toolchain/lib/python3.11/site-packages (from datafusion==37.1.0) (15.0.0)
Requirement already satisfied: numpy<2,>=1.16.6 in /nix/store/llp7z339ix0lm7nlrndn1yvdmhskcsyk-rust-toolchain/lib/python3.11/site-packages (from pyarrow>=11.0.0->datafusion==37.1.0) (1.26.4)
Building wheels for collected packages: datafusion
  Building wheel for datafusion (pyproject.toml) ... done
  Created wheel for datafusion: filename=datafusion-37.1.0-cp38-abi3-linux_x86_64.whl size=17875320 sha256=b417c0d9b4662a0f687359a2e751580d4a2fc0714a02b447cfe6deff11138413
  Stored in directory: /tmp/nix-shell.EXRUlg/pip-ephem-wheel-cache-tbyz75qv/wheels/8e/1b/87/2a5f750e961aa47403189d62f5e1236e0819f25e0a9068babf
Successfully built datafusion
Installing collected packages: datafusion
Successfully installed datafusion-37.1.0

flake on  flake [$!?⇕] is 📦 v37.1.0 via 🐍 v3.11.9 (venv) via 🦀 v1.77.1 via ❄  impure (nix-shell-env) on ☁  (us-east-1) took 9m50s 
❯ python
Python 3.11.9 (main, Apr  2 2024, 08:25:04) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datafusion
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mike/workspace/datafusion-python/flake/datafusion/__init__.py", line 28, in <module>
    from ._internal import (
ModuleNotFoundError: No module named 'datafusion._internal'

w/ a git diff of

❯ git diff
diff --git a/Cargo.toml b/Cargo.toml
index 9da36d7..5bb1f83 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -60,7 +60,7 @@ url = "2.2"
 pyo3-build-config = "0.20.0"

 [lib]
-name = "datafusion_python"
+name = "_internal"
 crate-type = ["cdylib", "rlib"]

 [profile.release]
diff --git a/src/lib.rs b/src/lib.rs
index a696ebf..8ce1b16 100644
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -72,6 +72,7 @@ pub(crate) struct TokioRuntime(tokio::runtime::Runtime);
 /// The higher-level public API is defined in pure python files under the
 /// datafusion directory.
 #[pymodule]
+#[pyo3(name="_internal")]
 fn _internal(py: Python, m: &PyModule) -> PyResult<()> {
     // Register the Tokio Runtime as a module attribute so we can reuse it
     m.add(
Michael-J-Ward commented 1 month ago

Following the recommended solution of nesting python/datafusion and adding tool.maturin.python-source = "python" does seem to work for me.



❯ git diff pyproject.toml 
diff --git a/pyproject.toml b/pyproject.toml
index d353605..24bd296 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -55,6 +55,7 @@ repository = "https://github.com/apache/arrow-datafusion-python"
 profile = "black"

 [tool.maturin]
+python-source = "python"
 module-name = "datafusion._internal"
 include = [
     { path = "Cargo.lock", format = "sdist" }

flake on  flake [$✘!?⇕] is 📦 v37.1.0 via 🐍 v3.11.9 (venv) via 🦀 v1.77.1 via ❄  impure (nix-shell-env) on ☁  (us-east-1) 
❯ ls -l python
total 4
drwxr-xr-x 5 mike users 4096 May 14 08:07 datafusion

flake on  flake [$✘!?⇕] is 📦 v37.1.0 via 🐍 v3.11.9 (venv) via 🦀 v1.77.1 via ❄  impure (nix-shell-env) on ☁  (us-east-1) 
❯ maturin build
⚠  Warning: You specified maturin >=0.15, <0.16 in pyproject.toml under `build-system.requires`, but the current maturin version is 1.4.0
🍹 Building a mixed python/rust project
🔗 Found pyo3 bindings with abi3 support for Python ≥ 3.8
🐍 Not using a specific python interpreter
📡 Using build options features, locked from pyproject.toml
    Finished dev [unoptimized + debuginfo] target(s) in 0.17s
⚠  Warning: No compatible platform tag found, using the linux tag instead. You won't be able to upload those wheels to PyPI.
📦 Built wheel for abi3 Python ≥ 3.8 to /home/mike/workspace/datafusion-python/flake/target/wheels/datafusion-37.1.0-cp38-abi3-linux_x86_64.whl

flake on  flake [$✘!?⇕] is 📦 v37.1.0 via 🐍 v3.11.9 (venv) via 🦀 v1.77.1 via ❄  impure (nix-shell-env) on ☁  (us-east-1) took 20s 
❯ pip install .
Processing /home/mike/workspace/datafusion-python/flake
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: pyarrow>=11.0.0 in /nix/store/llp7z339ix0lm7nlrndn1yvdmhskcsyk-rust-toolchain/lib/python3.11/site-packages (from datafusion==37.1.0) (15.0.0)
Requirement already satisfied: numpy<2,>=1.16.6 in /nix/store/llp7z339ix0lm7nlrndn1yvdmhskcsyk-rust-toolchain/lib/python3.11/site-packages (from pyarrow>=11.0.0->datafusion==37.1.0) (1.26.4)
Building wheels for collected packages: datafusion
  Building wheel for datafusion (pyproject.toml) ... done
  Created wheel for datafusion: filename=datafusion-37.1.0-cp38-abi3-linux_x86_64.whl size=17875125 sha256=921aca268be398c639695a5b851e14b3a19185ce39f2336d736a5f5705841b01
  Stored in directory: /tmp/nix-shell.e0IZLi/pip-ephem-wheel-cache-lzlmoc66/wheels/8e/1b/87/2a5f750e961aa47403189d62f5e1236e0819f25e0a9068babf
Successfully built datafusion
Installing collected packages: datafusion
  Attempting uninstall: datafusion
    Found existing installation: datafusion 37.1.0
    Uninstalling datafusion-37.1.0:
      Successfully uninstalled datafusion-37.1.0
Successfully installed datafusion-37.1.0

flake on  flake [$✘!?⇕] is 📦 v37.1.0 via 🐍 v3.11.9 (venv) via 🦀 v1.77.1 via ❄  impure (nix-shell-env) on ☁  (us-east-1) took 4s 
❯ python 
Python 3.11.9 (main, Apr  2 2024, 08:25:04) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import datafusion
>>> datafusion.__version__
'37.1.0'
samuelcolvin commented 1 month ago

@davidhewitt, you're the maturin expert - can you tell us what to do.

davidhewitt commented 1 month ago

Yes I suggest moving python sources into a subdirectory as @Michael-J-Ward suggests - the ImportError pitfall described in the maturin docs is a common one and the subdirectory is my preferred workaround (we did the same in pydantic-core).

samuelcolvin commented 1 month ago

makes sense. I think this started working because I had run maturin develop before making this change.

davidhewitt commented 1 month ago

In particular the problem is that maturin develop (or pip install -e .) will use the checkout's Python sources, but pip install . will put a copy into site-packages, which then conflicts with the checkout sources if you're running Python from the repository root. (You'll end up importing from the local sources rather than the installed contents in site-packages.)

Michael-J-Ward commented 1 month ago

Gracias for the report @samuelcolvin and the knowledge @davidhewitt.

Superseding PR in #695