datafusion-contrib / datafusion-python

Python binding for DataFusion
https://arrow.apache.org/datafusion/python/index.html
Apache License 2.0

Unable to build #18

Closed mmuru closed 2 years ago

mmuru commented 2 years ago

On macOS with Python 3.7.

After cloning the project, I tried to build it, but it failed with the following error:

maturin develop
🍹 Building a mixed python/rust project
💥 maturin failed
  Caused by: Cargo metadata failed. Does your crate compile with `cargo build`?
  Caused by: `cargo metadata` exited with an error:     Updating crates.io index
error: failed to select a version for the requirement `datafusion = "=6.0.0"`
candidate versions found which didn't match: 5.0.0, 4.0.0, 3.0.0, ...
location searched: crates.io index

Please let me know how to fix the build issue.

@houqp: I noticed that README.md points to the previous repo and should be updated.

mmuru commented 2 years ago

@houqp & @Jimexist: Can you help me to unblock this build issue? Thanks.

houqp commented 2 years ago

Do you have a local path override for the datafusion dependency? Version 6.x is available on crates.io: https://crates.io/crates/datafusion/versions

Good catch on the README, and yes, we should get that fixed. PRs welcome :)

mmuru commented 2 years ago

@houqp: Actually, the issue was that the rustc version must be > 1.56.1. I had to upgrade rustc to the latest version (1.58.1), and afterwards I was able to build the datafusion-python package. Sure, I will create a PR for the documentation fix.
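
For anyone hitting the same error, here is a minimal sketch of a toolchain check. It assumes the threshold reported above (rustc newer than 1.56.1) is the relevant one. The helper names `parse_rustc_version` and `is_new_enough` are hypothetical and not part of datafusion-python or maturin; the program only shells out to `rustc --version` and compares the result.

```rust
use std::process::Command;

/// Threshold taken from this thread: the build was reported to work only
/// with a rustc newer than 1.56.1. Treat it as an assumption, not a spec.
const REPORTED_TOO_OLD: (u32, u32, u32) = (1, 56, 1);

/// Parse `(major, minor, patch)` from `rustc --version` output,
/// which looks like "rustc 1.58.1 (<hash> <date>)".
fn parse_rustc_version(output: &str) -> Option<(u32, u32, u32)> {
    // The second whitespace-separated token is the version string.
    let version = output.split_whitespace().nth(1)?;
    // Drop any pre-release suffix such as "-nightly" or "-beta.3".
    let core = version.split('-').next()?;
    let mut parts = core.split('.').map(|p| p.parse::<u32>().ok());
    Some((parts.next()??, parts.next()??, parts.next()??))
}

/// Tuples compare lexicographically, so this is a plain version comparison.
fn is_new_enough(version: (u32, u32, u32)) -> bool {
    version > REPORTED_TOO_OLD
}

fn main() {
    match Command::new("rustc").arg("--version").output() {
        Ok(out) => {
            let text = String::from_utf8_lossy(&out.stdout);
            match parse_rustc_version(&text) {
                Some(v) if is_new_enough(v) => {
                    println!("rustc {}.{}.{} is newer than 1.56.1", v.0, v.1, v.2)
                }
                Some(v) => println!(
                    "rustc {}.{}.{} may be too old; try `rustup update stable`",
                    v.0, v.1, v.2
                ),
                None => println!("could not parse version from: {}", text.trim()),
            }
        }
        Err(e) => eprintln!("could not run rustc: {}", e),
    }
}
```

If the check reports an old toolchain, running `rustup update stable` and then retrying `maturin develop` mirrors the fix described in this comment.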

We noticed that the df.collect method is slow, and I would like to discuss it with you. Do you have an email address where I could reach you?

houqp commented 2 years ago

@mmuru for performance-related issues, it's best to post reproducible sample code in the apache/arrow-datafusion repo and tag me, so other people from the community can jump in to help as well.

mmuru commented 2 years ago

@houqp: Thanks. It was related to the Python binding's collect() in dataframe.rs, so I thought I would ask here, but I will post the issue in apache/arrow-datafusion. Ideally, it should return PyResult<Vec>.

fn collect(&self, py: Python) -> PyResult<Vec<PyObject>> {
    let batches = wait_for_future(py, self.df.collect())?;
    // cannot use PyResult<Vec<RecordBatch>> return type due to
    // https://github.com/PyO3/pyo3/issues/1813
    batches.into_iter().map(|rb| rb.to_pyarrow(py)).collect()
}

houqp commented 2 years ago

@mmuru are you able to reproduce the performance issue using just Rust code?

messense commented 2 years ago

> Actually, the issue was rustc version must be > 1.56.1. I had to upgrade rustc version to latest (1.58.1) and afterwards I was able to build datafusion-python package. Sure, I will create a PR for documentation fix.

The error message from maturin is quite confusing and needs improvement. I've opened https://github.com/PyO3/maturin/issues/787 to track this.

jimexist commented 2 years ago

Since this isn't an issue with this crate, closing this.