datafusion-contrib / datafusion-orc

Implementation of the Apache ORC file format using the Apache Arrow in-memory format
Apache License 2.0
28 stars, 8 forks

Generate expected data for integration tests as feather files #73

Closed (Jefffrey closed this 3 months ago)

Jefffrey commented 3 months ago

Relates to #66

Use PyArrow to read ORC files and write the data as Arrow feather files. This is to have more robust equality checks instead of relying on JSON (which needs to be parsed back to Arrow first).

Generating the expected files is a one-off activity; the relevant script is included.
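For readers following along, the conversion described above could look roughly like the sketch below. This is not the script included in the PR; the function names (`feather_path_for`, `convert_orc_to_feather`) and the directory layout are hypothetical, and it assumes `pyarrow` is installed with ORC support.

```python
from pathlib import Path


def feather_path_for(orc_path: Path, out_dir: Path) -> Path:
    """Map e.g. data/test1.orc to <out_dir>/test1.feather (hypothetical layout)."""
    return out_dir / orc_path.with_suffix(".feather").name


def convert_orc_to_feather(orc_path: Path, out_dir: Path) -> Path:
    """Read one ORC file with PyArrow and write it back out as a Feather file."""
    # Imported lazily so the path helper above works without pyarrow installed.
    import pyarrow.feather as feather
    from pyarrow import orc

    table = orc.read_table(str(orc_path))
    dest = feather_path_for(orc_path, out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    feather.write_feather(table, str(dest))
    return dest
```

A test can then load the Feather file straight into Arrow and compare it with the reader's output, with no JSON parsing step in between.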

progval commented 3 months ago

Would it make sense to make build.rs run the Python script, so .feather files don't have to be committed to Git?

Jefffrey commented 3 months ago

> Would it make sense to make build.rs run the Python script, so .feather files don't have to be committed to Git?

Hmm that's a good point, I didn't consider that.

One caveat is that we'd need to run it in a Python venv or use a Docker container to handle the pyarrow package requirement in a robust manner.
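As a rough illustration of the venv option, a setup step might look like the sketch below. It is only one possible approach under stated assumptions: the helper names and the `.venv` location are hypothetical, nothing here comes from the repository, and installing `pyarrow` requires network access.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(venv_dir: Path) -> Path:
    """Path to the interpreter inside the venv (layout differs on Windows)."""
    if sys.platform == "win32":
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def pip_install_command(venv_dir: Path, package: str = "pyarrow") -> list[str]:
    """Build the command that installs a package into the venv."""
    return [str(venv_python(venv_dir)), "-m", "pip", "install", package]


def prepare_env(venv_dir: Path) -> None:
    """Create an isolated venv with pip, then install pyarrow into it."""
    venv.create(venv_dir, with_pip=True)
    subprocess.run(pip_install_command(venv_dir), check=True)
```

Calling `prepare_env(Path(".venv"))` before the generation script keeps the pyarrow dependency out of the system Python; a Docker image would achieve the same isolation at the cost of a heavier toolchain.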

Jefffrey commented 3 months ago

Created an issue for the above: #74