datafusion-contrib / datafusion-orc

Implementation of the Apache ORC file format using the Apache Arrow in-memory format
Apache License 2.0

Rigorous ORC integration tests #66

Closed: Jefffrey closed this issue 3 months ago

Jefffrey commented 3 months ago

Integration tests were added in https://github.com/datafusion-contrib/datafusion-orc/pull/65

However, we have to compare actual vs expected data in JSON format, since that is how the expected data is encoded in the Apache ORC repo.
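To illustrate what a JSON comparison can miss, here is a minimal stdlib-only sketch. It is not taken from the test suite; the values are made up for illustration. JSON has a single number type, so details such as decimal scale or integer width are not preserved once the data is decoded:

```python
import json
from decimal import Decimal

# Two values that a typed columnar format would keep apart:
# the same number stored as a decimal with scale 2 and with scale 1.
decimal_scale_2 = Decimal("1.10")
decimal_scale_1 = Decimal("1.1")

# Decoded from JSON with the default decoder, both become the same float.
json_a = json.loads("1.10")
json_b = json.loads("1.1")

# The typed representations differ (digits and exponent are not the same) ...
typed_values_differ = decimal_scale_2.as_tuple() != decimal_scale_1.as_tuple()
# ... but the JSON-decoded values are indistinguishable.
json_values_equal = json_a == json_b and type(json_a) is float

print(typed_values_differ, json_values_equal)
```

The same applies to integer widths: a JSON `1` does not say whether the column was an 8-bit or a 64-bit integer.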

An alternative would be to use the pyarrow/Arrow ORC implementation to write the expected data to Parquet or Arrow IPC files, which would allow a more rigorous comparison than JSON.

We lose some visibility into the expected data, but since these are integration tests with data from the Apache ORC repo, they wouldn't change often (if at all) anyway.

Jefffrey commented 3 months ago

Almost done, just want to optimize this code:

https://github.com/datafusion-contrib/datafusion-orc/blob/fd23fdb61599dd52753d70fc808babd289e5c422/tests/integration/main.rs#L27-L38

It causes a major slowdown in the zlib test.

Jefffrey commented 3 months ago

https://github.com/datafusion-contrib/datafusion-orc/commit/0405e23a291ead841353a182aab1338bd7b0c8cf

This commit concatenates the Vec of RecordBatches into a single RecordBatch for easier comparison.

Had to disable 2 other tests due to some schema issues, but will work on that separately. Closing this issue.