apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Improve/expand functional unit tests #42376

Closed. asfimport closed this issue 8 years ago

asfimport commented 8 years ago

We need to add a testing framework for unit tests, and run it as a part of each Travis CI build.

Reporter: Aliaksei Sandryhaila / @asandryh Assignee: Aliaksei Sandryhaila / @asandryh

Note: This issue was originally created as PARQUET-479. Please see the migration documentation for further details.

asfimport commented 8 years ago

Wes McKinney / @wesm: Can you explain how you envision regression testing fitting into the development workflow, as compared with functional unit tests (verifying correctness)?

asfimport commented 8 years ago

Aliaksei Sandryhaila / @asandryh: Since we do not have writing functionality yet, the first iteration of tests will use pre-generated parquet files in /data/ (existing and new files).

asfimport commented 8 years ago

Aliaksei Sandryhaila / @asandryh: In our case, regression testing will consist of running all functional unit tests on each modification. This will ensure that we do not break functionality that is already implemented and presumably correct.

asfimport commented 8 years ago

Wes McKinney / @wesm: How is this different from just running the test suite with ctest? That is already part of the Travis CI build script.

asfimport commented 8 years ago

Aliaksei Sandryhaila / @asandryh: Ah, I missed that you already added it to .travis.yml a few days ago.

asfimport commented 8 years ago

Wes McKinney / @wesm: This JIRA does not have a well-defined scope. Almost all patches need to be accompanied by unit tests. The problem right now is that we need a way to generate test data using parquet-mr (or some other tool) so that tests can be written for reader functionality before parquet-cpp has write capability. Another option is to mock out details of the file format (e.g. data and dictionary pages) and write tests that way, starting with the value encoders and decoders so we know we can generate data pages in memory.
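
To make the "start with the value encoders and decoders" idea concrete, here is a minimal sketch of a round-trip check for INT32 values in Parquet's PLAIN encoding (4 bytes per value, little-endian). The function names `EncodePlainInt32`, `DecodePlainInt32`, and `RoundTripsPlainInt32` are hypothetical and written only for illustration; they are not parquet-cpp APIs, and a real test would call the library's own encoder and decoder classes instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not a parquet-cpp API): lays out int32 values as
// PLAIN-encoded bytes, i.e. 4 bytes per value in little-endian order.
std::vector<uint8_t> EncodePlainInt32(const std::vector<int32_t>& values) {
  std::vector<uint8_t> out;
  out.reserve(values.size() * 4);
  for (int32_t v : values) {
    uint32_t u = 0;
    std::memcpy(&u, &v, sizeof(u));  // reinterpret the bits without UB
    for (int b = 0; b < 4; ++b) {
      out.push_back(static_cast<uint8_t>((u >> (8 * b)) & 0xFF));
    }
  }
  return out;
}

// Hypothetical helper: the inverse of EncodePlainInt32.
std::vector<int32_t> DecodePlainInt32(const std::vector<uint8_t>& bytes) {
  assert(bytes.size() % 4 == 0);  // a PLAIN int32 buffer is a multiple of 4 bytes
  std::vector<int32_t> out;
  out.reserve(bytes.size() / 4);
  for (std::size_t i = 0; i < bytes.size(); i += 4) {
    uint32_t u = 0;
    for (int b = 0; b < 4; ++b) {
      u |= static_cast<uint32_t>(bytes[i + b]) << (8 * b);
    }
    int32_t v = 0;
    std::memcpy(&v, &u, sizeof(v));
    out.push_back(v);
  }
  return out;
}

// The property a unit test would assert: decode(encode(x)) == x.
bool RoundTripsPlainInt32(const std::vector<int32_t>& values) {
  return DecodePlainInt32(EncodePlainInt32(values)) == values;
}
```

A test case would then assert `RoundTripsPlainInt32(...)` on boundary values such as zero, negative numbers, and the int32 limits, with no Parquet file on disk involved.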

asfimport commented 8 years ago

Aliaksei Sandryhaila / @asandryh: So far this JIRA is a bit vague because its first objective is to discuss and decide on the testing setup. :) Just to be clear: by "generate test data using parquet-mr," do you mean doing this offline and adding the files to the repository, e.g. to the /data directory?

asfimport commented 8 years ago

Wes McKinney / @wesm: I definitely don't want to bloat the git repo. So if we go that route, either we would host test data files outside of the main git repo or have a data generation script that creates them from scratch locally. parquet-mr probably never had to face this issue because it was the proverbial chicken.

My preference would be to focus on testing round-tripping data from the ground up, but for that I also need to be able to write Parquet files =) It might be useful to have some "smoke tests" that use external pre-generated data files, but it doesn't feel like a scalable solution (e.g. a bug fix may require generating exactly the right file to reproduce the bug).

asfimport commented 8 years ago

Aliaksei Sandryhaila / @asandryh: IMHO, it's not a big issue to add a few parquet files to /data for the time being. As soon as we can write, we'll remove these files and update the corresponding tests.

asfimport commented 8 years ago

Wes McKinney / @wesm: This is fine with me, as long as we don't exceed a few megabytes. My priority will definitely be to have test fixtures ASAP that enable data to be round-tripped to in-memory buffers without having to assemble a fully formed file – for the purposes of verifying column reading you only need to be able to generate the different encoded page types.
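
A rough sketch of such a fixture, assuming only the payload of a data page needs to be produced: PLAIN-encoded BYTE_ARRAY values are stored as a 4-byte little-endian length followed by the raw bytes, so a test can build that buffer in memory and read it back. `MakePlainByteArrayPage` and `ReadPlainByteArrayPage` are hypothetical names for illustration, not parquet-cpp functions, and page headers, compression, and definition/repetition levels are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical fixture: builds the value bytes of a PLAIN BYTE_ARRAY page
// in memory. Each value is a 4-byte little-endian length plus its bytes.
std::vector<uint8_t> MakePlainByteArrayPage(
    const std::vector<std::string>& values) {
  std::vector<uint8_t> page;
  for (const std::string& s : values) {
    const uint32_t len = static_cast<uint32_t>(s.size());
    for (int b = 0; b < 4; ++b) {
      page.push_back(static_cast<uint8_t>((len >> (8 * b)) & 0xFF));
    }
    page.insert(page.end(), s.begin(), s.end());
  }
  return page;
}

// Hypothetical stand-in for the column reader under test: walks the buffer
// and reconstructs the values.
std::vector<std::string> ReadPlainByteArrayPage(
    const std::vector<uint8_t>& page) {
  std::vector<std::string> out;
  std::size_t pos = 0;
  while (pos < page.size()) {
    assert(page.size() - pos >= 4);  // room for the length prefix
    uint32_t len = 0;
    for (int b = 0; b < 4; ++b) {
      len |= static_cast<uint32_t>(page[pos + b]) << (8 * b);
    }
    pos += 4;
    assert(page.size() - pos >= len);  // room for the value bytes
    const auto first = page.begin() + static_cast<std::ptrdiff_t>(pos);
    out.emplace_back(first, first + static_cast<std::ptrdiff_t>(len));
    pos += len;
  }
  return out;
}
```

The same pattern would extend to other page types: one small generator per encoding, each feeding the reader an in-memory buffer rather than a checked-in file.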

asfimport commented 8 years ago

Wes McKinney / @wesm: I thought some more about this, and I'm not supportive of checking in more test data files until we've improved our ability to unit test the existing code (http://martinfowler.com/bliki/TestPyramid.html). Let's take the discussion to the mailing list thread about this, and as we identify well-defined tasks to improve the test infrastructure we can create new JIRAs.

asfimport commented 8 years ago

Aliaksei Sandryhaila / @asandryh: This is not an issue, but rather a discussion on functional and integration tests. It has been moved to https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit#.