feat(table/scanner): Initial pass for planning a scan and returning the files to use

zeroshade commented 4 months ago

A very rough initial implementation of metrics evaluation, plus a simple scanner for Tables that produces the list of FileScanTasks needed to perform a scan, along with the associated positional delete files and so on.

This also includes a framework and setup for integration testing, adapted from the approach used in pyiceberg. It creates docker images and adds a file of tests that are only executed when the integration tag is set. A new workflow sets that tag and runs those tests.

This provides an end-to-end case of using a table and row-filter-expression to perform manifest and metrics evaluations to create the plan for scanning. The next step would be actually fetching the data!
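To illustrate the idea behind metrics evaluation, here is a minimal sketch of pruning data files by their column bounds for a filter of the form `col > literal`. The names `fileStats`, `mightContainGreaterThan`, and `planScan` are hypothetical and are not the types or functions added in this PR. The real evaluator works on full expressions and per-column metrics from the manifests.

```go
package main

import "fmt"

// fileStats is a hypothetical stand-in for the per-file column metrics
// recorded in a manifest entry. It is not the iceberg-go type.
type fileStats struct {
	path     string
	min, max int64 // lower and upper bounds of a single int64 column
}

// mightContainGreaterThan reports whether the file could hold any row
// whose column value is strictly greater than lit. If the upper bound
// is not above lit, no row can match and the file can be skipped.
func mightContainGreaterThan(f fileStats, lit int64) bool {
	return f.max > lit
}

// planScan keeps only the files whose metrics cannot rule out a match.
func planScan(files []fileStats, lit int64) []string {
	var tasks []string
	for _, f := range files {
		if mightContainGreaterThan(f, lit) {
			tasks = append(tasks, f.path)
		}
	}
	return tasks
}

func main() {
	files := []fileStats{
		{"a.parquet", 0, 10},
		{"b.parquet", 11, 20},
		{"c.parquet", 21, 30},
	}
	// Filter: col > 15. Only files whose upper bound exceeds 15 remain.
	fmt.Println(planScan(files, 15))
}
```

The check is deliberately conservative: a file that passes may still contain no matching rows, but a file that is dropped is guaranteed to contain none.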

zeroshade commented 3 months ago

@Fokko @nastra This should be ready for review now, though there's an odd discrepancy in the number of data files created for one of the integration testing tables on the CI here versus when I run the docker compose and provisioning locally. I don't know enough about spark-iceberg internals to know whether that is a quirk, expected behavior, or something I should change the tests for. Any ideas?

I've added a comment in scanner_test.go referencing the weirdness. You can also look at the failed CI runs for examples.

zeroshade commented 3 months ago

@nastra Any further comments?

nastra commented 3 months ago

Thanks for the patience here @zeroshade. I'll do a full review in the next 2-3 days.

nastra commented 3 months ago

@zeroshade could you please rebase this one now that all the other PRs are merged?

zeroshade commented 3 months ago

@nastra All rebased already :smile:

apache / iceberg-go

feat(table/scanner): Initial pass for planning a scan and returning the files to use #118