Is it possible to validate the partitions independently? I.e., can we guarantee that as long as each partition passes, the full table is also valid?
On Thu, Jan 14, 2021, 21:59 Zane Selvans notifications@github.com wrote:
At the end of the ETL process, after the (compressed, partitioned) tabular data package has been output, we attempt to validate it using @ezwelty's (https://github.com/ezwelty) goodtables_pandas library. However, if you process a significant subset of the available states and years on a single machine, you'll probably run out of memory, since the hourly_emissions_epacems table has almost a billion rows in it. We need to either only validate a sample, or skip the validation, or come up with some way to serialize it when run on a single machine.
It seems like something that could be done with dask if we wanted. But it would also be easy to just skip it. @rousik (https://github.com/rousik) how does this end up working in the prefect & dask setup? Are the subsets of the data package validated separately on their own nodes?
In the general case I think the answer is no, like if you're verifying that a table's primary keys are unique. Though maybe that particular one is built into the pandas index somehow? The types of validation that happen are enumerated over here: https://github.com/ezwelty/goodtables-pandas-py
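To make the cross-partition problem concrete, here's a minimal pandas sketch (with made-up IDs and column names) showing why per-partition checks can't catch a duplicate primary key that spans two partitions:

```python
import pandas as pd

# Two partitions (e.g. separate state-year files) that are each internally
# unique on their primary key column...
part_a = pd.DataFrame({"id": [1, 2, 3], "so2_mass": [0.1, 0.2, 0.3]})
part_b = pd.DataFrame({"id": [3, 4, 5], "so2_mass": [0.4, 0.5, 0.6]})
assert part_a["id"].is_unique and part_b["id"].is_unique

# ...but concatenating them violates uniqueness, so the primary key check
# has to see all partitions (or at least all of their key values) together.
full = pd.concat([part_a, part_b], ignore_index=True)
assert not full["id"].is_unique
```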
That table has an auto-incremented primary key, so it could be chopped into pieces and validated in chunks. If the file is already partitioned (into multiple csv files?), then this should be possible by overloading the `path` attribute in the resource schema for each partition path.
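A rough sketch of what that might look like, assuming the resource's `path` is a list of partition files and that `goodtables_pandas.validate()` accepts a datapackage descriptor; the exact call signature and report structure are assumptions, so check the library's README before relying on this:

```python
import copy
import json

import goodtables_pandas as gt

with open("datapackage.json") as f:
    descriptor = json.load(f)

for resource in descriptor["resources"]:
    if resource["name"] != "hourly_emissions_epacems":
        continue
    paths = resource["path"]
    if not isinstance(paths, list):
        paths = [paths]
    for path in paths:
        # Build a one-resource descriptor pointing at a single partition file.
        partial = copy.deepcopy(descriptor)
        partial["resources"] = [dict(resource, path=path)]
        report = gt.validate(partial)  # assumed signature; see goodtables-pandas-py docs
        print(path, report)
```

As discussed above, this only covers per-partition checks; uniqueness and foreign keys would still need a pass over all partitions' key values.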
Is the unique key the only kind of validation that requires some kind of memory of the other chunks?
I believe each year-state combo is a different resource within a "resource group" (as opposed to being literally concatenatable files) and those resources are used to generate a single table. Maybe this isn't actually a problem? It looks like `merge_groups` only comes into play when you try to load the package into e.g. SQL, based on issue #364 (from back when we were working on all this with @roll and @frictionlessdata).

I ran a test ETL last night with a dozen states for CEMS and I just saw my memory usage growing roughly linearly during the validation step. It happened to complete just before running out of memory... so maybe it won't actually crash? And it's just Python not garbage collecting?
> Is the unique key the only kind of validation that requires some kind of memory of the other chunks?
Theoretically, only uniqueness and foreign keys. In practice, `goodtables_pandas.validate` as written is optimized for speed but not memory use. It currently reads in and parses all tables, then moves on to checking foreign keys. An obvious improvement I could make would be to only store foreign keys as needed, and only store whole tables if `return_tables=True`.
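As a plain-pandas illustration of that improvement (not the library's actual internals, and with hypothetical function and column names), retaining only the key columns while streaming each partition keeps the memory footprint proportional to the keys rather than the whole table:

```python
import pandas as pd

def primary_key_is_unique(paths, key_cols, chunksize=1_000_000):
    """Check global primary-key uniqueness while only ever holding key columns."""
    keys = []
    for path in paths:
        # usecols drops everything except the key columns as each chunk is read.
        for chunk in pd.read_csv(path, usecols=key_cols, chunksize=chunksize):
            keys.append(chunk)
    all_keys = pd.concat(keys, ignore_index=True)
    return not all_keys.duplicated().any()
```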
I suppose that as a stop-gap solution we could consider doing validation on a sampled subset of epacems data.
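A hedged sketch of that stop-gap, sampling a handful of partition files instead of validating all of them (file names here are illustrative, not the real PUDL output layout):

```python
import random

# Illustrative partition names only; the real epacems outputs are partitioned
# by state and year but not necessarily with these file names.
partitions = [f"hourly_emissions_epacems-{year}-{state}.csv"
              for year in range(2018, 2021) for state in ("CO", "ID", "TX")]
sample = random.sample(partitions, k=min(3, len(partitions)))
for path in sample:
    print("would validate:", path)  # feed into the per-partition validation sketch above
```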
Now that we are emitting epacems files directly to parquet, this is no longer an issue, as epacems tables are not included in the data packages anymore.