catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Validating full epacems with goodtables_pandas runs out of memory #884

Closed · zaneselvans closed this issue 3 years ago

zaneselvans commented 3 years ago

At the end of the ETL process, after the (compressed, partitioned) tabular data package for epacems has been output, we attempt to validate it using @ezwelty's goodtables_pandas library. However, if you process a significant subset of the available states and years on a single machine, you'll probably run out of memory, since the hourly_emissions_epacems table has almost a billion rows in it. We need to either validate only a sample, skip the validation entirely, or come up with some way to serialize the validation when running on a single machine.

It seems like something that could be done with dask if we wanted. But also it would be easy to just skip it. @rousik how does this end up working in the prefect & dask setup? Are the subsets of the data package validated separately on their own nodes?
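
For reference, a rough sketch of what an out-of-core uniqueness check with dask might look like. The file glob and primary-key columns below are assumptions for illustration, not the actual datapackage layout:

```python
import dask.dataframe as dd

# Hypothetical partition layout and key columns; adjust to the real
# datapackage paths and the table's actual primary key.
ddf = dd.read_csv(
    "datapkg/data/hourly_emissions_epacems-*.csv.gz",
    compression="gzip",
    blocksize=None,  # gzipped CSVs can't be split into blocks
)
pk_cols = ["plant_id_eia", "unitid", "operating_datetime_utc"]

# Compare total row count against the count of distinct key combinations,
# without ever materializing the whole table in memory at once.
n_rows, n_unique = dd.compute(
    ddf.shape[0],
    ddf[pk_cols].drop_duplicates().shape[0],
)
assert n_rows == n_unique, "duplicate primary keys in hourly_emissions_epacems"
```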

rousik commented 3 years ago

Is it possible to validate the partitions independently? I.e., can we guarantee that as long as each partition passes, the full table is also valid?

zaneselvans commented 3 years ago

In the general case I think the answer is no: if you're verifying that a table's primary key values are unique, for example, you need to see the whole table. Though maybe that particular check is built into the pandas index somehow? The types of validation that happen are enumerated over here: https://github.com/ezwelty/goodtables-pandas-py
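
A tiny illustration of why per-chunk checks aren't sufficient for primary-key uniqueness in the general case:

```python
import pandas as pd

# Each chunk is internally unique, but they share a key value, so
# validating the chunks independently would miss the duplicate.
chunk_a = pd.DataFrame({"id": [1, 2, 3]})
chunk_b = pd.DataFrame({"id": [3, 4, 5]})

assert chunk_a["id"].is_unique and chunk_b["id"].is_unique
assert not pd.concat([chunk_a, chunk_b])["id"].is_unique
```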

ezwelty commented 3 years ago

That table has an auto-incremented primary key, so it could be chopped into pieces and validated in chunks. If the file is already partitioned (into multiple csv files?), then this should be possible by overloading the path attribute in the resource descriptor with each partition's path.
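
A hedged sketch of that idea, assuming the datapackage descriptor is loaded as a dict and that goodtables_pandas.validate() accepts an in-memory descriptor (check the actual API before relying on this):

```python
import copy

import goodtables_pandas as gtp


def validate_by_partition(descriptor, resource_name):
    """Validate one partition path at a time to keep memory bounded."""
    resource = next(r for r in descriptor["resources"] if r["name"] == resource_name)
    paths = resource["path"]
    if isinstance(paths, str):
        paths = [paths]
    reports = []
    for path in paths:
        partial = copy.deepcopy(descriptor)
        part_res = next(
            r for r in partial["resources"] if r["name"] == resource_name
        )
        part_res["path"] = path  # point the resource at a single partition
        reports.append(gtp.validate(partial))
    return reports
```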

zaneselvans commented 3 years ago

Is the unique key the only kind of validation that requires some kind of memory of the other chunks?

I believe each year-state combo is a different resource within a "resource group" (as opposed to being literally concatenatable files), and those resources are used to generate a single table. Maybe this isn't actually a problem? Based on issue #364 (from back when we were working on all this with @roll and @frictionlessdata), it looks like merge_groups only comes into play when you try to load the package into e.g. SQL. I ran a test ETL last night with a dozen states for CEMS, and I just saw my memory usage growing roughly linearly during the validation step. It happened to complete just before running out of memory... so maybe it won't actually crash, and it's just Python not garbage collecting?

ezwelty commented 3 years ago

> Is the unique key the only kind of validation that requires some kind of memory of the other chunks?

Theoretically, only uniqueness and foreign keys. In practice, goodtables_pandas.validate as written is optimized for speed but not memory use. It currently reads in and parses all tables, then moves on to checking foreign keys. An obvious improvement I could make would be to only store foreign keys as needed, and only store whole tables if return_tables=True.
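
A minimal sketch of that memory-lean approach (paths, column names, and helper names here are illustrative, not the library's API): read each table in chunks, retain only the key columns needed for later foreign-key checks, and compare small key frames instead of whole tables.

```python
import pandas as pd


def collect_keys(csv_path, key_cols, chunksize=1_000_000):
    """Read only the key columns, in chunks, and keep the distinct values."""
    keys = (
        chunk.drop_duplicates()
        for chunk in pd.read_csv(csv_path, usecols=key_cols, chunksize=chunksize)
    )
    return pd.concat(keys).drop_duplicates()


def missing_foreign_keys(child_keys, parent_keys, cols):
    """Return child key combinations that have no match in the parent table."""
    merged = child_keys.merge(
        parent_keys[cols].drop_duplicates(), on=cols, how="left", indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", cols]
```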

rousik commented 3 years ago

I suppose that as a stop-gap solution we could consider doing validation on a sampled subset of epacems data.
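
Something like the following could implement that stop-gap, assuming the state-year partitions can be enumerated as paths (the names and sample size are illustrative):

```python
import random


def sample_partitions(partition_paths, k=10, seed=42):
    """Pick a reproducible random subset of state-year partitions to validate."""
    paths = list(partition_paths)
    rng = random.Random(seed)
    return rng.sample(paths, min(k, len(paths)))
```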

rousik commented 3 years ago

Now that we are writing the epacems outputs directly to Parquet, this is no longer an issue, since the epacems tables are no longer included in the data packages.