frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

goodtables crashes when validating a large number of tabular data resources #321

Closed · zaneselvans closed this issue 4 years ago

zaneselvans commented 4 years ago

Overview

We recently noticed that, by default, goodtables was only validating the first 10 tabular data resources in a tabular datapackage, and we really want it to validate at least a sample of every table (we currently use row_limit=1000). I set table_limit=-1 to have it check all the tables, and then gave it a datapackage with a large number of resources (~1300), most of which represent a single very long table (~1 billion rows) stored as a resource group. Previous ETL runs with ~40 resources (including 12 in the resource group) had succeeded just fine, but in this case, after spending quite a while validating samples of all the resources, it crashed with the following error:

Traceback (most recent call last):
  File "/home/zane/miniconda3/envs/pudl-dev/bin/pudl_etl", line 11, in <module>
  File "/home/zane/code/catalyst/pudl/src/pudl/cli.py", line 110, in main
  File "/home/zane/code/catalyst/pudl/src/pudl/etl.py", line 817, in generate_datapkg_bundle
  File "/home/zane/code/catalyst/pudl/src/pudl/load/metadata.py", line 709, in generate_metadata
  File "/home/zane/code/catalyst/pudl/src/pudl/load/metadata.py", line 576, in validate_save_datapkg
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/site-packages/goodtables/validate.py", line 80, in validate
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/site-packages/goodtables/inspector.py", line 82, in inspect
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/multiprocessing/pool.py", line 657, in get
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/multiprocessing/pool.py", line 121, in worker
  File "/home/zane/miniconda3/envs/pudl-dev/lib/python3.7/site-packages/goodtables/inspector.py", line 222, in __inspect_table
UnboundLocalError: local variable 'sample' referenced before assignment
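
For context, this error class points at a code path where a variable is bound only on the happy path but then read unconditionally. A minimal, self-contained sketch of that general pattern (hypothetical names, not the actual inspector.py code):

```python
def inspect_table(open_table):
    try:
        sample = open_table()   # 'sample' is only bound if opening succeeds
    except Exception:
        pass                    # failure swallowed; 'sample' stays unbound
    return len(sample)          # UnboundLocalError when the except branch ran


def broken_open():
    raise IOError("cannot open resource")


inspect_table(broken_open)  # raises UnboundLocalError, as in the traceback above
```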

Prior to this error, the process wasn't consuming an inordinate quantity of CPU or memory resources.

Since the -1 value for table_limit was just a guess at how to tell it to be unlimited, I switched to table_limit=2000, which should have been more than enough to cover all the tables, and got the same error. Then I tried table_limit=100 and still got the same error. With table_limit removed entirely, the validation succeeds, but I suspect that's only because it never reaches any of the tables in the large EPA CEMS resource group. In addition, the CSVs for those tabular data resources are gzipped.
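
For reference, a minimal sketch of roughly how these validation calls look with the goodtables Python API; the datapackage path is hypothetical, and the report keys follow the standard goodtables report format:

```python
from goodtables import validate

# Sample up to 1000 rows from every resource rather than only the first 10
# tables (the default table_limit). -1 was a guess at "unlimited"; 2000 and
# 100 crash the same way, while omitting table_limit entirely succeeds.
report = validate(
    'datapackage.json',      # hypothetical path to the PUDL datapackage descriptor
    preset='datapackage',
    table_limit=-1,
    row_limit=1000,
)
print(report['valid'], report['error-count'])
```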


Please preserve this line to notify @roll (lead of this repository)

roll commented 4 years ago

@zaneselvans Thanks, I'll look into it.

I think the culprit is the duplicate-row check, which has linear memory usage. Until it's resolved, could you try validate(..., ignore_checks=['duplicate-row'])?
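
Applied to the call above, the suggested workaround might look like the following sketch; the path is hypothetical and the keyword for skipping the check is copied verbatim from the suggestion (it may differ between goodtables versions):

```python
from goodtables import validate

# Disable the duplicate-row check, which reportedly has linear memory usage,
# while keeping the same sampling limits as before.
report = validate(
    'datapackage.json',                 # hypothetical path
    preset='datapackage',
    table_limit=-1,
    row_limit=1000,
    ignore_checks=['duplicate-row'],    # keyword copied from the suggestion above
)
print(report['valid'])
```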

zaneselvans commented 4 years ago

Hmm, with row_limit=1000 I really wouldn't expect memory usage to be an issue, even if we read 1000 rows from every single table and held them all in memory until the process was complete (that would only be about 1.3 million records). But I can try running it with the duplicate-row check disabled and see what happens. I can also try validating one of the partitioned EPA CEMS resources in isolation.
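
A sketch of that second experiment, validating a single partitioned CEMS resource on its own; the file name is hypothetical, and it assumes goodtables can read the gzipped CSV directly:

```python
from goodtables import validate

# Validate one gzipped CEMS CSV in isolation (hypothetical file name) to see
# whether the crash is tied to the large resource group.
report = validate('epacems-2018-co.csv.gz', row_limit=1000)
print(report['valid'])
```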

roll commented 4 years ago

@zaneselvans I think I will be fully back from next week (holidays in Russia are shifted a bit) and will take a look. The duplicate-row check was my best guess, but I haven't tried it yet.

lwinfree commented 4 years ago

Noting from an in-person conversation: it would be helpful to have a message saying that only the first 10 resources are being checked.

zaneselvans commented 4 years ago

Maybe this is already an option and I'm just not aware of it, but it could also be useful to have a way of specifying which resources should be validated, if not all of them are going to be. For example, in PUDL there are a bunch of small tables containing codes and IDs that aren't really data, and it would be fine to skip them. There are thousands of individual CEMS resources, and checking just a few from different states and years would be enough. And then there are some "core" resources that are unique and full of rich data, all of which should definitely be validated.

roll commented 4 years ago

I'm merging this issue into the issue tracking the new goodtables version, where I'm going to handle validation of gigantic datasets: https://github.com/frictionlessdata/goodtables-py/issues/341