Closed: zaneselvans closed this issue 4 years ago.
@zaneselvans Thanks, I'll look into it. I think it's the `duplicate-row` check, which has linear memory usage. Until it's resolved, could you try `validate(..., ignore_checks=['duplicate-row'])`?
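In case it's useful, here is a minimal sketch of that call (assuming a `datapackage.json` source; the exact keyword for skipping checks, `ignore_checks` vs. `skip_checks`, may depend on the goodtables version, so adjust as needed):

```python
from goodtables import validate

# Sketch: validate the datapackage while skipping the duplicate-row check,
# which tracks previously seen rows and so uses memory proportional to the
# number of rows read. The path and the ignore_checks keyword are illustrative.
report = validate(
    "datapackage.json",
    preset="datapackage",
    ignore_checks=["duplicate-row"],
)
print(report["valid"])
```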
Hmm, with `row_limit=1000` I really wouldn't expect memory usage to be an issue, even if we read in 1000 rows from every single table and held them all in memory until the process was complete (that would only be about 1.3 million records). But I can try running it with the duplicate-row check disabled and see what happens. I can also try validating one of the partitioned EPA CEMS resources in isolation and see what happens.
@zaneselvans I should be fully back next week, I think (the holidays in Russia are shifted a bit), and I'll take a look then. The `duplicate-row` check was my best guess, but I haven't tried it yet.
Noting from an in-person conversation: it would be helpful to have a message saying that only the first 10 resources are being checked.
Maybe this is already an option and I'm just not aware of it, but it would also be useful to have a way of specifying which resources should be validated, if not all of them are going to be. For example, PUDL has a bunch of small tables containing codes and IDs that aren't really data, and it would be fine to skip them; it has thousands of individual CEMS resources, where checking a few from different states and years would be enough; and it has some "core" resources that are unique and full of rich data, all of which should definitely be validated. A rough sketch of what that selection might look like follows.
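Just to illustrate the idea (this isn't an existing goodtables option), a caller could filter the resources of a datapackage and validate each selected one on its own, assuming the `datapackage` and `goodtables` Python libraries; the resource names in `wanted` are hypothetical placeholders:

```python
from datapackage import Package
from goodtables import validate

# Hypothetical subset: a few "core" tables plus a single CEMS partition.
wanted = {"plants_eia860", "generation_eia923", "hourly_emissions_epacems_2019_id"}

package = Package("datapackage.json")
reports = {}
for resource in package.resources:
    if resource.name not in wanted:
        continue  # skip code/ID lookup tables and most CEMS partitions
    # Validate each selected resource on its own, against its declared schema.
    reports[resource.name] = validate(
        resource.source,
        schema=resource.descriptor.get("schema"),
        row_limit=1000,
    )

for name, report in reports.items():
    print(name, "valid" if report["valid"] else "INVALID")
```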
I'm merging this issue into the issue about the new goodtables version, where I'm going to handle validation of gigantic datasets: https://github.com/frictionlessdata/goodtables-py/issues/341
Overview
We recently noticed that, by default, goodtables was only validating the first 10 tabular data resources in a tabular datapackage, and we really want it to validate at least a subset of all the tables (currently `row_limit=1000`). I set `table_limit=-1` to have it check all the tables, and then gave it a datapackage with a large number of resources (~1300), most of which represent a single very long table (~1 billion rows) stored as a resource group. Previous ETL runs with ~40 resources (including 12 in the resource group) had succeeded just fine, but in this case, after spending quite a while validating samples of all the resources, it crashed with the following error:

Prior to this error, the process wasn't consuming an inordinate quantity of CPU or memory.
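For reference, the invocation was roughly of this form (a sketch; the datapackage path is a placeholder, and only the `table_limit` and `row_limit` values are taken from the description above):

```python
from goodtables import validate

# Roughly how the validation was invoked: sample up to 1000 rows from
# every table in the datapackage rather than only the first 10 tables.
report = validate(
    "datapackage.json",
    table_limit=-1,   # later also tried 2000 and 100, with the same error
    row_limit=1000,
)
print(report["valid"], report["table-count"], report["error-count"])
```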
Since the -1 value for `table_limit` was just a guess at how to tell it to be unlimited, I switched to `table_limit=2000`, which should have been enough to allow all the tables to be validated, and got the same error. Then I used `table_limit=100` and still got the same error. With `table_limit` removed entirely, the validation succeeds, but I suspect this is just because it isn't encountering any of the tables that are part of the large EPA CEMS resource group. In addition, the CSVs for those tabular data resources are gzipped.

Please preserve this line to notify @roll (lead of this repository)