datopian / data-cli

data - command line tool for working with data, Data Packages and the DataHub
http://datahub.io/docs/features/data-cli

Core dump from data validation of larger files #348

Open zaneselvans opened 5 years ago

zaneselvans commented 5 years ago

When attempting to use the CLI to validate my data package using the command:

data validate datapackage.json

I first get a warning about a memory leak:

(node:21287) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 121 end listeners added. Use emitter.setMaxListeners() to increase limit

followed by a core dump an hour or more later. The data package I'm working with can be found on datahub. It currently consists of two tabular data resources. One (mines) contains ~30MB of CSV data; it triggers the memory leak warning but validates successfully in under a minute. The other (employment-production-quarterly) is ~160MB of CSV data and also triggers the memory leak warning; validation then runs for many minutes at ~100-150% of one CPU, with the process's memory footprint growing slowly but continuously (though only to ~10% of available memory), before eventually failing with the following error:

<--- Last few GCs --->

[21287:0x5610d38d7aa0]  2412787 ms: Mark-sweep 2011.0 (2121.7) -> 2011.0 (2091.2) MB, 1581.1 / 0.0 ms  last resort GC in old space requested
[21287:0x5610d38d7aa0]  2414404 ms: Mark-sweep 2011.0 (2091.2) -> 2011.0 (2091.7) MB, 1615.7 / 0.0 ms  last resort GC in old space requested

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x74ce6898fe1 <JSObject>
    1: push(this=0x1115d9486161 <JSArray[1956331]>)
    2: _callee2$ [/home/zane/anaconda3/lib/node_modules/data-cli/node_modules/tableschema/lib/table.js:~469] [pc=0x27385caf7d07](this=0x1115d94826e9 <Table map = 0x224e9de6491>,_context2=0x1115d9482689 <Context map = 0x224e9de1211>)
    3: tryCatch(aka tryCatch) [/home/zane/anaconda3/lib/node_modules/data-cli/node_modules/regenerator-runtime/run...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [node]
 2: 0x5610d228a3b3 [node]
 3: v8::Utils::ReportOOMFailure(char const*, bool) [node]
 4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [node]
 5: v8::internal::Factory::NewUninitializedFixedArray(int) [node]
 6: 0x5610d1e698a5 [node]
 7: 0x5610d1e69a9f [node]
 8: v8::internal::JSObject::AddDataElement(v8::internal::Handle<v8::internal::JSObject>, unsigned int, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow) [node]
 9: v8::internal::Object::AddDataProperty(v8::internal::LookupIterator*, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyAttributes, v8::internal::Object::ShouldThrow, v8::internal::Object::StoreFromKeyed) [node]
10: v8::internal::Object::SetProperty(v8::internal::LookupIterator*, v8::internal::Handle<v8::internal::Object>, v8::internal::LanguageMode, v8::internal::Object::StoreFromKeyed) [node]
11: v8::internal::Runtime::SetObjectProperty(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, v8::internal::LanguageMode) [node]
12: v8::internal::Runtime_SetProperty(int, v8::internal::Object**, v8::internal::Isolate*) [node]
13: 0x27385c8040bd
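
I haven't confirmed whether data-cli picks up NODE_OPTIONS, but raising Node's old-space heap ceiling is a standard workaround that might at least delay the out-of-memory abort (the 8192 MB value below is arbitrary):

NODE_OPTIONS="--max-old-space-size=8192" data validate datapackage.json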

From within python, using goodtables.validate() on the same data package including all ~2 million records, validation completes successfully and takes about 10 minutes.

I am running Ubuntu 18.04.1 on a Thinkpad T470S with two 2-thread cores and 24GB of RAM. The versions of node (v8.11.1) and npm (v6.4.1) I'm using are the ones distributed with the current anaconda3 distribution (v5.2). The version of data is 0.9.5.

zaneselvans commented 5 years ago

Core dump aside, it seems like the data validation could be much faster. Is it going through the data record by record, or working on vectorized columns?
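
To illustrate what I mean by the latter (purely hypothetical; I'm not suggesting this is how data-cli or goodtables work), a column-wise check in pandas might look something like:

import pandas as pd

# Hypothetical sketch of column-wise ("vectorized") type checking: cast each
# column in bulk and count the values that fail, rather than validating the
# table cell by cell.
def count_cast_failures(csv_path, numeric_columns):
    bad_counts = {col: 0 for col in numeric_columns}
    for chunk in pd.read_csv(csv_path, chunksize=500000, dtype=str):
        for col in numeric_columns:
            coerced = pd.to_numeric(chunk[col], errors="coerce")
            # count non-empty values that failed to cast to a number
            bad_counts[col] += int((coerced.isna() & chunk[col].notna()).sum())
    return bad_counts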

ezwelty commented 4 years ago

@zaneselvans When running goodtables.validate(), are you setting row_limit= to a number large enough to scan the whole table? At least on my system, the default limit is 1000. I ask because I'm suspicious of your speed result (2 million records in 10 minutes); based on testing with my own data, I would have expected it to be much slower...

"warnings": ["Table table.csv inspection has reached 1000 row(s) limit"]

At least goodtables-py is checking line by line. I agree this could probably be done much faster by working with vectorized columns.
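
For reference, scanning everything programmatically would look roughly like this (the row count below is just a placeholder; I haven't run it against your package):

from goodtables import validate

# Sketch only: raise row_limit above the default of 1000 so the whole table
# is scanned rather than just the first 1000 rows.
report = validate('datapackage.json', row_limit=2000000)
print(report['valid'])
print(report['warnings'])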

zaneselvans commented 4 years ago

It's been a while! I'm not sure I remember whether I had the row limit set; initially, at least, I was trying to validate everything. In the PUDL project we now plan to use goodtables programmatically, but it isn't yet able to test everything we want to check about the structure of the data, so we're only running it on a few thousand rows, and the main structural validation we do happens by actually loading all the data into an SQLite database.