georgian-io-archive / foreshadow

An automatic machine learning system
https://foreshadow.readthedocs.io
Apache License 2.0
29 stars 2 forks source link

JSON flattener only process a column that are 100% valid JSON, single… #203

Closed jzhang-gp closed 4 years ago

jzhang-gp commented 4 years ago

Description

JSON flattener should only process columns that are 100% valid JSON.

One of the issues we noticed is the single quoted fake JSON format. A valid JSON object should use double quote on the key and value field:

{
  "firstName": "John",
  "lastName": "Smith",
  "age": 27
}

NOT


{
  'firstName': 'John',
  'lastName': 'Smith',
  'age': 27,
}
The problem we have right now is that when calculating the metric score to determine which flattener to use (well we only have one for now), JSONFlattener may be triggered due to some empty field in the column and thus json.load() does not throw exceptions. Even though the rest of the columns are invalid JSON and throw exceptions, the final score is still greater than 0. The JSONFlattener is eventually chosen since there is no better options, and it fails the downstream.

We should, at least for now,  enforce valid JSON format on the input.