CODAIT / text-extensions-for-pandas

Natural language processing support for Pandas dataframes.
Apache License 2.0

Update Table Understanding conversion to accept updated schema #190

Open frreiss opened 3 years ago

frreiss commented 3 years ago

Recent versions of Watson Discovery have made undocumented changes to the format of the output of the Table Understanding enrichment. The old column names are documented at https://cloud.ibm.com/docs/discovery-data?topic=discovery-data-understanding_tables#table-output-schema

Rough translation of field names into the new naming convention:

new_name_to_old = {
    "row_min": "row_index_begin",
    "row_max": "row_index_end",
    "column_min": "column_index_begin",
    "column_max": "column_index_end",
    "cell_text": "text",
    "id": "cell_id"
}

Also, the "location" field at the top level of the table record now appears to be optional.

Our conversion to Pandas needs to be updated to cover both the old schema and the new schema.

I recommend that we first determine which schema is the canonical one and convert non-canonical schemas to the canonical one as a preprocessing step.
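
A minimal sketch of what that preprocessing step could look like, assuming the old (documented) names are chosen as canonical. The function names `normalize_cell` and `normalize_table` are hypothetical and not part of the library. The list of keys that hold cell records (`body_cells`, `row_headers`, `column_headers`) is an assumption based on the documented schema and would need to be checked against real output of the new format.

```python
from typing import Any, Dict, Iterable

# Mapping taken from the list above: new field name -> old (documented) name.
NEW_NAME_TO_OLD = {
    "row_min": "row_index_begin",
    "row_max": "row_index_end",
    "column_min": "column_index_begin",
    "column_max": "column_index_end",
    "cell_text": "text",
    "id": "cell_id",
}


def normalize_cell(cell: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one cell record with new-style keys renamed to old-style."""
    out: Dict[str, Any] = {}
    for key, value in cell.items():
        canonical = NEW_NAME_TO_OLD.get(key, key)
        # If both spellings are present, keep the value stored under the old name.
        if canonical in out and key != canonical:
            continue
        out[canonical] = value
    return out


def normalize_table(
    table: Dict[str, Any],
    # Assumption: these keys hold lists of cell records in both schemas.
    cell_list_keys: Iterable[str] = ("body_cells", "row_headers", "column_headers"),
) -> Dict[str, Any]:
    """Return a copy of a table record converted to the old (canonical) schema."""
    out = dict(table)
    for key in cell_list_keys:
        if key in out:
            out[key] = [normalize_cell(cell) for cell in out[key]]
    # "location" is optional in the new format; make its absence explicit so
    # downstream code can test for None instead of catching KeyError.
    out.setdefault("location", None)
    return out
```

Records already in the old schema pass through unchanged apart from the explicit `location` entry, so the Pandas conversion only has to handle one set of field names:

```python
new_style = {"body_cells": [{"id": "c1", "cell_text": "42", "row_min": 0,
                             "row_max": 0, "column_min": 1, "column_max": 1}]}
canonical = normalize_table(new_style)
```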