feat: stringify dataset column contents during data loading

danielezhu commented 8 months ago

Issue #, if available:

Description of changes: Currently, the JSON parsing code extracts raw data from the input dataset using user-provided JMESPath expressions. This raw data can be of any data type, such as booleans, floats, etc.

Certain columns (e.g. target output) should always be strings, even if the raw data from the dataset is not in string form. For example, the dataset could contain target outputs that are booleans, but due to the nature of the evaluation algorithms, the target output needs to be something like "True" or "False".

This PR adds a function cast_to_string, and updates the _parse_column method to call cast_to_string if appropriate. Additionally, this PR changes class ColumnNames(Enum) in constants.py to class DatasetColumns(Enum), where instead of enumerating just the names of the columns, it enumerates Column objects, where Column is a new class that I introduce.

Column contains a name attribute which represents the original strings that were being enumerated by class ColumnNames(Enum) and a should_cast attribute which represents whether the contents of this column should be casted to string during data loading.

The reason I added this class is so that we don't create a separate list for tracking which columns should/shouldn't be casted. Doing so would require updating this list manually whenever new column types are introduced, which is a source of human error that should be avoided. PR #171 was raised precisely because we were doing something similar in the past.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

danielezhu commented 8 months ago

Luca raised a good point about not casting every column. It is clear why we shouldn't do this if we consider the log probability columns. I will update this PR after Luca sends me the list of columns that should be converted to strings, and those that should not.

lucfra commented 8 months ago

Looking at the data config in principle we should convert the following columns (if present):

    model_input_location: Optional[str] = None
    model_output_location: Optional[str] = None
    target_output_location: Optional[str] = None
    category_location: Optional[str] = None
    sent_more_input_location: Optional[str] = None
    sent_less_input_location: Optional[str] = None

lucfra commented 8 months ago

However I have another question. Can in principle be the fields also lists? like having a list of target output(s) ? Or do we prohibit this?

aws / fmeval

feat: stringify dataset column contents during data loading #168