Closed danielezhu closed 8 months ago
Luca raised a good point about not casting every column. It is clear why we shouldn't do this if we consider the log probability columns. I will update this PR after Luca sends me the list of columns that should be converted to strings, and those that should not.
Looking at the data config in principle we should convert the following columns (if present):
model_input_location: Optional[str] = None
model_output_location: Optional[str] = None
target_output_location: Optional[str] = None
category_location: Optional[str] = None
sent_more_input_location: Optional[str] = None
sent_less_input_location: Optional[str] = None
However I have another question. Can in principle be the fields also lists? like having a list of target output(s) ? Or do we prohibit this?
Issue #, if available:
Description of changes: Currently, the JSON parsing code extracts raw data from the input dataset using user-provided JMESPath expressions. This raw data can be of any data type, such as booleans, floats, etc.
Certain columns (e.g. target output) should always be strings, even if the raw data from the dataset is not in string form. For example, the dataset could contain target outputs that are booleans, but due to the nature of the evaluation algorithms, the target output needs to be something like
"True"
or"False"
.This PR adds a function
cast_to_string
, and updates the_parse_column
method to callcast_to_string
if appropriate. Additionally, this PR changesclass ColumnNames(Enum)
inconstants.py
toclass DatasetColumns(Enum)
, where instead of enumerating just the names of the columns, it enumeratesColumn
objects, whereColumn
is a new class that I introduce.Column
contains aname
attribute which represents the original strings that were being enumerated byclass ColumnNames(Enum)
and ashould_cast
attribute which represents whether the contents of this column should be casted to string during data loading.The reason I added this class is so that we don't create a separate list for tracking which columns should/shouldn't be casted. Doing so would require updating this list manually whenever new column types are introduced, which is a source of human error that should be avoided. PR #171 was raised precisely because we were doing something similar in the past.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.