Closed gforge closed 8 years ago
I would totally need this feature too, it's a great idea!
Maybe it would be simpler to add the newly created categorical translation as a normal column, keep the factor as you described and handle the new column as usual. We could also handle the insertion of new items by offering automatic conversion between numerical values and categorical ones in the right column.
For `update`/`insert`/`set`, I think it would be better to re-compute the categorical keys to add the new one (if it doesn't already exist, of course).
What do you think?
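To illustrate the suggestion above, here is a minimal sketch in plain Lua of adding a key only when an inserted value is not yet known. The helper name `key_for` and the example values are hypothetical and not part of the package; appending the new key at the end of the range is just one possible way of re-computing the keys.

```lua
-- Hypothetical helper: returns the key for `value`, registering a new key
-- at the end of the range if the value has not been seen before.
local function key_for(keys, value)
  if keys[value] == nil then
    local count = 0
    for _ in pairs(keys) do count = count + 1 end
    keys[value] = count + 1
  end
  return keys[value]
end

local keys = {male = 1, female = 2}
local codes = {1, 2, 2}

-- Inserting an existing category reuses its key ...
codes[#codes + 1] = key_for(keys, "female")
-- ... while an unseen category extends the key table first.
codes[#codes + 1] = key_for(keys, "other")
```

Appending keeps the keys already stored in the column valid, so existing rows do not need to be rewritten.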
If I understand this correctly, it would mean that we have two columns with the same information? I don't think that's a good idea: `mode = 'large'` suggests that this may be the case (by the way, I think we should add an issue for supporting 'large'). Copying a column is expensive, while storing numerical values instead of strings together with a lookup table is cheap. I also think the function could use `unique` at the backend for generating the keys if we change `as_keys` to return key numbers ranging from 1 to `#unique`.
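A rough sketch of that idea in plain Lua, assuming the distinct values are numbered in order of first appearance. The standalone `unique` and `as_keys` functions below are hypothetical stand-ins written for this example; their signatures are not taken from the package.

```lua
-- Hypothetical stand-in for a `unique` call: returns the distinct values
-- of a column in order of first appearance.
local function unique(values)
  local seen, out = {}, {}
  for _, v in ipairs(values) do
    if not seen[v] then
      seen[v] = true
      out[#out + 1] = v
    end
  end
  return out
end

-- Number the distinct values 1..#unique to obtain the lookup table.
local function as_keys(values)
  local keys = {}
  for i, v in ipairs(unique(values)) do
    keys[v] = i
  end
  return keys
end

local column = {"dog", "cat", "horse", "cat", "dog"}
local keys = as_keys(column)

-- Store one small integer per row instead of a second string column.
local codes = {}
for i, v in ipairs(column) do
  codes[i] = keys[v]
end
```

Only the small `keys` table holds strings; the column itself becomes a list of integers.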
You've got a really good point. Let's do this.
Quite frequently, data in a CSV represents non-numeric variables, e.g. male/female or dog/cat/horse, and it would be useful to import these as strings and convert them into integers between 1 and `#myDF:unique("String Col")`, i.e. the behaviour of the pandas categorical dtype.

Keys
The keys for the conversion should be saved in a separate translator table initiated in `__init()`. There are generally two directions for factors: to and from numeric values. I suggest that one table, the to-numeric one, is kept in a `self.categorical` that has keys according to column names.

New functions
- `Dataframe:as_categorical`: A factor conversion function for populating `self.factors`, combined with converting the values to numerics. Also updates `self.schema`.
- `Dataframe:to_categorical`: Takes a number, a tensor or a table and converts it to a string value, or to a string table if length > 1.
- `Dataframe:from_categorical`: Takes a string or a table of strings and converts them to a factor according to the key value.

Adaptation of functions
- `update`/`insert`/`set`: Any change must be checked. If a numeric value is entered in a categorical column, should this be accepted or should we force the input to match the key table?
- `load_csv`/`load_table` should call `__init()` before running to clear all data, including the categorical table.
- `_refresh_metadata` should also include a check if the categorical is still present in the dataset.
- `get_column` should have an option of getting `categorical_as_string = true` for forcing strings as default.
- `reset_column` should delete the categorical data and issue a warning.
- `rename_column` must update the categorical table.
- `fill_na` and `fill_all_na` must respect the categories. Not sure how to approach the default case for `fill_all_na`.
- `to_csv` should create a copy with strings before passing the dataset on to csvigo.
- `head`/`tail`/`show`/`unique` need to have the `categorical_as_string = true` option.
- `where` needs to check if the column is a categorical; perhaps add the `categorical_as_string = true` option.

It is a rather extensive change that is needed, but I would really love this feature. Any thoughts on the names or the approach?
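To make the proposal a bit more concrete, here is a minimal sketch in plain Lua. `MiniFrame` is a hypothetical stand-in for the real Dataframe class, and the argument order of each function is an assumption; only the function names, `self.categorical` and the `categorical_as_string` option come from the text above. Tensor input and the schema update are left out.

```lua
-- Hypothetical stand-in for the Dataframe class, for illustration only.
local MiniFrame = {}
MiniFrame.__index = MiniFrame

function MiniFrame.new(dataset)
  -- self.categorical is keyed by column name: { [column] = { [string] = key } }
  return setmetatable({dataset = dataset, categorical = {}}, MiniFrame)
end

-- Fail loudly on values that are not in the key table.
local function lookup(map, value)
  local found = map[value]
  assert(found ~= nil, "unknown category: " .. tostring(value))
  return found
end

-- Replace the strings in a column with integer keys and remember the mapping.
function MiniFrame:as_categorical(column_name)
  local keys, next_key = {}, 1
  local column = self.dataset[column_name]
  for i, value in ipairs(column) do
    if keys[value] == nil then
      keys[value] = next_key
      next_key = next_key + 1
    end
    column[i] = keys[value]
  end
  self.categorical[column_name] = keys
end

-- String or table of strings -> key(s).
function MiniFrame:from_categorical(data, column_name)
  local keys = self.categorical[column_name]
  if type(data) ~= "table" then return lookup(keys, data) end
  local out = {}
  for i, value in ipairs(data) do out[i] = lookup(keys, value) end
  return out
end

-- Key or table of keys -> string(s); the reverse table is derived on demand
-- because only the to-numeric table is stored.
function MiniFrame:to_categorical(data, column_name)
  local labels = {}
  for label, key in pairs(self.categorical[column_name]) do labels[key] = label end
  if type(data) ~= "table" then return lookup(labels, data) end
  local out = {}
  for i, key in ipairs(data) do out[i] = lookup(labels, key) end
  return out
end

function MiniFrame:get_column(column_name, categorical_as_string)
  local column = self.dataset[column_name]
  if categorical_as_string and self.categorical[column_name] then
    return self:to_categorical(column, column_name)
  end
  return column
end

local df = MiniFrame.new({pet = {"dog", "cat", "horse", "cat"}})
df:as_categorical("pet")
```

After the conversion, `df:get_column("pet")` returns the integer keys, while `df:get_column("pet", true)` returns the original strings.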