AlexMili / torch-dataframe

Utility class to manipulate dataset from CSV file
MIT License

Converting categorical values (strings) to integers while keeping a translation table #4

Closed gforge closed 8 years ago

gforge commented 8 years ago

Quite frequently the data in a CSV represents non-numeric variables, e.g. male/female or dog/cat/horse. It would be useful to import these as strings and convert them into integers between 1 and #myDF:unique("String Col"), similar to the behaviour of the pandas categorical dtype.
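To make the idea concrete, here is a minimal sketch in plain Lua (no torch needed). The helper name encode_column is hypothetical and not part of torch-dataframe, and numbering the keys in order of first appearance is an assumption of this sketch.

```lua
-- Hypothetical helper, not part of torch-dataframe: maps each distinct
-- string in a column to an integer between 1 and the number of unique values.
local function encode_column(values)
  local keys = {}      -- translation table: string -> integer
  local n_unique = 0
  local encoded = {}
  for i, value in ipairs(values) do
    if keys[value] == nil then
      -- First time this string is seen: give it the next free integer
      n_unique = n_unique + 1
      keys[value] = n_unique
    end
    encoded[i] = keys[value]
  end
  return encoded, keys
end

local encoded, keys = encode_column({"dog", "cat", "horse", "cat", "dog"})
print(table.concat(encoded, " "))  -- 1 2 3 2 1
print(keys["horse"])               -- 3
```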

Keys

The keys for the conversion should be saved in a separate translation table initialized in __init(). Factors generally need conversion in two directions, to and from the numeric value. I suggest keeping one table, the string-to-numeric one, in self.categorical, keyed by column name.

New functions

- Dataframe:as_categorical - A factor conversion function for populating the self.factors combined with converting the values to numerics. Updates also the self.schema.
- Dataframe:to_categorical - Takes a number, a tensor or a table and converts it to a string value or a string table if length > 1
- Dataframe:from_categorical - Takes a string, or a table of strings and converts them to a factor according to the key value
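As a rough sketch of how the two lookup directions might behave, here are standalone plain-Lua versions of to_categorical and from_categorical. They take the key table as an explicit argument instead of reading it from the Dataframe, and tensor input is left out to avoid a torch dependency; both simplifications are assumptions of this sketch, not a proposed final signature.

```lua
-- Example key table for one column: string -> integer
local keys = { male = 1, female = 2 }

-- Build the reverse lookup: integer -> string
local function invert(key_table)
  local labels = {}
  for label, index in pairs(key_table) do
    labels[index] = label
  end
  return labels
end

-- Number -> string, or table of numbers -> table of strings
local function to_categorical(key_table, data)
  local labels = invert(key_table)
  if type(data) == "number" then
    return labels[data]
  end
  local out = {}
  for i, index in ipairs(data) do
    out[i] = labels[index]
  end
  return out
end

-- String -> number, or table of strings -> table of numbers
local function from_categorical(key_table, data)
  if type(data) == "string" then
    return key_table[data]
  end
  local out = {}
  for i, label in ipairs(data) do
    out[i] = key_table[label]
  end
  return out
end

print(to_categorical(keys, 2))                                        -- female
print(table.concat(to_categorical(keys, {1, 2, 1}), " "))             -- male female male
print(table.concat(from_categorical(keys, {"female", "male"}), " "))  -- 2 1
```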

Adaptation of functions

- update/insert/set: Any change must be checked. If a numeric value is entered in a categorical column, should this be accepted or should we force the input to match the key table?
- load_csv/load_table should call __init() before running to clear all data, including the categorical table.
- _refresh_metadata should also check whether the categorical is still present in the dataset.
- get_column should have an option, categorical_as_string = true, for forcing strings as default.
- reset_column should delete the categorical data and issue a warning.
- rename_column must update the categorical table.
- fill_na and fill_all_na must respect the categories. Not sure how to approach the default case for fill_all_na.
- to_csv should create a copy with strings before passing the dataset on to csvigo.
- head/tail/show/unique need to have a categorical_as_string = true option.
- where needs to check if the column is a categorical; perhaps add the categorical_as_string = true option.
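Two of the adaptations above, rename_column and the categorical_as_string option for get_column, can be sketched in plain Lua as follows. The df table here is a simplified stand-in for a Dataframe (just a dataset and a categorical field), and the free-function signatures are hypothetical.

```lua
-- Simplified stand-in for a Dataframe: column data plus the key tables
local df = {
  dataset = { Gender = {1, 2, 1} },
  categorical = { Gender = { male = 1, female = 2 } },
}

-- Renaming a column must move its key table along with the data
local function rename_column(frame, old_name, new_name)
  frame.dataset[new_name] = frame.dataset[old_name]
  frame.dataset[old_name] = nil
  if frame.categorical[old_name] ~= nil then
    frame.categorical[new_name] = frame.categorical[old_name]
    frame.categorical[old_name] = nil
  end
end

-- Returns the stored integers, or the strings when categorical_as_string
-- is true and the column has a key table
local function get_column(frame, name, categorical_as_string)
  local column = frame.dataset[name]
  local keys = frame.categorical[name]
  if not categorical_as_string or keys == nil then
    return column
  end
  local labels = {}
  for label, index in pairs(keys) do
    labels[index] = label
  end
  local out = {}
  for i, index in ipairs(column) do
    out[i] = labels[index]
  end
  return out
end

rename_column(df, "Gender", "Sex")
print(table.concat(get_column(df, "Sex", true), " "))  -- male female male
print(table.concat(get_column(df, "Sex"), " "))        -- 1 2 1
```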

It is a rather extensive change that is needed but I would really love this feature. Any thoughts on the names or the approach?

AlexMili commented 8 years ago

I would totally need this feature too, it's a great idea!

Maybe it would be simpler to add the newly created categorical translation as a normal column, keep the factor as you described and handle the new column as usual. We could also handle insertion of new items by offering automatic conversion between numerical values and categorical ones in the right column.

For update/insert/set I think it would be better to re-compute the categorical keys to add the new one (if it doesn't exist already, of course).
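Something like the following plain-Lua sketch is what adding a key on insert could look like. The helper insert_value is hypothetical, and it assumes the existing keys are numbered contiguously from 1.

```lua
-- Hypothetical helper: appends a string value to a categorical column,
-- adding a new key first if the string has not been seen before.
local function insert_value(column, keys, value)
  if keys[value] == nil then
    -- Count the existing keys; assumes they are numbered 1..n without gaps
    local n = 0
    for _ in pairs(keys) do
      n = n + 1
    end
    keys[value] = n + 1
  end
  column[#column + 1] = keys[value]
end

local column = {1, 2}
local keys = { dog = 1, cat = 2 }

insert_value(column, keys, "horse")  -- new category, gets key 3
insert_value(column, keys, "dog")    -- existing category, reuses key 1

print(table.concat(column, " "))  -- 1 2 3 1
print(keys["horse"])              -- 3
```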

What do you think?

gforge commented 8 years ago

If I understand this correctly it would mean that we have two columns with the same information? I don't think that's a good idea:

By the way, I think the function could use unique at the backend for generating the keys if we change the as_keys option to return key numbers ranging from 1 to #unique.
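A rough plain-Lua sketch of that behaviour is below. This unique works on a plain list of values rather than on a Dataframe column, so the signature is a simplification for illustration; numbering by order of first appearance is also an assumption.

```lua
-- Simplified stand-in for unique: returns the distinct values in order of
-- first appearance, or, when as_keys is true, a table mapping each value
-- to a key number between 1 and the number of unique values.
local function unique(values, as_keys)
  local key_of, ordered = {}, {}
  for _, value in ipairs(values) do
    if key_of[value] == nil then
      ordered[#ordered + 1] = value
      key_of[value] = #ordered
    end
  end
  if as_keys then
    return key_of
  end
  return ordered
end

local values = {"b", "a", "b", "c"}
local key_table = unique(values, true)

print(table.concat(unique(values), " "))               -- b a c
print(key_table["b"], key_table["a"], key_table["c"])  -- 1 2 3
```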

AlexMili commented 8 years ago

You've got a really good point. Let's do this.