Open vasslitvinov opened 5 years ago
If it's not too much to ask, can you provide a code snippet showing what you envision this looking like?
I do not have a proposal at the moment.
Something to consider is Categoricals in pandas.
This is a good start!
See also the Categorical API reference for ideas. We don't have to do everything that pandas does or the way pandas does it, but their API is a good clue to the needs of the data science community.
Additional considerations:
This issue requests language support for "categorical data".
"Categorical data" is somewhat like an enum type. More specifically:
The set of categories, i.e. values in a collection of categorical data, may not be known at compile time. For example, it can be determined only after all the data has been read in.
The user would like to treat each category value as a string. For example, compute its length, print it as a string with writeln(), or call other string functions on it.
The user would like to store each category as an integer. For example, an array of categories "strawberry", "banana", "vanilla" would be implemented as an array of 0s, 1s and 2s. This is to reduce the amount of storage needed for a collection of categorical data.
Option: give that integer a smaller bit width when appropriate. For example, a uint(16) suffices for up to 65,536 categories. This would further reduce the storage size.
Option: support categorical data other than strings. For example, a dataset consisting of only the numbers 0, 40, 10^30, or of a small number of instances of a record or class type.
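To make the requirements above concrete, here is a minimal sketch in Python rather than Chapel. It is not a proposal for Chapel syntax. The names `Categorical` and `smallest_typecode` are hypothetical, and the sketch assumes category values are hashable. Categories are discovered at run time, each element is stored as an integer code of the narrowest available width, and reads return the original value so that string operations still apply.

```python
from array import array


def smallest_typecode(n_categories):
    """Pick the narrowest unsigned array typecode that can index n_categories."""
    for typecode in ("B", "H", "I", "Q"):
        # itemsize is this platform's byte width for the typecode.
        if n_categories <= 1 << (8 * array(typecode).itemsize):
            return typecode
    raise ValueError("too many categories")


class Categorical:
    """Stores each value as a small integer code into a table of categories."""

    def __init__(self, values):
        values = list(values)
        # Categories are discovered at run time, in order of first appearance.
        self.categories = list(dict.fromkeys(values))
        code_of = {cat: i for i, cat in enumerate(self.categories)}
        self.codes = array(smallest_typecode(len(self.categories)),
                           (code_of[v] for v in values))

    def __len__(self):
        return len(self.codes)

    def __getitem__(self, i):
        # Reads hand back the original value, so string methods still work.
        return self.categories[self.codes[i]]


flavors = Categorical(["strawberry", "banana", "vanilla", "banana", "strawberry"])
print(list(flavors.codes))          # [0, 1, 2, 1, 0]
print(flavors[3], len(flavors[3]))  # banana 6
```

The same class accepts non-string values such as `Categorical([0, 40, 10**30, 40])`, since the table only needs its entries to be hashable.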
An interesting implementation challenge is to support more than one set of categories per program execution. In that case, the implementation needs to associate each integer index with its lookup table. Possible solutions:
If the number of category sets is known at compile time, the index can be of a generic type parameterized by the table number.
Otherwise we could think about it as a runtime type. The runtime component of the type could indicate which lookup table to use.
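The runtime-type option can be sketched as follows, again in Python for illustration only. `CategoryTable`, `CategoricalArray`, and `count_matches` are hypothetical names. The assumption here is that the collection, not each element, carries the reference to its lookup table, so the stored codes stay bare integers and codes from different tables are never compared by accident.

```python
class CategoryTable:
    """One set of categories; several tables may coexist in a program."""

    def __init__(self):
        self.categories = []
        self._code_of = {}

    def code_for(self, value):
        # Grow the table the first time a value is seen.
        if value not in self._code_of:
            self._code_of[value] = len(self.categories)
            self.categories.append(value)
        return self._code_of[value]


class CategoricalArray:
    """Bare integer codes plus one runtime reference to their lookup table."""

    def __init__(self, table, values=()):
        self.table = table
        self.codes = [table.code_for(v) for v in values]

    def __getitem__(self, i):
        return self.table.categories[self.codes[i]]

    def count_matches(self, other):
        """Count positions where both arrays hold the same category."""
        # Codes are only comparable when they index the same table.
        if self.table is not other.table:
            raise TypeError("arrays use different category tables")
        return sum(a == b for a, b in zip(self.codes, other.codes))


flavors = CategoryTable()
colors = CategoryTable()

monday = CategoricalArray(flavors, ["vanilla", "banana", "vanilla"])
tuesday = CategoricalArray(flavors, ["vanilla", "strawberry", "vanilla"])
paint = CategoricalArray(colors, ["red", "green", "red"])

print(monday.codes, paint.codes)      # [0, 1, 0] [0, 1, 0]
print(monday[0], paint[0])            # vanilla red
print(monday.count_matches(tuesday))  # 2
```

Here `monday` and `paint` hold identical codes but decode to different values, and `monday.count_matches(paint)` raises `TypeError`. A compile-time variant would move that check into the type system by parameterizing the index type on the table.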