As the number of supported example datasets grows, it would be nice to standardize them. The tricky part is that the datasets come in multiple forms: numpy arrays, lists of numpy arrays, dataframes, lists of text, and possibly more. Further, some of them have labels and others do not. scikit-learn has a simple class that contains a data and a target field:
from sklearn import datasets

# load_digits returns a Bunch object with .data and .target attributes
digits = datasets.load_digits(n_class=6)
data = digits.data      # feature matrix, shape (n_samples, 64)
group = digits.target   # class label for each sample
The labels are called target because they are typically the target labels for a classification problem. In our case, that may or may not be true. So, do we stick with the sklearn API, or create a new simple class to organize our datasets?
Perhaps something like this (rough sketch below):
ds.data - the dataset; a numpy array, dataframe, str, list of str, or a mixed list
ds.labels - a text label for each datapoint
ds.desc - a short description of the dataset
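One way this could look is a small dataclass with exactly those three fields. This is a minimal sketch, assuming the class is called Dataset and that unlabeled datasets simply leave labels as None; the names and defaults are illustrative, not decided:

from dataclasses import dataclass
from typing import Any, Optional, Sequence

@dataclass
class Dataset:
    # data: numpy array, dataframe, str, list of str, or a mixed list
    data: Any
    # labels: one text label per datapoint; None when the dataset is unlabeled
    labels: Optional[Sequence[str]] = None
    # desc: short human-readable description of the dataset
    desc: str = ""

# Example: wrapping the sklearn digits data from above
ds = Dataset(
    data=digits.data,
    labels=[str(t) for t in digits.target],
    desc="sklearn digits, 6 classes",
)

With something like this, labeled and unlabeled datasets look the same to callers except that labels is None, which sidesteps the sklearn assumption that every dataset has a classification target.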