As the number of supported example datasets grows, it would be nice to standardize them. The tricky part is that the datasets come in multiple forms: numpy arrays, lists of numpy arrays, dataframes, lists of text, and possibly more. Further, some of them have labels and others do not. scikit-learn has a simple class that contains a data and a target field:
from sklearn import datasets

# load_digits returns a Bunch object with .data and .target attributes
digits = datasets.load_digits(n_class=6)
data = digits.data      # feature matrix, shape (n_samples, 64)
group = digits.target   # class label for each sample
The labels are called target because they are typically the target labels for a classification problem. In our case, that may or may not be true. So, do we stick with the sklearn API, or create a new simple class to organize our datasets?
Perhaps something like this (rough sketch below):
ds.data - the dataset; a numpy array, dataframe, str, list of str, or a mixed list
ds.labels - a text label for each datapoint
ds.desc - a short description of the dataset
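One way this could look is a small dataclass with exactly those three fields. This is a minimal sketch, assuming the class is called Dataset and that unlabeled datasets simply leave labels as None; the names and defaults are illustrative, not decided:

from dataclasses import dataclass
from typing import Any, Optional, Sequence

@dataclass
class Dataset:
    # data: numpy array, dataframe, str, list of str, or a mixed list
    data: Any
    # labels: one text label per datapoint; None when the dataset is unlabeled
    labels: Optional[Sequence[str]] = None
    # desc: short human-readable description of the dataset
    desc: str = ""

# Example: wrapping the sklearn digits data from above
ds = Dataset(
    data=digits.data,
    labels=[str(t) for t in digits.target],
    desc="sklearn digits, 6 classes",
)

With something like this, labeled and unlabeled datasets look the same to callers except that labels is None, which sidesteps the sklearn assumption that every dataset has a classification target.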