eonu / sequentia

Scikit-Learn compatible HMM and DTW based sequence machine learning algorithms in Python.
https://pypi.org/project/sequentia/
MIT License

Add sequentia.datasets module for readily available real-world datasets #141

Closed eonu closed 2 years ago

eonu commented 3 years ago

For examples and documentation, it would be useful to have a sequentia.datasets module that ships real-world datasets in a format that is ready to use with sequentia.

These would probably be public datasets (e.g. from Kaggle or the UCI Machine Learning Repository), but we should make sure that their licenses allow us to redistribute whichever datasets we use.

Some good examples are:

It would also be good if the data loader worked in a compatible way with the GMMHMM and HMMClassifier classes, e.g.:

from sequentia.classifiers import GMMHMM, HMMClassifier
from sequentia.datasets import trajectories

# Proposed loader: returns a dataset restricted to 10 classes
dataset = trajectories(n_class=10)

# Fit one GMM-HMM per class on that class's sequences
hmms = []
for X, k in dataset.iter_by_class():
    hmm = GMMHMM(label=k, n_states=3, n_components=10)
    hmm.set_random_initial()
    hmm.set_random_transitions()
    hmm.fit(X)
    hmms.append(hmm)

# Combine the per-class HMMs into a single classifier
hmm_clf = HMMClassifier()
hmm_clf.fit(hmms)

but also in a way that works well with the KNNClassifier, e.g.:

from sequentia.classifiers import KNNClassifier
from sequentia.datasets import trajectories

dataset = trajectories(n_class=10)

knn_clf = KNNClassifier(k=3, classes=dataset.classes)
knn_clf.fit(*dataset.data)
# where dataset.data = (X, y)
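To make the two access patterns above concrete, here is a rough sketch of what the returned dataset object could look like. SequenceDataset and its fields are hypothetical names used for illustration, not existing sequentia API, and the toy sequences are plain nested lists where real data would more likely be NumPy arrays.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Tuple

# A sequence is a list of timesteps; each timestep is a list of features.
Sequence = List[List[float]]


@dataclass
class SequenceDataset:
    """Hypothetical container; not part of sequentia's actual API."""

    X: List[Sequence]  # one variable-length sequence per observation
    y: List[int]       # one class label per sequence

    @property
    def classes(self) -> List[int]:
        # Sorted unique labels, e.g. for KNNClassifier(classes=...)
        return sorted(set(self.y))

    @property
    def data(self) -> Tuple[List[Sequence], List[int]]:
        # (X, y) pair, so that clf.fit(*dataset.data) works
        return self.X, self.y

    def iter_by_class(self) -> Iterator[Tuple[List[Sequence], int]]:
        # Yield (sequences of class k, k), one class at a time,
        # which is the shape needed to fit one HMM per class
        for k in self.classes:
            yield [x for x, label in zip(self.X, self.y) if label == k], k


# Toy data: 6 random 2-feature sequences of varying length, 3 classes
rng = random.Random(0)
dataset = SequenceDataset(
    X=[[[rng.random(), rng.random()] for _ in range(rng.randint(5, 9))]
       for _ in range(6)],
    y=[0, 1, 0, 2, 1, 2],
)

for X_k, k in dataset.iter_by_class():
    print(k, len(X_k))
```

Keeping X as a list of per-sequence arrays (rather than one padded array) is just one option; it keeps variable-length sequences simple at the cost of some convenience.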

We also need to make it easy to split the data into training, validation and test sets.
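As a starting point for discussion, a per-class (stratified) split over a list of sequences could be sketched as below. The function name stratified_split, its parameters and the 60/20/20 default are assumptions made for illustration only; nothing here is existing sequentia API.

```python
import random
from collections import defaultdict


def stratified_split(X, y, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sequences into (train, validation, test), class by class.

    Hypothetical helper: X is a list of sequences, y a list of labels.
    Returns three (X_part, y_part) tuples.
    """
    rng = random.Random(seed)

    # Group sequence indices by class label
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    parts = ([], [], [])  # index lists for train, validation, test
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        parts[0].extend(idx[:n_train])
        parts[1].extend(idx[n_train:n_train + n_val])
        parts[2].extend(idx[n_train + n_val:])  # remainder goes to test

    return tuple(([X[i] for i in part], [y[i] for i in part]) for part in parts)


# Toy data: 20 variable-length sequences, 10 per class
X = [[[float(i), float(t)] for t in range(3 + i % 4)] for i in range(20)]
y = [i % 2 for i in range(20)]

train, val, test = stratified_split(X, y)
print(len(train[0]), len(val[0]), len(test[0]))
```

Splitting within each class keeps the class proportions roughly equal across the three sets, which matters when some classes have few sequences.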

Note: Make sure to fix the tests that were skipped in #201.

eonu commented 2 years ago

Implemented in #214.