For example and documentation purposes, it would be useful to have a sequentia.datasets module that would come with real-world datasets in a format that is ready to be used with sequentia.
These would probably be public datasets (e.g. from Kaggle or the UCI Machine Learning Repository), but we should check that the license of each dataset allows us to redistribute it.
Some good examples are:
- the character trajectories dataset that is currently being used in the tutorial notebooks,
- a speech recognition dataset of MFCCs for isolated spoken words/characters/numbers (with the option of adding ∆ and ∆∆ features too),
- a sentiment analysis or text classification dataset of sentences/phrases (represented as a sequence of word embeddings).
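For the ∆ and ∆∆ option on the MFCC dataset, a minimal sketch of the usual regression-based delta computation is given here. The function names `delta` and `add_deltas`, the `width` parameter and the edge handling (repeating the first and last frames) are assumptions for illustration, not existing sequentia API.

```python
def delta(frames, width=2):
    """Regression-based deltas over a sequence of feature frames.

    frames: list of equal-length lists (T frames x D coefficients).
    width:  frames on each side used in the regression (assumed default).
    """
    T = len(frames)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, T - 1)][d]  # clamp at the last frame
                behind = frames[max(t - n, 0)][d]     # clamp at the first frame
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_deltas(frames, width=2):
    """Append delta and delta-delta coefficients to every frame."""
    d1 = delta(frames, width)
    d2 = delta(d1, width)  # delta-deltas are the deltas of the deltas
    return [f + a + b for f, a, b in zip(frames, d1, d2)]
```

With this layout, a sequence of 13-dimensional MFCC frames would become a sequence of 39-dimensional frames, which could be exposed through a flag on the loader.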
It would also be good if the data loader worked in a compatible way with the GMMHMM and HMMClassifier classes, e.g.:
from sequentia.classifiers import GMMHMM, HMMClassifier
from sequentia.datasets import trajectories
dataset = trajectories(n_class=10)
hmms = []
for X, k in dataset.iter_by_class():
    hmm = GMMHMM(label=k, n_states=3, n_components=10)
    hmm.set_random_initial()
    hmm.set_random_transitions()
    hmm.fit(X)
    hmms.append(hmm)
hmm_clf = HMMClassifier()
hmm_clf.fit(hmms)
but also in a way that works well with the KNNClassifier, e.g.:
from sequentia.classifiers import KNNClassifier
from sequentia.datasets import trajectories
dataset = trajectories(n_class=10)
knn_clf = KNNClassifier(k=3, classes=dataset.classes)
knn_clf.fit(*dataset.data)
# where dataset.data = (X, y)
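One possible shape for the object returned by the loader, as a rough sketch only: the class name `SequenceDataset` and its constructor are hypothetical, and only `iter_by_class()`, `data` and `classes` are taken from the two examples above. Sequences are plain nested lists here to keep the sketch free of dependencies.

```python
from collections import defaultdict


class SequenceDataset:
    """Hypothetical container for labelled sequences (not sequentia API)."""

    def __init__(self, X, y):
        if len(X) != len(y):
            raise ValueError("X and y must have the same length")
        self.X = list(X)
        self.y = list(y)

    @property
    def classes(self):
        # Sorted unique labels, e.g. to pass to KNNClassifier(classes=...)
        return sorted(set(self.y))

    @property
    def data(self):
        # (X, y) pair, so that clf.fit(*dataset.data) works
        return self.X, self.y

    def iter_by_class(self):
        # Yield (sequences, label) pairs, one per class, for per-class HMMs
        groups = defaultdict(list)
        for seq, label in zip(self.X, self.y):
            groups[label].append(seq)
        for label in self.classes:
            yield groups[label], label
```

Keeping both access patterns on one object means the same dataset can feed the per-class HMM loop and the single `fit(X, y)` call without any conversion step.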
We also need to consider how to make it easy to split the data into training/validation/test sets.
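As a starting point for discussion, splitting could be done per class so that each subset keeps roughly the same class balance. The sketch below is an assumption about how this might look; `stratified_split`, its `fractions` default and the `seed` argument are hypothetical names and values, not existing sequentia API.

```python
import random
from collections import defaultdict


def stratified_split(X, y, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (X, y) into train/validation/test subsets, class by class."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)

    # Group example indices by label
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    splits = [[], [], []]
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        # Whatever remains after train and validation goes to test
        cuts = (idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:])
        for part, chosen in zip(splits, cuts):
            part.extend(chosen)

    # One (X, y) pair per subset, matching the dataset.data convention
    return [([X[i] for i in part], [y[i] for i in part]) for part in splits]
```

Returning `(X, y)` pairs keeps each subset usable in the same way as `dataset.data`, e.g. `train, val, test = stratified_split(*dataset.data)`.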
Note: Make sure to fix the tests that were skipped in #201.