fmorenopino / HeterogeneousHMM

Discrete, Gaussian, and Heterogeneous HMM models fully implemented in Python. Missing data, Model Selection Criteria (AIC/BIC), and Semi-Supervised training supported. Easily extendable with other types of probabilistic models.
https://pyhhmm.readthedocs.io/en/latest/
Apache License 2.0
74 stars 14 forks

Why do we need many sequences? #6

al33m501 closed 1 year ago

al33m501 commented 1 year ago

Hi! I carefully read the full example, and my question is about the "HMM with labels" part.

In the sample dataset, we have 10 sequences split by time. Why do we need to do this? What is the purpose?

fmorenopino commented 1 year ago

Hi!

Not sure I understood your question correctly. In case you are wondering why we are using >1 sequence to learn the model's parameters (instead of just one), the answer is analogous to the training of any ML method. For the specific case of the HHMM, you could learn its parameters by just using one sequence, as you mentioned, but the ability of a model trained with that small amount of data to generalise would be very poor.

Anyhow, the examples in that notebook illustrate how to perform the simplest tasks the library supports. In a realistic setup, you would use as much data as possible.

al33m501 commented 1 year ago

@fmorenopino Sorry, it seems I described the question inaccurately. In the example we use 10 sequences, each with shape (286,5). Why can't we just use one sequence with shape (2860,5)?

fmorenopino commented 1 year ago

Understood! You could fit the model using a single sequence that concatenates the 10 sub-sequences. Using a separate dimension (besides time) to index different sequences is a software-design choice that makes it easier to handle your data before training the model.

Sometimes you may wish to divide your sequence of duration T into (let's say) 10 sequences of duration T/10 each. Other times, you have different sequences with no particular order. In this case, it does not seem best to force users to join all those different sequences along their time axis.
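
To make the two layouts concrete, here is a minimal NumPy sketch. It is not taken from the notebook: the array is random placeholder data, the variable names are made up for illustration, and it only shows the reshaping, not any call into the library. It also counts the consecutive-step pairs in each layout. If the 10 sequences are independent, joining them along the time axis adds one pair at each boundary that never occurred in the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data (assumption): one long recording, 2860 time steps x 5 features.
long_sequence = rng.normal(size=(2860, 5))

# Layout A: a list of 10 sequences, each of shape (286, 5).
sequences = np.split(long_sequence, 10, axis=0)

# Layout B: the same data joined back along the time axis, shape (2860, 5).
rejoined = np.concatenate(sequences, axis=0)

# Count consecutive-step pairs (t, t+1) available in each layout.
transitions_split = sum(len(seq) - 1 for seq in sequences)
transitions_joined = len(rejoined) - 1

# Pairs that exist only because the sequences were glued together.
boundary_transitions = transitions_joined - transitions_split

print(len(sequences), sequences[0].shape)  # 10 (286, 5)
print(rejoined.shape)                      # (2860, 5)
print(boundary_transitions)                # 9
```

Both layouts hold exactly the same observations. The difference is whether each sequence is treated as having its own starting point or as a continuation of the previous one.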

I hope it helps!