asturkmani opened 7 years ago
Try comparing the likelihood of a hidden Markov process instantiated randomly versus one instantiated from the clusters found by G-Means
Compare the Anderson-Darling scores of the G-Means clusters to the Anderson-Darling scores of random GMMs
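The Anderson-score comparison could be sketched roughly as below, assuming "Anderson scores" means the Anderson-Darling statistic that G-Means relies on. The data, cluster counts, and seeds are illustrative stand-ins, not the real app-usage samples:

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 1-D stand-in: two well-separated Gaussian clusters.
X = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)]).reshape(-1, 1)

# Stand-in for the "G-Means found clusters": k-means with the right k.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
cluster_scores = [anderson(X[labels == k].ravel()).statistic
                  for k in np.unique(labels)]

# Random baseline: assignments from a randomly initialised, barely fit GMM.
gmm = GaussianMixture(n_components=2, init_params="random",
                      max_iter=1, random_state=0).fit(X)
rand_labels = gmm.predict(X)
random_scores = [anderson(X[rand_labels == k].ravel()).statistic
                 for k in np.unique(rand_labels)
                 if np.sum(rand_labels == k) >= 8]   # skip degenerate groups

print("found-cluster A-D statistics:", cluster_scores)
print("random-GMM A-D statistics:", random_scores)
# A lower statistic means closer to Gaussian; the found clusters
# should typically score lower than the random partition.
```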
Objective: Uncover the generative model behind these data samples
Test:
[x] PCA dimensionality reduction
[x] Cluster with k-means
Conclusion: The principal modes mainly reveal which apps are used most. While this is interesting, it tells us little about the generative latent space; Euclidean metrics don't capture the nuances of the data.
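The PCA + k-means pipeline above might look like the following sketch; the synthetic usage matrix, component count, and cluster count are assumptions for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for per-user app-usage feature vectors: 200 samples, 50 apps.
X = rng.poisson(3.0, size=(200, 50)).astype(float)

# Reduce to a handful of principal modes, then cluster in the reduced space.
Z = PCA(n_components=5).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)

print("reduced shape:", Z.shape)          # (200, 5)
print("cluster sizes:", np.bincount(labels))
```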
[x] Cluster/Model with G-Means. G-Means assumes the clusters are sampled from latent Gaussian distributions; consequently, each cluster represents samples from a distribution in a latent space, where each cluster indicates an underlying individual state, for example focusing on work, being bored, or actively checking social media
[x] Cluster/Model with Gaussian Mixture Models using G-Means algorithm
[x] Use Variational Autoencoders to learn latent representation
Conclusions: Some Gaussian structure exists in the data, since the G-Means clusters pass the Anderson-Darling tests; however, the captured structure is still somewhat vague, and more information may be gained from dimensionality reduction techniques that account for time dependency.
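The G-Means acceptance test mentioned in these conclusions could be sketched as below: fit two sub-centres inside a cluster, project the points onto the axis joining them, and run the Anderson-Darling test on that projection. The function name, significance index, and synthetic blobs are illustrative assumptions:

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def gmeans_keep_one_cluster(points, alpha_index=0):
    """Return True if the cluster looks Gaussian (i.e. should NOT be split)."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    c0, c1 = km.cluster_centers_
    v = c1 - c0                               # candidate splitting direction
    proj = points @ v / (v @ v)               # 1-D projection onto that axis
    proj = (proj - proj.mean()) / proj.std()  # standardise (optional; anderson
                                              # estimates parameters itself)
    res = anderson(proj)
    # Accept if the statistic is below the chosen critical value.
    return bool(res.statistic < res.critical_values[alpha_index])

rng = np.random.default_rng(0)
gaussian_blob = rng.normal(0, 1, size=(400, 2))
two_blobs = np.vstack([rng.normal(-4, 1, (200, 2)),
                       rng.normal(4, 1, (200, 2))])

# A single Gaussian should typically be kept; a bimodal cluster should split.
print("gaussian blob kept:", gmeans_keep_one_cluster(gaussian_blob))
print("two blobs kept:", gmeans_keep_one_cluster(two_blobs))
```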
[x] LSTM embedding matrix for dimensionality reduction
[ ] LSTM encoder-decoder for sequence-to-sequence modelling
[ ] LSTM encoder-decoder with attention mechanism
[x] Hidden Markov Model to infer generative Markovian process
[x] (Variational) Autoencoder with time-dependency
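The HMM likelihood comparison suggested at the top of this issue could be sketched with a small numpy/scipy forward algorithm: score one observation sequence under two Gaussian HMMs, one with randomly drawn emission means and one with means taken from found cluster centres. All parameters and the synthetic two-regime sequence below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

def hmm_log_likelihood(obs, start, trans, means, stds):
    """Log-likelihood of a 1-D sequence under a Gaussian HMM (scaled forward algorithm)."""
    alpha = start * norm.pdf(obs[0], means, stds)
    log_lik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for x in obs[1:]:
        alpha = (alpha @ trans) * norm.pdf(x, means, stds)
        c = alpha.sum()          # rescale each step to avoid underflow
        log_lik += np.log(c)
        alpha = alpha / c
    return log_lik

rng = np.random.default_rng(0)
# Synthetic sequence alternating between two regimes around -3 and +3.
states = (np.arange(300) // 30) % 2
obs = rng.normal(np.where(states == 0, -3.0, 3.0), 1.0)

start = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05],
                  [0.05, 0.95]])

# "Cluster-initialised" emission means vs randomly drawn ones.
ll_cluster = hmm_log_likelihood(obs, start, trans,
                                np.array([-3.0, 3.0]), np.array([1.0, 1.0]))
ll_random = hmm_log_likelihood(obs, start, trans,
                               rng.normal(0, 1, 2), np.array([1.0, 1.0]))

print("cluster-initialised log-likelihood:", ll_cluster)
print("randomly initialised log-likelihood:", ll_random)
```

If the cluster-derived means track the true regimes, the first model should assign the sequence a substantially higher log-likelihood.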