YttriLab / A-SOID

An active learning platform for expert-guided, data efficient discovery of behavior.

embedding_output.sav #63

Closed pozel closed 10 months ago

pozel commented 10 months ago

To do exploratory analysis on the clusters found by unsupervised learning, I tried to open the embedding_output.sav file in SPSS. How are others extracting information from this file? Is it possible to export this data to CSV format in the GUI?

[Image: asoid_troubleshoot_sav]

JensBlack commented 10 months ago

Hi! Thanks for using A-SOiD. The .sav file stores internal information, and the GUI has no feature to export it directly. However, the clustering results can be exported in the directed discovery step.

If you want to open the .sav file, you can use this code snippet in Python:

import joblib

path_to_sav = r"FULL/PATH/EMBEDDING.sav"
# The file holds four objects; unpack them in this order
with open(path_to_sav, 'rb') as fr:
    umap_embeddings, assignments, soft_assignments, pred_assign = joblib.load(fr)

Structure

Each of the four loaded objects is a dictionary keyed by target behavior, with the following structure:

target_behaviors = ["grooming", "sniffing", "turn", "locomotion"]

umap_embeddings = {key: [] for key in target_behaviors}
assignments = {key: [] for key in target_behaviors}
soft_assignments = {key: [] for key in target_behaviors}
pred_assign = {key: [] for key in target_behaviors}

So you can retrieve the directed discovery results for each behavior separately by using the target behavior name as a key:

target_behavior = "grooming"
umap_embedd_groom = umap_embeddings[target_behavior]
pred_assign_groom = pred_assign[target_behavior]

The assignments are one label (0-n_clusters) per row. The embeddings are the multidimensional embedding based on the features. Note that your entire data set is concatenated in there, so differentiating between input sessions is not possible without backtracing the feature extraction process.
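As a quick check on what a clustering contains, the number of rows per cluster label can be counted. This is only a minimal sketch: cluster_sizes is a hypothetical helper, not part of A-SOiD, and it assumes the assignments for one behavior form a flat sequence of integer labels, with -1 marking noise as in the plotting code in this thread.

```python
from collections import Counter


def cluster_sizes(assign):
    """Return {label: number of rows}, sorted by label.

    Hypothetical helper; assumes `assign` is a flat sequence of integer labels.
    """
    counts = Counter(int(a) for a in assign)
    return dict(sorted(counts.items()))


# Example call, assuming the dictionaries were loaded from the .sav file:
# cluster_sizes(pred_assign["grooming"])
```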

Visualization:

In the app, we visualize the first two dimensions of the embedding, grouped by the labels from pred_assign.

Here is a quick function that produces this kind of plot:

import matplotlib.pyplot as plt
import numpy as np

plt.style.use('default')
def plot_hdbscan_embedding_matplotlib(assign, embeds, behav="test"):
    """Scatter the first two embedding dimensions, one color per cluster label."""
    # Accept plain lists as well as NumPy arrays
    assign = np.asarray(assign)
    embeds = np.asarray(embeds)

    # np.unique returns sorted labels, so -1 (noise), if present, comes first
    unique_classes = np.unique(assign)
    group_types = [f'Group {i}' for i in unique_classes if i >= 0]
    if -1 in unique_classes:
        group_types = ["Noise"] + group_types

    fig, ax = plt.subplots(figsize=(10, 10))
    for num, g in enumerate(unique_classes):
        # Rows that belong to the current cluster
        idx = np.where(assign == g)[0]
        ax.scatter(embeds[idx, 0],
                   embeds[idx, 1],
                   label=group_types[num],
                   s=3)
    ax.legend()
    ax.set_title(behav.capitalize())
    ax.set_xlabel('UMAP (Dim. 1)')
    ax.set_ylabel('UMAP (Dim. 2)')
    ax.set_aspect('equal', 'datalim')
    # Remove ticks
    ax.set_xticks([])
    ax.set_yticks([])
    # Remove top and right borders
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.show()
    return fig

Example Result:

[Image: example plot]


SPSS

Unfortunately, I am not working with SPSS myself, so I am unsure whether you can import these files directly. However, after you have trained your active learning algorithm with the new clusters, you can use it to predict the clusters on your data. This will result in CSV files that are in a standard format and split by input session.
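If you need the embedding itself as a CSV before that step, a rough sketch like the following could write it out for one behavior. Note that write_cluster_csv, the column names, and the output file name are all made up for illustration and are not part of A-SOiD. It assumes the embedding is a 2D array-like with one row per sample and that the assignments line up with it row by row.

```python
import csv


def write_cluster_csv(csv_path, embeds, assign):
    """Write one row per sample: embedding dimensions, then the cluster label.

    Hypothetical helper; returns the number of data rows written.
    """
    rows = [[float(v) for v in row] for row in embeds]
    labels = [int(a) for a in assign]
    if len(rows) != len(labels):
        raise ValueError("embeddings and assignments must have the same number of rows")

    n_dims = len(rows[0]) if rows else 0
    header = [f"umap_dim_{i + 1}" for i in range(n_dims)] + ["cluster"]

    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row, label in zip(rows, labels):
            writer.writerow(row + [label])
    return len(rows)


# Example call, assuming the dictionaries were loaded from the .sav file:
# write_cluster_csv("grooming_clusters.csv",
#                   umap_embeddings["grooming"], pred_assign["grooming"])
```

Keep in mind that the rows are still the concatenated data, so this file does not tell you which input session a row came from.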

Let me know if this helps!