In the previous release of keypoint-MoSeq, we included an "extract_results" step that saved syllable sequences along with a "reindexed" version of the syllable sequences in which syllables were re-labeled by frequency (so syllable "0" was the most frequent, and so on). But this approach had a fatal flaw: when a fitted model was applied to new data, the syllable frequencies could be different, which would lead to a slightly different re-labeling, so that e.g. syllable "0" would refer to one state in a subset of recordings and a different state in another subset.
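To make the flaw concrete, here is a minimal sketch that uses made-up syllable sequences. The helper `frequency_relabel` and the toy labels are hypothetical and are not part of keypoint-MoSeq; the sketch only illustrates how ranking by frequency on two datasets can yield two different mappings for the same fitted states.

```python
from collections import Counter


def frequency_relabel(labels):
    """Map each syllable to its frequency rank (0 = most frequent).

    Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    # Sort by descending count; break ties by the original label.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return {old: new for new, old in enumerate(ranked)}


# Toy syllable sequences produced by the same fitted model (made-up data).
training_labels = [17, 17, 17, 42, 42, 8]
new_labels = [42, 42, 42, 17, 17, 8]

training_map = frequency_relabel(training_labels)
new_map = frequency_relabel(new_labels)

# State 17 is "syllable 0" in the training data, but state 42 takes that
# label in the new data, so the reindexed labels are no longer comparable.
print(training_map)
print(new_map)
```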
In the future though, it would still be nice to have reindexing as an option so that the syllable labels aren't a random sparse subset of numbers between 0 and 100. To make the reindexing consistent, I propose that we reindex the model itself in addition to the outputs. Here's how this would work:
At this step in the pipeline (after modeling, prior to extracting results), we would insert a new reindexing step. It would have two parts: the first reindexes the fitted checkpoint, and the second saves the updated checkpoint to disk.
The reindex_checkpoint function would calculate a new ordering of syllables based on frequency and then systematically permute all the contents of the checkpoint according to this ordering (e.g. the AR params, transition matrix, and syllable labels, including those stored in the "history").
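A rough sketch of what such a function could do is shown below. It operates on a simplified, flat dictionary; the key names (`Ab`, `trans_probs`, `z`, `history`) and the layout are assumptions made for illustration and do not reflect the actual checkpoint format, which would need each of its arrays handled in the same way.

```python
import numpy as np


def reindex_checkpoint(checkpoint):
    """Return a copy of a toy checkpoint with syllables relabeled by frequency.

    Sketch only: the dict layout is hypothetical, not the real format.
    """
    num_states = checkpoint["trans_probs"].shape[0]
    counts = np.bincount(checkpoint["z"], minlength=num_states)

    # order[new_label] = old_label; a stable sort keeps ties in original order.
    order = np.argsort(-counts, kind="stable")

    # inverse[old_label] = new_label, used to relabel syllable sequences.
    inverse = np.empty_like(order)
    inverse[order] = np.arange(num_states)

    return {
        # Per-state parameters are permuted along the state axis.
        "Ab": checkpoint["Ab"][order],
        # The transition matrix is permuted along both rows and columns.
        "trans_probs": checkpoint["trans_probs"][order][:, order],
        # Syllable labels are mapped through the inverse permutation.
        "z": inverse[checkpoint["z"]],
        # Labels saved at earlier iterations get the same mapping.
        "history": {it: inverse[z] for it, z in checkpoint["history"].items()},
    }


# Toy checkpoint with three states (made-up values).
checkpoint = {
    "Ab": np.arange(3.0).reshape(3, 1),
    "trans_probs": np.array([[0.8, 0.1, 0.1],
                             [0.2, 0.7, 0.1],
                             [0.3, 0.3, 0.4]]),
    "z": np.array([2, 2, 2, 0, 0, 1]),
    "history": {0: np.array([1, 1, 2, 0, 0, 2])},
}

reindexed = reindex_checkpoint(checkpoint)
print(reindexed["z"])
```

Because the ordering is computed once from the fitted model and baked into the saved checkpoint, applying that checkpoint to new data would reuse the same labels rather than re-ranking by the new data's frequencies.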
For now, as a temporary fix for the inconsistent relabeling described above, I have removed all reindexing from the pipeline (see https://github.com/dattalab/keypoint-moseq/commit/304fcf41732cff95739f31ff1e86fb03c1e204b4 and https://github.com/dattalab/keypoint-moseq/commit/45de8a18738f12404309b6a2d85e68d5adb77dd5).