AdaptiveMotorControlLab / CEBRA

Learnable latent embeddings for joint behavioral and neural analysis - Official implementation of CEBRA
https://cebra.ai

Expose `n_bins` argument to `cebra_sklearn_helpers.align_embeddings` instead of fixing default value internally #24

Closed: drsax93 closed this issue 1 year ago

drsax93 commented 1 year ago


Bug description

Hello, I am trying to compute the consistency score across different embeddings of hippocampal population activity, obtained using the 2d tracking position as the auxiliary variable. To compute the consistency score I have tried to use as labels either the linearised 2d position or another discrete labelling, but I get an error in cebra_sklearn_helpers.align_embeddings when the embeddings are quantised with the new labels. I believe it might be due to the high number of bins (n_bins) used within the _coarse_to_fine() function. What do you think the issue may be?

Operating System

operating system: Ubuntu 20.04

CEBRA version

cebra version 0.2.0

Device type

gpu

Steps To Reproduce

Here is a snippet of the code

# Between-datasets consistency, by aligning on the labels
import cebra

# embeddings as list of np.ndarrays
embds = [cebra_w[e][m] for e in exps for m in MICE[:2]]
# labels as list of 1d np.ndarrays with linearised tracking position
labels = [lineariseTrack(track[e][m][:, 0], track[e][m][:, 1], binsize=30)
          for e in exps for m in MICE[:2]]

scores, pairs, datasets = cebra.sklearn.metrics.consistency_score(embeddings=embds,
                                                                  labels=labels,
                                                                  between="datasets")

Relevant log output

ValueError                                Traceback (most recent call last)
Cell In[16], line 7
      3 embds = [cebra_w[e][m] for e in exps for m in MICE[:2]]
      4 labels = [lineariseTrack(track[e][m][:,0], track[e][m][:,1], binsize=30)\
      5           for e in exps for m in MICE[:2]]
----> 7 scores, pairs, datasets = cebra.sklearn.metrics.consistency_score(embeddings=embds,
      8                                                                   labels=labels,
      9                                                                   between="datasets")
     10 cebra.plot_consistency(scores, pairs=pairs, datasets=subjects, colorbar_label=None)

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/metrics.py:362, in consistency_score(embeddings, between, labels, dataset_ids)
    359     scores, pairs, datasets = _consistency_runs(embeddings=embeddings,
    360                                                 dataset_ids=dataset_ids)
    361 elif between == "datasets":
--> 362     scores, pairs, datasets = _consistency_datasets(embeddings=embeddings,
    363                                                     dataset_ids=dataset_ids,
    364                                                     labels=labels)
    365 else:
    366     raise NotImplementedError(
    367         f"Invalid comparison, got between={between}, expects either datasets or runs."
    368     )

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/metrics.py:205, in _consistency_datasets(embeddings, dataset_ids, labels)
    200     raise ValueError(
    201         "Invalid number of dataset_ids, expect more than one dataset to perform the comparison, "
    202         f"got {len(datasets)}")
    204 # NOTE(celia): with default values normalized=True and n_bins = 100
--> 205 aligned_embeddings = cebra_sklearn_helpers.align_embeddings(
    206     embeddings, labels)
    207 scores, pairs = _consistency_scores(aligned_embeddings,
    208                                     datasets=dataset_ids)
    209 between_dataset = [p[0] != p[1] for p in pairs]

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/helpers.py:138, in align_embeddings(embeddings, labels, normalize, n_bins)
    133 digitized_labels = np.digitize(
    134     valid_labels, np.linspace(min_labels_value, max_labels_value,
    135                               n_bins))
    137 # quantize embedding based on the new labels
--> 138 quantized_embedding = [
    139     _coarse_to_fine(valid_embedding, digitized_labels, bin_idx)
    140     for bin_idx in range(n_bins)[1:]
    141 ]
    143 if normalize:  # normalize across dimensions
    144     quantized_embedding_norm = [
    145         quantized_sample / np.linalg.norm(quantized_sample, axis=0)
    146         for quantized_sample in quantized_embedding
    147     ]

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/helpers.py:139, in <listcomp>(.0)
    133 digitized_labels = np.digitize(
    134     valid_labels, np.linspace(min_labels_value, max_labels_value,
    135                               n_bins))
    137 # quantize embedding based on the new labels
    138 quantized_embedding = [
--> 139     _coarse_to_fine(valid_embedding, digitized_labels, bin_idx)
    140     for bin_idx in range(n_bins)[1:]
    141 ]
    143 if normalize:  # normalize across dimensions
    144     quantized_embedding_norm = [
    145         quantized_sample / np.linalg.norm(quantized_sample, axis=0)
    146         for quantized_sample in quantized_embedding
    147     ]

File /data/phar0731/anaconda3/envs/py38/lib/python3.8/site-packages/cebra/integrations/sklearn/helpers.py:78, in _coarse_to_fine(data, digitized_labels, bin_idx)
     76     if quantized_data is not None:
     77         return quantized_data
---> 78 raise ValueError(
     79     f"Digitalized labels does not have elements close enough to bin index {bin_idx}. "
     80     f"The bin index should be in the range of the labels values.")

ValueError: Digitalized labels does not have elements close enough to bin index 95. The bin index should be in the range of the labels values.

Anything else?

The problematic bin_index varies depending on the discretisation of the position / labels
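For reference, the failure mode can be reproduced with plain numpy, independent of CEBRA. This is a sketch with hypothetical label values that leave part of the range uncovered; `labels` and `n_bins` mirror the names used in `align_embeddings`:

```python
import numpy as np

# Hypothetical linearised position labels whose coverage has a gap (40-60)
labels = np.concatenate([np.linspace(0, 40, 500), np.linspace(60, 100, 500)])
n_bins = 100  # default fixed internally by align_embeddings

edges = np.linspace(labels.min(), labels.max(), n_bins)
digitized = np.digitize(labels, edges)

# Any bin index with no samples makes _coarse_to_fine raise the ValueError
empty_bins = [b for b in range(1, n_bins) if not np.any(digitized == b)]
print(empty_bins)  # non-empty: each of these indices would trigger the error
```

Lowering `n_bins` widens the bins, and the error disappears once no bin falls entirely inside an uncovered stretch of the labels.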


stes commented 1 year ago

Thanks for reporting -- as a quick check, can you avoid the error by lowering the number of bins?

drsax93 commented 1 year ago

Hi, thanks for the quick reply! I was planning to give it a go, but I have to double-check how to change a function within an installed package. Should I just edit it in the source directory and then run `python setup.py install`?

Cheers,

Giuseppe


stes commented 1 year ago

Easiest is to clone the repo and do a local install with

pip install -e .

We might consider exposing the number of bins to the API in the future -- thanks for catching this!

drsax93 commented 1 year ago

Ok, I’ll let you know how that goes!


stes commented 1 year ago

PR #25 now contains a suggestion --- let me know if that fixes your issue.

drsax93 commented 1 year ago

Changing the number of bins works, thanks! Could you comment on how to choose the appropriate number?

Reading through the consistency score demo, it says "Correlation matrices depict the after fitting a linear model between behavior-aligned embeddings of two animals, one as the target one as the source (mean, n=10 runs)", but I don't see any shuffling / subsampling procedure in the code -- is it so?

Cheers

stes commented 1 year ago

@drsax93 ,

Could you comment on how to choose the appropriate number?

An appropriate number of bins would be one that you could also use to plot a histogram of your data: there should be no empty bins (this is what caused your original error), but there should also not be too few (in the extreme case, a single bin would make the consistency always 100%).

So for best results, try to find the largest number of bins that avoids the issue you saw above.
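One way to automate that search is a small helper -- hypothetical, not part of the CEBRA API -- that mirrors the `np.digitize`-over-`np.linspace` binning used in `align_embeddings` and returns the largest candidate count that leaves no bin empty:

```python
import numpy as np

def largest_valid_n_bins(labels, candidates=(100, 80, 60, 40, 20, 10)):
    """Return the largest candidate bin count that leaves no bin empty."""
    labels = np.asarray(labels)
    for n_bins in candidates:
        edges = np.linspace(labels.min(), labels.max(), n_bins)
        digitized = np.digitize(labels, edges)
        # same bin indices that _coarse_to_fine iterates over
        if all(np.any(digitized == b) for b in range(1, n_bins)):
            return n_bins
    return None

# Example: densely, uniformly sampled labels support the full 100 bins
rng = np.random.default_rng(0)
labels = rng.uniform(0, 100, size=5000)
print(largest_valid_n_bins(labels))  # → 100
```

With labels that leave gaps in the range (as in the original report), the helper would fall through to a coarser candidate instead.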

Reading through the consistency score demo it says Correlation matrices depict the after fitting a linear model between behavior-aligned embeddings of two animals, one as the target one as the source (mean, n=10 runs), but I don't see any shuffling / subsampling procedure in the code -- is it so?

The runs are with respect to fitting 10 independent CEBRA models. This is something you have to do yourself as an input to that function. I.e., you would fit 10 models (in the simplest case, just running through a for loop), compute the embeddings, and pass the results to the function.

Does that make sense?

drsax93 commented 1 year ago

Yes, thanks!

I interpreted the text as suggesting that the score was obtained on subsamples, but the way you explained it just now makes sense.
