dattalab / keypoint-moseq

https://keypoint-moseq.readthedocs.io

plotting functions to compare between groups error at several steps #140

Closed mshallow closed 2 months ago

mshallow commented 3 months ago

Using the Statistical Analysis Jupyter notebook (analysis.ipynb), several of the plotting functions for comparing between groups raise errors. The first is that plot_syll_stats_with_sem fails every time, regardless of the data included in stats_df, with a KeyError stating that ['syllable'] is not present in the DataFrame. The code and error output are below:

kpms.plot_syll_stats_with_sem(
    stats_df, project_dir, model_name,
    plot_sig=True,    # whether to mark statistical significance with a star
    thresh=0.05,      # significance threshold
    stat='frequency', # statistic to be plotted (e.g. 'duration' or 'velocity_px_s_mean')
    order='stat',     # order syllables by overall frequency ("stat") or degree of difference ("diff")
    ctrl_group='pc_l_nl',   # name of the control group for statistical testing
    exp_group='pc_l_l',    # name of the experimental group for statistical testing
    figsize=(8, 4),   # figure size    
    groups=stats_df['group'].unique(), # groups to be plotted
);
KeyError                                  Traceback (most recent call last)
Cell In[18], line 1
----> 1 kpms.plot_syll_stats_with_sem(
      2     stats_df, project_dir, model_name,
      3     plot_sig=True,    # whether to mark statistical significance with a star
      4     thresh=0.05,      # significance threshold
      5     stat='duration', # statistic to be plotted (e.g. 'duration' or 'velocity_px_s_mean')
      6     order='stat',     # order syllables by overall frequency ("stat") or degree of difference ("diff")
      7     ctrl_group='pc_l_nl',   # name of the control group for statistical testing
      8     exp_group='pc_l_l',    # name of the experimental group for statistical testing
      9     figsize=(8, 4),   # figure size    
     10     groups=stats_df['group'].unique(), # groups to be plotted
     11 );

File ~/opt/anaconda3/envs/keypoint_moseq/lib/python3.10/site-packages/keypoint_moseq/analysis.py:1119, in plot_syll_stats_with_sem(stats_df, project_dir, model_name, save_dir, plot_sig, thresh, stat, order, groups, ctrl_group, exp_group, colors, join, figsize)
   1115 sig_sylls = None
   1117 if plot_sig and len(stats_df["group"].unique()) > 1:
   1118     # run kruskal wallis and dunn's test
-> 1119     _, _, sig_pairs = run_kruskal(stats_df, statistic=stat, thresh=thresh)
   1120     # plot significant syllables for control and experimental group
   1121     if ctrl_group is not None and exp_group is not None:
   1122         # check if the group pair is in the sig pairs dict

File ~/opt/anaconda3/envs/keypoint_moseq/lib/python3.10/site-packages/keypoint_moseq/analysis.py:883, in run_kruskal(stats_df, statistic, n_perm, seed, thresh, mc_method)
    881 df_z = pd.DataFrame(real_zs_within_group)
    882 df_z.index = df_z.index.set_names("syllable")
--> 883 dunn_results_df = df_z.reset_index().melt(id_vars="syllable")
    885 # Get intersecting significant syllables between
    886 intersect_sig_syllables = {}

File ~/opt/anaconda3/envs/keypoint_moseq/lib/python3.10/site-packages/pandas/core/frame.py:9915, in DataFrame.melt(self, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
   9905 @Appender(_shared_docs["melt"] % {"caller": "df.melt(", "other": "melt"})
   9906 def melt(
   9907     self,
   (...)
   9913     ignore_index: bool = True,
   9914 ) -> DataFrame:
-> 9915     return melt(
   9916         self,
   9917         id_vars=id_vars,
   9918         value_vars=value_vars,
   9919         var_name=var_name,
   9920         value_name=value_name,
   9921         col_level=col_level,
   9922         ignore_index=ignore_index,
   9923     ).__finalize__(self, method="melt")

File ~/opt/anaconda3/envs/keypoint_moseq/lib/python3.10/site-packages/pandas/core/reshape/melt.py:74, in melt(frame, id_vars, value_vars, var_name, value_name, col_level, ignore_index)
     70 if missing.any():
     71     missing_labels = [
     72         lab for lab, not_found in zip(labels, missing) if not_found
     73     ]
---> 74     raise KeyError(
     75         "The following id_vars or value_vars are not present in "
     76         f"the DataFrame: {missing_labels}"
     77     )
     78 if value_vars_was_not_none:
     79     frame = frame.iloc[:, algos.unique(idx)]

KeyError: "The following id_vars or value_vars are not present in the DataFrame: ['syllable']"
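
Note that the traceback points at the intermediate table df_z built inside run_kruskal, not at stats_df itself. Below is a minimal sketch of that reshaping step, using made-up z-scores and group names ("ctrl" and "exp" are placeholders). It only shows what the melt call expects and how the same kind of KeyError arises when no 'syllable' column exists; it does not establish why the column was missing in the run above.

```python
import pandas as pd

# Hypothetical z-scores: one column per group, one row per syllable.
real_zs_within_group = {"ctrl": [0.1, -1.2, 2.3], "exp": [0.4, 0.9, -0.5]}

df_z = pd.DataFrame(real_zs_within_group)
df_z.index = df_z.index.set_names("syllable")

# Expected path: the named index becomes a 'syllable' column that melt can use.
long_df = df_z.reset_index().melt(id_vars="syllable")
print(long_df.columns.tolist())

# Failure path: if the frame has no 'syllable' column, melt raises a KeyError.
try:
    df_z.reset_index(drop=True).melt(id_vars="syllable")
except KeyError as err:
    melt_error = str(err)
    print(melt_error)
```

Printing `df_z.reset_index().columns` at the failing line would be one way to see what the real intermediate table contains.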

The other plotting issue is that generate_transition_matrices errors after I apply a model checkpoint to new data and add the condition and syllable labels to the overall dataset. It seems to be an indexing problem: applying the model to new data generates additional syllables but also prunes ones that appear to be noise, so the non-sequential syllable indices no longer match the length of the array/list being indexed. The code and error are below:

normalize='bigram' # normalization method ("bigram", "rows" or "columns")

trans_mats, usages, groups, syll_include=kpms.generate_transition_matrices(
    project_dir, model_name, normalize=normalize,
    min_frequency=0.005 # minimum syllable frequency to include
)    

kpms.visualize_transition_bigram(
    project_dir, model_name, groups, trans_mats, syll_include, normalize=normalize, 
    show_syllable_names=True, figsize=(25,10) # label syllables by index (False) or index and name (True)
)
Group(s): exp_l, pc_d_l, pc_d_nl, pc_l_l, pc_l_nl
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[9], line 3
      1 normalize='bigram' # normalization method ("bigram", "rows" or "columns")
----> 3 trans_mats, usages, groups, syll_include=kpms.generate_transition_matrices(
      4     project_dir, model_name, normalize=normalize,
      5     min_frequency=0.005 # minimum syllable frequency to include
      6 )    
      8 kpms.visualize_transition_bigram(
      9     project_dir, model_name, groups, trans_mats, syll_include, normalize=normalize, 
     10     show_syllable_names=True, figsize=(25,10) # label syllables by index (False) or index and name (True)
     11 )

File ~/opt/anaconda3/envs/keypoint_moseq/lib/python3.10/site-packages/keypoint_moseq/analysis.py:1492, in generate_transition_matrices(project_dir, model_name, normalize, min_frequency)
   1489 frequencies = get_frequencies(model_labels)
   1490 syll_include = np.where(frequencies > min_frequency)[0]
-> 1492 trans_mats, usages = get_group_trans_mats(
   1493     model_labels,
   1494     label_group,
   1495     group,
   1496     syll_include=syll_include,
   1497     normalize=normalize,
   1498 )
   1499 return trans_mats, usages, group, syll_include

File ~/opt/anaconda3/envs/keypoint_moseq/lib/python3.10/site-packages/keypoint_moseq/analysis.py:1386, in get_group_trans_mats(labels, label_group, group, syll_include, normalize)
   1379     trans_mats.append(
   1380         get_transition_matrix(use_labels, normalize=normalize, combine=True)[
   1381             syll_include, :
   1382         ][:, syll_include]
   1383     )
   1385     # Getting frequency information for node scaling
-> 1386     group_frequencies = get_frequencies(use_labels)[syll_include]
   1388     frequencies.append(group_frequencies)
   1389 return trans_mats, frequencies

IndexError: index 57 is out of bounds for axis 0 with size 48
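
The IndexError above is consistent with a per-group frequency array that is shorter than the largest syllable index selected from the full dataset. The sketch below reproduces that situation with plain NumPy. `syllable_frequencies` is a hypothetical helper that counts frames with `np.bincount`; it is an assumption for illustration and not the library's `get_frequencies`. Padding the counts to a fixed number of syllables (`minlength`) is one possible way to avoid the mismatch.

```python
import numpy as np

def syllable_frequencies(label_seqs, n_syllables=None):
    """Relative frequency per syllable index (hypothetical helper)."""
    labels = np.concatenate(label_seqs)
    # Without minlength, the result has length max(label) + 1.
    counts = np.bincount(labels, minlength=n_syllables or 0)
    return counts / counts.sum()

# Toy label sequences: the full dataset contains syllable 57,
# but the second recording (one "group") only goes up to 47.
all_labels = [np.array([0, 1, 1, 57]), np.array([0, 0, 2, 47])]
group_labels = [all_labels[1]]

# Syllables selected from the full dataset, as in syll_include.
syll_include = np.where(syllable_frequencies(all_labels) > 0.1)[0]

try:
    syllable_frequencies(group_labels)[syll_include]
except IndexError as err:
    print(err)  # the group array has 48 entries, but index 57 was requested

# Padding every group to the same length makes the indexing valid.
n_syllables = int(np.concatenate(all_labels).max()) + 1
group_freqs = syllable_frequencies(group_labels, n_syllables)[syll_include]
print(group_freqs)
```
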
afbrokaw commented 3 months ago

Just hopping in to say I am having the same issue with not being able to plot comparing two groups (first part of this question). It is not clear why, as when I pull up stats_df, it has a variable named 'syllable'. Is it linked to the option to label syllables? I don't currently need this step and it is a bit frustrating that so many of these plotting options rely on having run that function first.

If that is not the case, I have found that setting plot_sig=False still produces the same error, but at least also produces a figure of some kind (just without statistical significance). Looking at this figure (see attached), could this error be because some syllables are detected in one group and not the other? For example, I have gaps in the group SAL where syllable 8 is not plotted but is plotted for group MMZ.

[Attached image: Screenshot 2024-03-15 125517]

versey-sherry commented 3 months ago

Could you try the dev branch and see if the problem is fixed?

afbrokaw commented 3 months ago

Github and Python newbie - how do I do that?

amorsi1 commented 3 months ago

I've just tested it out in both the dev branch and the kruskal_fix branch without any luck. It works perfectly fine with plot_sig=False in all cases I've tried, though.

versey-sherry commented 3 months ago

https://www.squash.io/how-to-pip-install-from-a-git-repo-branch/ @afbrokaw
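
For reference, installing from a branch comes down to pointing pip at the repository with a `git+` URL and an `@branch` suffix. The sketch below builds that command in Python; the helper names are made up for illustration, and the branch name "dev" is the one mentioned in this thread. Nothing is installed unless `install_branch` is called.

```python
import subprocess
import sys

REPO_URL = "https://github.com/dattalab/keypoint-moseq"

def pip_install_branch_command(branch, repo_url=REPO_URL):
    """Build the pip command that installs one branch of a Git repository."""
    # pip's VCS syntax: git+<repo url>.git@<branch>
    return [sys.executable, "-m", "pip", "install",
            f"git+{repo_url}.git@{branch}"]

def install_branch(branch):
    """Run the install in the current environment (needs git and network access)."""
    subprocess.check_call(pip_install_branch_command(branch))

command = pip_install_branch_command("dev")
print(" ".join(command[1:]))
# To actually install, call install_branch("dev") and then restart the notebook kernel.
```

The printed text after `-m` is the same command one would type in a terminal with the keypoint_moseq environment activated. If an older copy of the package is already installed, pip's `--force-reinstall` flag may be needed.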

versey-sherry commented 3 months ago

I was not able to reproduce your errors earlier. Did you change the ctrl_group='a' and exp_group='b' parameters in the function call? If not, the function assumes you intend to compare group a and group b; since they are not in your dataset, the plot_sig=True flag is ignored.

Now I have changed the behavior of this function so that, if the groups given in ctrl_group and exp_group are not real groups in the dataset, the significant syllables for all groups are plotted. Please try this branch and see if it works for you: https://github.com/dattalab/keypoint-moseq/tree/analysis_fix

@amorsi1 when you said "without any luck", do you mean the function errors, or that no significant syllables are plotted? Also, in Slack you said the 'kruskal_fix' branch worked for you, but it turned out it didn't? Could you elaborate? @afbrokaw did you get the same errors as @mshallow, or just no significant syllables plotted? @mshallow could you check if https://github.com/dattalab/keypoint-moseq/tree/analysis_fix works for you?

Thank you everyone.
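
A quick way to rule out the placeholder-group problem described in this comment is to compare the requested names against the groups actually present in stats_df before plotting. This is a minimal sketch: `missing_groups` is a hypothetical helper, and the toy table only mimics the 'group', 'syllable' and 'frequency' columns.

```python
import pandas as pd

def missing_groups(stats_df, ctrl_group, exp_group):
    """Return the requested group names that do not occur in stats_df['group']."""
    present = set(stats_df["group"].unique())
    return [g for g in (ctrl_group, exp_group) if g not in present]

# Toy stand-in for stats_df; the group names are taken from this thread.
stats_df = pd.DataFrame({
    "group": ["pc_l_nl", "pc_l_nl", "pc_l_l", "pc_l_l"],
    "syllable": [0, 1, 0, 1],
    "frequency": [0.6, 0.4, 0.3, 0.7],
})

print(missing_groups(stats_df, "a", "b"))             # ['a', 'b']
print(missing_groups(stats_df, "pc_l_nl", "pc_l_l"))  # []
```

An empty list means both names exist in the data, so the notebook defaults 'a' and 'b' are not the cause.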

IsabelleSajonia commented 3 months ago

Hi, I'm having the same error: plot_sig=False plots the figure, but I get that error if it's set to True. I did change 'a' and 'b' to my real group names.

versey-sherry commented 3 months ago

@IsabelleSajonia Could you try installing keypoint-moseq on this branch?

Instruction for installation here: https://www.squash.io/how-to-pip-install-from-a-git-repo-branch/

IsabelleSajonia commented 3 months ago

[image] Like this? It looks like I'm still not getting significance, but I'm not sure if I am on the right branch. Also, I only have one video per group; I'm not sure if that creates a problem with the significance calculation.

versey-sherry commented 3 months ago

Can you output your stats_df as a CSV and send it to sherrylin42@gmail.com?

It is possible that your dataset is not significant in any syllables.

IsabelleSajonia commented 3 months ago

Where should I send it? And good point, it may be that it's not significant at all. Syllables 9, 14, and 17 do not seem to occur at all in one group, so I thought maybe there would be some significance. I'm going to have to do my own significance calculation at some point, so no worries. [image]

versey-sherry commented 3 months ago

Looks like GitHub masked my email. It is sherrylin42@gmail.com

You mentioned earlier that you only have one video per group. How do you go about statistical testing when you are just comparing two animals without any distribution?

IsabelleSajonia commented 3 months ago

Yes, I just realized after my first comment about the significance values that I only input two videos, so I'll try again with more than one per group and see if I can visualize with that. Sorry for the confusion.
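
Since a between-group test needs several recordings per group, it can help to count them before turning on plot_sig. This is a minimal sketch: it assumes stats_df identifies each recording in a column called 'name', which is an assumption and may differ in a given install, and the group and video names are made up.

```python
import pandas as pd

def recordings_per_group(stats_df, name_col="name"):
    """Count distinct recordings in each group (name_col is an assumed column)."""
    return stats_df.groupby("group")[name_col].nunique()

# Toy stand-in for stats_df with one video per group, as in the case above.
stats_df = pd.DataFrame({
    "group": ["group_a", "group_a", "group_b", "group_b"],
    "name": ["video_1", "video_1", "video_2", "video_2"],
    "syllable": [0, 1, 0, 1],
    "frequency": [0.5, 0.5, 0.8, 0.2],
})

counts = recordings_per_group(stats_df)
too_few = counts[counts < 2].index.tolist()
print(counts.to_dict())
print(too_few)  # groups with fewer than two recordings
```
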

mshallow commented 3 months ago

Hi @versey-sherry, I just tried the analysis_fix branch (https://github.com/dattalab/keypoint-moseq/tree/analysis_fix) with my data. It got past the error I was encountering before, but even after changing the control and experimental groups to my group labels, I still run into an error. I have more than two videos in my input (~10 per condition in this test dataset). The error I'm getting now is quoted below:

Code run:

dark_cond=['pc_d_nl','pc_d_l']
light_cond=['pc_l_nl','pc_l_l']
kpms.plot_syll_stats_with_sem(
    stats_df, project_dir, model_name,
    plot_sig=True,    # whether to mark statistical significance with a star
    thresh=0.05,      # significance threshold
    stat='frequency', # statistic to be plotted (e.g. 'duration' or 'velocity_px_s_mean')
    order='stat',     # order syllables by overall frequency ("stat") or degree of difference ("diff")
    ctrl_group='pc_d_nl',   # name of the control group for statistical testing
    exp_group='pc_d_l',    # name of the experimental group for statistical testing
    join=True,
    figsize=(8, 4),   # figure size
    groups=dark_cond
    # groups=stats_df['group'].unique(), # groups to be plotted
);

Error message:

/Users/mollyshallow/opt/anaconda3/envs/keypoint_moseq/lib/python3.10/site-packages/seaborn/_base.py:948: FutureWarning:

When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass (name,) instead of name to silence this warning.

/Users/mollyshallow/opt/anaconda3/envs/keypoint_moseq/lib/python3.10/site-packages/seaborn/_base.py:948: FutureWarning:

When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass (name,) instead of name to silence this warning.

ValueError                                Traceback (most recent call last)
Cell In[9], line 3
      1 dark_cond=['pc_d_nl','pc_d_l']
      2 light_cond=['pc_l_nl','pc_l_l']
----> 3 kpms.plot_syll_stats_with_sem(
      4     stats_df, project_dir, model_name,
      5     plot_sig=True,    # whether to mark statistical significance with a star
      6     thresh=0.05,      # significance threshold
      7     stat='frequency', # statistic to be plotted (e.g. 'duration' or 'velocity_px_s_mean')
      8     order='stat',     # order syllables by overall frequency ("stat") or degree of difference ("diff")
      9     ctrl_group='pc_d_nl',   # name of the control group for statistical testing
     10     exp_group='pc_d_l',    # name of the experimental group for statistical testing
     11     join=True,
     12     figsize=(8, 4),   # figure size
     13     groups=dark_cond
     14     # groups=stats_df['group'].unique(), # groups to be plotted
     15 );

File ~/opt/anaconda3/envs/keypoint_moseq/lib/python3.10/site-packages/keypoint_moseq/analysis.py:1197, in plot_syll_stats_with_sem(stats_df, project_dir, model_name, save_dir, plot_sig, thresh, stat, order, groups, ctrl_group, exp_group, colors, join, figsize)
   1195 else:
   1196     continue
-> 1197 markings = np.concatenate(markings)
   1198 plt.scatter(markings, [-0.05] * len(markings), color="r", marker="*")
   1200 # manually define a new patch

File <__array_function__ internals>:180, in concatenate(*args, **kwargs)

ValueError: need at least one array to concatenate

I can also email you my stats df if that would be helpful!
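
The ValueError above is what NumPy raises when `np.concatenate` receives an empty list, which would happen if no syllables were collected for marking. Below is a minimal sketch of a guard for that case; `concat_or_empty` is a hypothetical helper and is not necessarily how the library handles it.

```python
import numpy as np

def concat_or_empty(arrays):
    """Concatenate a list of arrays, returning an empty array for an empty list."""
    if len(arrays) == 0:
        return np.array([], dtype=int)
    return np.concatenate(arrays)

# Reproduce the error from the traceback with an empty list.
try:
    np.concatenate([])
except ValueError as err:
    print(err)

# With the guard, an empty list simply yields nothing to mark.
markings = concat_or_empty([])
print(len(markings))      # 0

markings = concat_or_empty([np.array([3, 7]), np.array([12])])
print(markings.tolist())  # [3, 7, 12]
```
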

versey-sherry commented 3 months ago

Thank you! I just pushed a hotfix, could you pull the latest changes and have a look? Thanks!

mshallow commented 3 months ago

Yep, I'll try that right now!

mshallow commented 3 months ago

That fixed the significance marking issue! However, I'm still having issues running generate_transition_matrices; they seem to have started once I updated the dataset to include trials that weren't in the initial training dataset.

calebweinreb commented 2 months ago

@versey-sherry is this resolved? Also, have the relevant changes been merged into main?