emdann / milopy

Python implementation of Milo for differential abundance testing on KNN graph
MIT License
60 stars 7 forks source link

milo.DA_nhoods "ValueError" covariates cannot be unambiguously assgined to each sample #47

Closed quiquevb23 closed 3 months ago

quiquevb23 commented 3 months ago

I am using Milopy to run DA analysis on a single cell dataset, I have followed the steps in the tutorial, but at the stage of running the following line: milo.DA_nhoods(adata, design="~cluster_continuous") I get the error ValueError: Covariates cannot be unambiguously assigned to each sample -- each sample value should match a single covariate value

I have 28 clusters identified with Leiden and 7 samples (conditions), and I want to test for Differential Abundance of clusters for these conditions, assuming that each of this 7 samples are unique, meaning there are no replicates for each condition.

AssertionError Traceback (most recent call last) File ~/anaconda3/envs/scanpy-env/lib/python3.12/site-packages/milopy/core.py:209, in DA_nhoods(adata, design, model_contrasts, subset_samples, add_intercept) 208 try: --> 209 assert nhoods_var.loc[nhood_adata.var_names].shape[0] == len( 210 nhood_adata.var_names) 211 except:

AssertionError:

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last) Cell In[67], line 1 ----> 1 milo.DA_nhoods(adata, design="~cluster_continuous")

File ~/anaconda3/envs/scanpy-env/lib/python3.12/site-packages/milopy/core.py:212, in DA_nhoods(adata, design, model_contrasts, subset_samples, add_intercept) 209 assert nhoods_var.loc[nhood_adata.var_names].shape[0] == len( 210 nhood_adata.var_names) 211 except: --> 212 raise ValueError( 213 "Covariates cannot be unambiguously assigned to each sample -- each sample value should match a single covariate value") 214 nhood_adata.var = nhoods_var.loc[nhood_adata.var_names] 215 # Get design dataframe

ValueError: Covariates cannot be unambiguously assigned to each sample -- each sample value should match a single covariate value

emdann commented 3 months ago

Hi @quiquevb23, this error hints at the fact that you have cells from multiple samples in different clusters, therefore you can't use Milo to test for differential abundance between clusters. When you run count_nhoods you are making a matrix where rows corresponds to cell neighbourhoods and columns correspond to your samples. The aim of the DA test is to identify neighbourhoods where there are differences in cell numbers between samples in different conditions of interest (e.g. is the number of cells in neighbourhood A significantly higher in samples from disease patients compared to samples from healthy patients?). This means that (a) each sample (column in the count matrix) needs to be assigned to just one level of the condition specified in the design, which you can't do for clusters, and (b) you need multiple samples per condition or the model has no way of knowing whether the differences in cell numbers are significant.

Based on how you describe your question, there are a few things you could do:

  1. Test for differences between conditions in neighbourhoods (so switch design to ~condition), and then look for the clusters in which enriched/depleted neighbourhoods fall in using annotate_nhoods - but you do need at least 2 samples/replicates per condition
  2. Assign neighbourhoods to clusters with annotate_nhoods, then just inspect the number of cells in the nhood x sample matrix (you can access this in adata.uns['nhood_adata'].X)
  3. Use a differential abundance method that tests for differences between cells in clusters, such as scCODA (see guided examples here)

As a side note, the functionality in this package has now been re-implemented and optimized in https://github.com/scverse/pertpy (tutorial). I will soon archive this package, so I recommend switching to pertpy which will remain actively maintained and updated.

quiquevb23 commented 3 months ago

Okay thank you so much, I now understand that for DA with clusters I would need to use other method as you pointed out, however I wonder if it is still possible to use Milo with my covariate variable being "injury" but having only 2 outcomes: control vs injured. Is 2 levels for the condition enough for Milo to treat it as a continuous variable, or more levels are required (for example time points after injury or severity degree)? In the case where only 1 sample per condition can be supplied, do you think there is a way to still use Milo with a statistical treatment, like some sort of randomization? And one last question, if my experimental design also includes a treated group and I want to explore the neighborhood enrichment for these covariate, can Milo handle this, since now the continuous variable does not only have "healthy" and "injured" outcomes, but also "treated"? I will check pertpy too, thank you again for your detailed answer!