aditisk commented 1 year ago

Dear miloDE team,

Thank you so much for developing this package ! We are looking forward to utilizing this on our dataset. I started by following the steps in vignette. Everything works perfectly until I reach the AUC calculation portion. I am getting an error message as pasted below:

stat_auc = suppressWarnings(calc_AUC_per_neighbourhood(sce_milo , sample_id = "Patient" , condition_id = "milo", min_n_cells_per_sample = 100)) Error in mutate(): ! Problem while computing metrics = pred %>% map(metric_fun). Caused by error in metric_set(): ! Failed to compute roc_auc(). Caused by error in parse(): ! :1:7: unexpected symbol 1: Using an ^ Run rlang::last_error() to see where the error occurred.

I tried to run last_error() and the output is pasted below:

rlang::last_error() <error/dplyr:::mutate_error> Error in mutate(): ! Problem while computing metrics = pred %>% map(metric_fun). Caused by error in metric_set(): ! Failed to compute roc_auc(). Caused by error in parse(): ! :1:7: unexpected symbol 1: Using an ^

Backtrace:

base::suppressWarnings(...)

purrr::map(., metric_fun)

Augur (local) .f(.x[[i]], ...)

yardstick (local) multi_metric(x, truth = true, estimate = pred, prob_select, estimator = estimator)

base::mapply(...)

yardstick (local) <fn>(dots[[1L]][[1L]], dots[[2L]][[1L]])

Could you please help me figure out how to fix this ? Thanks.

amissarova commented 1 year ago

Hey, couple of suggestions to try/confirm - let me know if either of them work:

Just want to make sure that your argument is correct - condition_id should be a name of the field from colData(sce_milo) that specifies to which condition your cells belong (e.g. the usual/standard names would be names as disease, perturbation_status etc). Im a bit surprised that it is called milo so just want to make sure that this is correct.
Does it work for lower min_n_cells_per_sample? Can you try something like 5-10. If that fixes the problem, then I assume it is possible that you never have neighbourhoods with 100 cells per sample and this part of the algorithm can not work, since if takes no cells to compare.

If that doesn't help, then I assume the problem is with the Augur's calculate_auc function itself (in some weird edge cases it doesn't work for me as well). Let me know and maybe you could share some of the data, and I can try to reproduce the error.

Meanwhile, you could also proceed wo this step since it is optional anyway and not essential for the main part of the algorithm (DE testing itself).

aditisk commented 1 year ago

Hi @amissarova, thank you so much for the quick response.

Yes, "milo" is a valid column name in my object. This is basically a temporary metadata column I added while I was running the original milo algorithm. It has a selection of cells I'm interested in and everything else is just called "other cells". Sorry about the confusion.
I am using min_n_cells_per_sample=1 so that's probably not the issue.

These data are unpublished so unfortunately I won't be able to share them. If you can share which edge cases it doesn't work for you, maybe I can check if that is the issue with my data.

Could you suggest a different way to subset neighborhoods to perform DE testing since the AUC filter is not an option ? I have ~370 neighborhoods detected so i would like to filter them before proceeding with the DE testing.

Thanks a lot for your help.

amissarova commented 1 year ago

So it doesnt work with neither min_n_cells_per_sample =1 nor min_n_cells_per_sample =100?

aditisk commented 1 year ago

I apologize, in my response above I typed the wrong number, I was trying a bunch of options and got confused.

I just double checked with values 1, 10 and 100 and got the same error each time.

amissarova commented 1 year ago

Can you please pass n_threads =1 and send me the error you are getting then? (since Augur in itself employs multithreading, it is v hard to parse error messages - setting n_threads to 1 will help me to understand more specifically what is the error)

aditisk commented 1 year ago

Here you go:

stat_auc = suppressWarnings(calc_AUC_per_neighbourhood(sce_milo , sample_id = "Patient" , condition_id = "milo", min_n_cells_per_sample = 1,n_threads =1)) Error in mutate(): ! Problem while computing metrics = pred %>% map(metric_fun). Caused by error in metric_set(): ! Failed to compute roc_auc(). Caused by error in parse(): ! :1:7: unexpected symbol 1: Using an ^ Run rlang::last_error() to see where the error occurred.

amissarova commented 1 year ago

Can i also confirm whether each Patient is associated with only one condition? i.e. all the cells from the same patient belong to only one condition (based on milo column)?

aditisk commented 1 year ago

No they are not, the milo column has 2 groups but I have a total of 35 patients (7 in one group and the remaining in the other according to the milo column).

amissarova commented 1 year ago

Ok, so I'm not entirely sure what happens, but: one possibility is that in some neighbourhoods, the total number of cells in at least one condition is less than 4 (in this case, Augur's calculate_auc function can not run; initially I had min_cells threshold set to 3, when it had to be 4). However, this error could only occur if min_n_cells_per_sample < 4 so I'm really stumbled re why it is not working for thresholds higher than that (you say you get the same error for 10 and 100, right?).

Anyways - I now updated this function and set the threshold to 4. Can you please download package again and let me know if that fixes the problem? Also, can you try to run it with and without parallelization and see if any of those options returns error?

aditisk commented 1 year ago

I reinstalled the package and tried with and without parallelization and the outputs are pasted below:

With parallelization

stat_auc = suppressWarnings(calc_AUC_per_neighbourhood(sce_milo , sample_id = "Patient" , condition_id = "milo", min_n_cells_per_sample = 10,n_threads =1,BPPARAM = mcparam)) Stop worker failed with the error: wrong args for environment subassignment Error: BiocParallel errors 0 remote errors, element index: 321 unevaluated and other errors first remote error:

Without parallelization

stat_auc = suppressWarnings(calc_AUC_per_neighbourhood(sce_milo , sample_id = "Patient" , condition_id = "milo", min_n_cells_per_sample = 10)) Error in mutate(): ! Problem while computing metrics = pred %>% map(metric_fun). Caused by error in metric_set(): ! Failed to compute roc_auc(). Caused by error in parse(): ! :1:7: unexpected symbol 1: Using an ^ Run rlang::last_error() to see where the error occurred.

amissarova commented 1 year ago

Ok, so that wasn't the issue. Let's try to debug it step by step. Given the error, I would assume the mistake happens when we try to call Augur::calculate_auc. Conceptually, there are 2 possibilities: either something is not working in general for all of the neighbourhoods or something is off with some of the neighbourhoods, and then would be good to identify which neighbourhoods so we can zoom into them.

I suggest we try to run this function step by step. I will paste below chanks of the code, can you please try them in this order and let me know at which point it breaks:

1: redefine some of the sub-functions. This should work with no errors, but let me know if it doesn't:

filter_samples_with_low_n_cells_in_hood = function(x , min_n_cells_per_sample = 1){
  tab = table(x$milo_sample_id)
  samples_2_keep = names(tab)[tab >= min_n_cells_per_sample]
  x = x[, x$milo_sample_id %in% samples_2_keep]
  return(x)
}

get_auc_single_hood = function(x , nhoods_sce , hood_id , min_cells = 4 , min_n_cells_per_sample){
  require(Augur)
  require(SummarizedExperiment)
  require(SingleCellExperiment)
  # select cells
  current.cells = which(nhoods_sce[,hood_id] == 1)
  current.cells = rownames(nhoods_sce)[current.cells]
  current.sce = x[,colnames(x) %in% current.cells]
  current.sce = filter_samples_with_low_n_cells_in_hood(current.sce , min_n_cells_per_sample = min_n_cells_per_sample)

  if (ncol(current.sce) > 0){
    current.sce$celltype.dummy = "dummy"
    meta = as.data.frame(colData(current.sce))
    tab = table(as.character(current.sce$milo_condition_id))
    if (length(tab) == 2 & tab[1] >= min_cells & tab[2] >= min_cells){
      auc = calculate_auc(logcounts(current.sce), meta, cell_type_col = "celltype.dummy",
                          label_col = "milo_condition_id" , n_subsamples = 0 ,
                          subsample_size = min_cells , min_cells = min_cells ,
                          feature_perc = 1 , n_threads = 1 , show_progress = FALSE)
      out = as.data.frame(auc$AUC)
      out$auc_calculated = TRUE

    } else {
      out = data.frame(cell_type = "dummy" , auc = NaN , auc_calculated = FALSE)
    }
  } else {
    out = data.frame(cell_type = "dummy" , auc = NaN , auc_calculated = FALSE)
  }
  out$Nhood_center = hood_id
  return(out)
}

Set up the data to be able to run AUC.


library(miloDE)
library(miloR)
x = sce_milo
sample_id = "Patient"
condition_id = "milo"
# lets try w min_n_cells_per_sample = 1
min_n_cells_per_sample = 1

if (is.null(colnames(x))){ colnames(x) = as.character(c(1:ncol(x))) }

assign condition and sample ids

coldata <- as.data.frame(colData(x)) x$milo_sample_id <- as.factor( coldata[, sample_id] ) x$milo_condition_id <- as.factor( coldata[, condition_id] ) nhoods_sce = nhoods(x)

filter out for relevant cells

current_cols = colnames(nhoods_sce) nhoods_sce = as.matrix( nhoods_sce[rownames(nhoods_sce) %in% colnames(x), ] ) colnames(nhoods_sce) = current_cols

delete neighbourhoods that contain 0 cells (possible due to the upstream filtering)

idx = which(colSums(nhoods_sce) > 0) current_cols = colnames(nhoods_sce)[idx] nhoods_sce = as.matrix( nhoods_sce[, idx] ) colnames(nhoods_sce) = current_cols


3. Now lets try to run Augur itself. I think that's where you will have a break. We gonna print out neighbourhood's IDs so we can pinpoint for which neighbourhood it doesn't work. Please let me know: a) does this chunk finishing working; b) if it breaks - at which neighbourhood does it break (is it the first one or no). c) Can you register the hood.id at which it breaks as well? P.s. there are probably better ways to debug this, but that's how I usually would go around it.

auc_stat = lapply(colnames(nhoods_sce) , function(hood_id){ print(hood_id) out = get_auc_single_hood(x , nhoods_sce , hood_id , min_cells = 4 , min_n_cells_per_sample = min_n_cells_per_sample) return(out) }) auc_stat = do.call(rbind , auc_stat)


4. Assume it doesn't work for a certain neighbourhood (you should know after running previous part, for which neighbourhood). So for the neighbourhood for which it breaks, can you run this and print out the outputs of this code:

paste here your neighbourhood id

hood.id = @@paste here your neighbourhood id current.cells = rownames(nhoods_sce)[which(nhoods_sce[,hood_id] == 1)] current.sce = x[,colnames(x) %in% current.cells] current.sce = filter_samples_with_low_n_cells_in_hood(current.sce , min_n_cells_per_sample = min_n_cells_per_sample) table(current.sce$milo_sample_id , current.sce$milo_condition_id) class(current.sce$milo_sample_id) class(current.sce$milo_condition_id)

aditisk commented 1 year ago

Hi, I think the issue is while running Augur. I can't tell for sure but I think it doesn't finish running.

auc_stat = lapply(colnames(nhoods_sce) , function(hood_id){ print(hood_id) out = get_auc_single_hood(x , nhoods_sce , hood_id , min_cells = 4 , min_n_cells_per_sample = min_n_cells_per_sample) return(out) }) [1] "38641" Error in mutate(): ! Problem while computing metrics = pred %>% map(metric_fun). Caused by error in metric_set(): ! Failed to compute roc_auc(). Caused by error in parse(): ! :1:7: unexpected symbol 1: Using an ^ Run rlang::last_error() to see where the error occurred.

It does print 38641 but that is not a valid nhood ID, I even checked this with the code in step 4 above: current.cells = rownames(nhoods_sce)[which(nhoods_sce[,hood_id] == 1)] Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'which': object 'hood_id' not found

amissarova commented 1 year ago

You need first initialise hood_id. Can you please print out outputs of these entries (if it breaks at some point. let me know and keep printing outputs of next entries):

hood_id = "38641"
print(sce_milo)
nhoods_sce = nhoods(sce_milo)
class(colnames(sce_milo))
head(colnames(sce_milo))
head(rownames(nhoods_sce))
head(colnames(nhoods_sce))
mean(colnames(sce_milo) == rownames(nhoods_sce))
current.cells = rownames(nhoods_sce)[which(nhoods_sce[,hood_id] == 1)]
head(current.cells)
current.sce = sce_milo[,colnames(sce_milo) %in% current.cells]
print(current.sce)
table(current.sce$milo_sample_id , current.sce$milo_condition_id)

aditisk commented 1 year ago

Sorry I didn't realize the hood_id needs to be initialzed with quotes. Previously I just did 38641 but added the quotes now and the output is pasted below:

`

hood_id = "38641" print(sce_milo) class: Milo dim: 33545 58582 metadata(0): assays(2): counts logcounts rownames(33545): MIR1302-2HG FAM138A ... E6 E7 rowData names(0): colnames(58582): 103_pre_T_AAACCTGCAAGCTGAG-1 103_pre_T_AAACCTGCACAAGCCC-1 ... 140_pre_T_TTTGTCATCAATAAGG-1 140_pre_T_TTTGTCATCCTTGGTC-1 colData names(32): pt_ID nUMI ... milo ident reducedDimNames(2): UMAP PCA mainExpName: RNA altExpNames(0): nhoods dimensions(2): 58582 321 nhoodCounts dimensions(2): 1 1 nhoodDistances dimension(1): 0 graph names(1): graph nhoodIndex names(1): 321 nhoodExpression dimension(2): 1 1 nhoodReducedDim names(0): nhoodGraph names(1): nhoodGraph nhoodAdjacency dimension(2): 321 321 nhoods_sce = nhoods(sce_milo) class(colnames(sce_milo)) [1] "character" head(colnames(sce_milo)) [1] "103_pre_T_AAACCTGCAAGCTGAG-1" "103_pre_T_AAACCTGCACAAGCCC-1" "103_pre_T_AAACCTGGTCCGTTAA-1" "103_pre_T_AAACCTGGTTTGTTGG-1" "103_pre_T_AAACCTGTCCGTAGGC-1" [6] "103_pre_T_AAACGGGCAAAGAATC-1" head(rownames(nhoods_sce)) [1] "103_pre_T_AAACCTGCAAGCTGAG-1" "103_pre_T_AAACCTGCACAAGCCC-1" "103_pre_T_AAACCTGGTCCGTTAA-1" "103_pre_T_AAACCTGGTTTGTTGG-1" "103_pre_T_AAACCTGTCCGTAGGC-1" [6] "103_pre_T_AAACGGGCAAAGAATC-1" head(colnames(nhoods_sce)) [1] "38641" "38123" "2886" "46657" "3694" "37096" mean(colnames(sce_milo) == rownames(nhoods_sce)) [1] 1 current.cells = rownames(nhoods_sce)[which(nhoods_sce[,hood_id] == 1)] head(current.cells) [1] "104_pre_T_ATCGAGTAGTGAACGC-1" "104_pre_T_CTCAGAAGTGGTAACG-1" "108_pre_T_AACTCAGTCAGGCGAA-1" "108_pre_T_ACCGTAAAGCCACCTG-1" "108_pre_T_AGAGCGACAACACGCC-1" [6] "108_pre_T_ATTCTACCAATAACGA-1" current.sce = sce_milo[,colnames(sce_milo) %in% current.cells] print(current.sce) class: Milo dim: 33545 527 metadata(0): assays(2): counts logcounts rownames(33545): MIR1302-2HG FAM138A ... E6 E7 rowData names(0): colnames(527): 104_pre_T_ATCGAGTAGTGAACGC-1 104_pre_T_CTCAGAAGTGGTAACG-1 ... 140_pre_T_TTCTACATCCTATGTT-1 140_pre_T_TTGCGTCCAGTACACT-1 colData names(32): pt_ID nUMI ... milo ident reducedDimNames(2): UMAP PCA mainExpName: RNA altExpNames(0): nhoods dimensions(2): 58582 321 nhoodCounts dimensions(2): 1 1 nhoodDistances dimension(1): 0 graph names(1): graph nhoodIndex names(1): 321 nhoodExpression dimension(2): 1 1 nhoodReducedDim names(0): nhoodGraph names(1): nhoodGraph nhoodAdjacency dimension(2): 321 321 table(current.sce$milo_sample_id , current.sce$milo_condition_id) < table of extent 0 x 0 >

`

amissarova commented 1 year ago

Im sorry, the last row was supposed to be:

table(current.sce$Patient , current.sce$milo)

Also, can you print out:

class(sce_milo$Patient)
class(sce_milo$milo)

aditisk commented 1 year ago

table(current.sce$Patient , current.sce$milo)

    Other_cells pMajor

P_104 0 2 P_108 0 13 P_109 21 0 P_111 41 0 P_112 12 0 P_116 1 0 P_117 71 0 P_119 2 0 P_124 1 0 P_129 0 186 P_130 1 0 P_131 0 95 P_133 2 0 P_136 30 0 P_137 4 0 P_138 2 0 P_139 0 1 P_140 42 0

class(sce_milo$Patient) [1] "character" class(sce_milo$milo) [1] "character"

amissarova commented 1 year ago

how about

dim(logcounts(current.sce))
sum(is.na(logcounts(current.sce)))

?

amissarova commented 1 year ago

I just realised that some of your features contain unorthodox names such as E6, E7 - can I ask what do they stand for? if they are not genes, can you pass only genes to the functions? there is a designated variable for this

aditisk commented 1 year ago

dim(logcounts(current.sce)) [1] 33545 527 sum(is.na(logcounts(current.sce))) [1] 0

Those genes are HPV genes that we are interested in investigating in our dataset. We have added the viral genome to our reference while mapping to get counts for these genes. However, if you think they might be causing a problem I can delete them from this analysis. Not sure if this matters, but I've used the same object with these genes and run Milo successfully.

amissarova commented 1 year ago

Milo is ~oblivious to genes since its estimating DA, not DE - so it is possible that it might be causing the problem. I know that sometimes Augur does not finish depending on which genes you enter.

So all I asked you to print out by now seems reasonable - so atm I would suggest testing this part of the algorithm (run original calc_AUC_per_neighbourhood) on different sets of genes.

Can you try the next three options and tell me if any of those are finishing successfully:

delete the HPV genes and submit all the remaining genes.
Select HVGs on all your entries. There are several ways of doing so, the one I usually go for is below:
```
dec.sce = scran::modelGeneVar(sce)
hvg.genes = scran::getTopHVGs(dec.sce, n = 5000)
```
delete the HPV genes and select HVGs from the remaining genes.

amissarova commented 1 year ago

Also, prior to all that, can you please ensure that all genes are unique (sometimes there are duplicates in the names - can you please manually convert them so they have the unqiue ID)? e.g. if you have 2 entries with the same gene name, add .1 and .2 at the end of the gene (although there is a check for this in the function, so I doubt it will be that but jic)

aditisk commented 1 year ago

Hi, thanks for all the suggestions. I double checked my gene names, there are no duplicates. I tried the original AUC function with 2 gene sets:

All genes except the HPV genes
5000 HVGs (these don't contain any HPV genes)

#Removed HPV genes
stat_auc = suppressWarnings(calc_AUC_per_neighbourhood(sce_milo,genes = genes_no_hpv,sample_id = "Patient" , condition_id = "milo", min_n_cells_per_sample = 10))
Error in `mutate()`:
! Problem while computing `metrics = pred %>% map(metric_fun)`.
Caused by error in `metric_set()`:
! Failed to compute `roc_auc()`.
Caused by error in `parse()`:
! <text>:1:7: unexpected symbol
1: Using an
          ^
Run `rlang::last_error()` to see where the error occurred.

##Selecting only HVGs
dec.sce = scran::modelGeneVar(baseline_sce)
hvg.genes = scran::getTopHVGs(dec.sce, n = 5000)

stat_auc = suppressWarnings(calc_AUC_per_neighbourhood(sce_milo,genes = hvg.genes,sample_id = "Patient" , condition_id = "milo",min_n_cells_per_sample = 10))
Error in `mutate()`:
! Problem while computing `metrics = pred %>% map(metric_fun)`.
Caused by error in `metric_set()`:
! Failed to compute `roc_auc()`.
Caused by error in `parse()`:
! <text>:1:7: unexpected symbol
1: Using an
          ^
Run `rlang::last_error()` to see where the error occurred.

amissarova commented 1 year ago

I cant really think on why this doesn't work wo further checking this myself (I tried on several ds now, it works for me). Can you maybe share some of the data (for one neighbourhood) so I can check myself (wo any proper metadata so nw about that). Feel free to rename the genes as well if that's a concern.

What I would need is:

hood_id = "38641"
current.cells = rownames(nhoods_sce)[which(nhoods_sce[,hood_id] == 1)]
current.sce = sce_milo[,colnames(sce_milo) %in% current.cells]
## send logcounts current.sce
logcounts(current.sce)
## send metadata containing only columns Patient_id and milo:
meta = as.data.frame(colData(current.sce))
meta = meta[, c("Patient" , "milo")]

Lmk if its ok with you to send this and I will try to run it myself

amissarova commented 1 year ago

and also, can you please send your session info:

sessionInfo()

aditisk commented 1 year ago

I'll have to check with my PI regarding data sharing. In the meantime, do you have an alternative suggestion for filtering neighborhoods for DE testing ? As you mentioned earlier, this is not a mandatory step and I can proceed with the DE testing by skipping this step as well.

Here is the sessionInfo()

R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS:   /usr/lib64/libblas.so.3.4.2
LAPACK: /usr/lib64/liblapack.so.3.4.2

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] EssReg_0.1.0                doParallel_1.0.17           iterators_1.0.14            SLIDE_0.1.0                 BiocParallel_1.30.3        
 [6] knitr_1.39                  foreach_1.5.2               plyr_1.8.7                  stringr_1.4.1               SeuratObject_4.1.3         
[11] Seurat_4.3.0                patchwork_1.1.2             dplyr_1.0.10                scater_1.24.0               ggplot2_3.3.6              
[16] scuttle_1.6.0               SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1 Biobase_2.56.0              GenomicRanges_1.48.0       
[21] GenomeInfoDb_1.32.4         IRanges_2.30.1              S4Vectors_0.34.0            BiocGenerics_0.42.0         MatrixGenerics_1.8.0       
[26] matrixStats_0.63.0          miloR_1.4.0                 edgeR_3.38.0                limma_3.52.4                miloDE_0.0.0.9000          

loaded via a namespace (and not attached):
  [1] utf8_1.2.2                spatstat.explore_3.0-5    reticulate_1.26           tidyselect_1.2.0          htmlwidgets_1.5.4         grid_4.2.0               
  [7] Rtsne_0.16                munsell_0.5.0             ScaledMatrix_1.4.0        codetools_0.2-18          ica_1.0-3                 statmod_1.4.36           
 [13] scran_1.24.0              future_1.29.0             miniUI_0.1.1.1            withr_2.5.0               tester_0.1.7              spatstat.random_3.0-1    
 [19] colorspace_2.0-3          progressr_0.11.0          highr_0.9                 rstudioapi_0.13           ROCR_1.0-11               tensor_1.5               
 [25] ggsignif_0.6.3            pbmcapply_1.5.1           listenv_0.8.0             labeling_0.4.2            GenomeInfoDbData_1.2.8    polyclip_1.10-4          
 [31] farver_2.1.1              parallelly_1.32.1         vctrs_0.4.1               generics_0.1.3            xfun_0.31                 ipred_0.9-12             
 [37] randomForest_4.7-1        R6_2.5.1                  ggbeeswarm_0.6.0          graphlayouts_0.8.3        rsvd_1.0.5                locfit_1.5-9.5           
 [43] pals_1.7                  bitops_1.0-7              spatstat.utils_3.0-1      DelayedArray_0.22.0       assertthat_0.2.1          promises_1.2.0.1         
 [49] scales_1.2.1              ggraph_2.1.0              nnet_7.3-17               beeswarm_0.4.0            gtable_0.3.1              beachmat_2.12.0          
 [55] globals_0.16.2            goftest_1.2-3             tidygraph_1.2.2           timeDate_3043.102         rlang_1.0.6               splines_4.2.0            
 [61] rstatix_0.7.0             lazyeval_0.2.2            yardstick_1.1.0           dichromat_2.0-0.1         spatstat.geom_3.0-3       broom_1.0.0              
 [67] reshape2_1.4.4            abind_1.4-5               backports_1.4.1           httpuv_1.6.6              tools_4.2.0               lava_1.6.10              
 [73] Augur_1.0.3               ellipsis_0.3.2            RColorBrewer_1.1-3        ggridges_0.5.4            Rcpp_1.0.9                parsnip_1.0.4            
 [79] sparseMatrixStats_1.8.0   zlibbioc_1.42.0           purrr_0.3.5               RCurl_1.98-1.8            deldir_1.0-6              ggpubr_0.4.0             
 [85] rpart_4.1.16              pbapply_1.6-0             viridis_0.6.2             cowplot_1.1.1             zoo_1.8-11                ggrepel_0.9.2            
 [91] cluster_2.1.3             furrr_0.2.3               magrittr_2.0.3            data.table_1.14.6         scattermore_0.8           lmtest_0.9-40            
 [97] RANN_2.6.1                fitdistrplus_1.1-8        mime_0.12                 xtable_1.8-4              gridExtra_2.3             compiler_4.2.0           
[103] tibble_3.1.8              maps_3.4.0                crayon_1.5.2              KernSmooth_2.23-20        htmltools_0.5.3           later_1.3.0              
[109] tidyr_1.2.1               lubridate_1.8.0           DBI_1.1.3                 tweenr_2.0.2              MASS_7.3-57               Matrix_1.5-1             
[115] car_3.1-0                 cli_3.4.1                 metapod_1.4.0             gower_1.0.0               igraph_1.3.5              pkgconfig_2.0.3          
[121] RcppGreedySetCover_0.1.0  sp_1.5-1                  plotly_4.10.1             spatstat.sparse_3.0-0     recipes_0.2.0             vipor_0.4.5              
[127] hardhat_1.2.0             dqrng_0.3.0               XVector_0.36.0            prodlim_2019.11.13        digest_0.6.30             sctransform_0.3.5        
[133] RcppAnnoy_0.0.20          spatstat.data_3.0-0       leiden_0.4.3              uwot_0.1.14               DelayedMatrixStats_1.18.0 shiny_1.7.3              
[139] gtools_3.9.4              nlme_3.1-157              lifecycle_1.0.1           jsonlite_1.8.3            carData_3.0-5             BiocNeighbors_1.14.0     
[145] mapproj_1.2.8             viridisLite_0.4.1         fansi_1.0.3               pillar_1.8.1              lattice_0.20-45           fastmap_1.1.0            
[151] httr_1.4.4                survival_3.3-1            glue_1.6.2                png_0.1-7                 bluster_1.6.0             ggforce_0.4.1            
[157] class_7.3-20

amissarova commented 1 year ago

Im gonna check asap your data dependencies - but if that is really the case that they re somehow broken, can you check on your end if you can run this part of the package on a different dataset, for which I know it should work. So here is a snippet from the vignette, can you run this?

library(miloDE)

# library containing toy data
suppressMessages(library(MouseGastrulationData))

# analysis libraries
library(scuttle)
suppressMessages(library(miloR))
suppressMessages(library(uwot))
library(scran)
suppressMessages(library(dplyr))
library(reshape2)

# plotting libraries
library(ggplot2)
library(viridis)
#> Loading required package: viridisLite
library(ggpubr)

# load chimera Tal1
sce = suppressMessages(MouseGastrulationData::Tal1ChimeraData())

# downsample to few selected cell types
cts = c("Spinal cord" , "Haematoendothelial progenitors", "Endothelium" , "Blood progenitors 1" , "Blood progenitors 2")
sce = sce[, sce$celltype.mapped %in% cts]
# let's rename Haematoendothelial progenitors
sce$celltype.mapped[sce$celltype.mapped == "Haematoendothelial progenitors"] = "Haem. prog-s."

# delete row for tomato
sce = sce[!rownames(sce) == "tomato-td" , ]

# add logcounts
sce = logNormCounts(sce)

# update tomato field to be more interpretable 
sce$tomato = sapply(sce$tomato , function(x) ifelse(isTRUE(x) , "Tal1_KO" , "WT"))

# for this exercise, we focus on 3000 highly variable genes (for computational efficiency)
dec.sce = modelGeneVar(sce)
hvg.genes = getTopHVGs(dec.sce, n = 3000)
sce = sce[hvg.genes , ]
# change rownames to symbol names
rowdata = as.data.frame(rowData(sce))
rownames(sce) = rowdata$SYMBOL

# add UMAPs
set.seed(32)
umaps = as.data.frame(uwot::umap(reducedDim(sce , "pca.corrected")))
# let's store UMAPs - we will use them for visualisation
reducedDim(sce , "UMAP") = umaps

## assign neighbourhoods
set.seed(32)
sce_milo = assign_neighbourhoods(sce , k = 20 , order = 2, 
                                 filtering = TRUE , reducedDim_name = "pca.corrected" , verbose = F)

## run AUC
stat_auc = calc_AUC_per_neighbourhood(sce_milo , sample_id = "sample" , condition_id = "tomato", min_n_cells_per_sample = 1)

feel free to change variable names ofc

amissarova commented 1 year ago

Re data sharing - yeah, ofc. You could stress that this is only for one neighbourhood (small subset of the data). Re selection of neighbourhoods - couple of comments: 1) you can comfortably proceed without any selection to begin with - the expectation, that multiple testing correction to a big degree will eliminate FPs. Ofc, selection helps and increases power, but that's optional. 340 neighbourhoods is not a lot, so that should be fine (although bare in mind that it will run for a while). 2) I personally find miloDE useful the most when it is performed at the more downstream part of the analysis, once you already established which CTs are of particular interest. To exemplify - say you did some main analysis, and found that you cell type CT1 is of particular interest. In addition, you have reasons to believe that there is a certain intra-cell type heterogeneity in CT1. Then you can subset your cells only to this cell type CT1 and perform miloDE on it (will be much less neighbourhoods and much faster). 3) You can also increase k slightly to decrease the number of neighbourhoods. 4) you can also focus on neighbourhoods that are not extremely DA. You can select certain threshold (say 10?), and select neighborhoods in which both conditions contain at least a preselected number of cells.

Hope that helps

amissarova commented 1 year ago

Can you please install these 2 packages and check if it works: rsample (>= 0.0.4) glmnet (>= 2.0) ?

aditisk commented 1 year ago

Hi, thanks for suggestions regarding the use of miloDE, I'll think about it and see what fits our dataset and use case the best.

I tried running the snippet you shared earlier (after installing rsample and glmnet) and I am still getting the same error with the test dataset:

> sce_milo = assign_neighbourhoods(sce , k = 20 , order = 2, 
+                                  filtering = TRUE , reducedDim_name = "pca.corrected" , verbose = F)
100% covered by 42 sets.
> ## run AUC
> stat_auc = calc_AUC_per_neighbourhood(sce_milo , sample_id = "sample" , condition_id = "tomato", min_n_cells_per_sample = 1)
Error in `mutate()`:
! Problem while computing `metrics = pred %>% map(metric_fun)`.
Caused by error in `metric_set()`:
! Failed to compute `roc_auc()`.
Caused by error in `parse()`:
! <text>:1:7: unexpected symbol
1: Using an
          ^
Run `rlang::last_error()` to see where the error occurred.

amissarova commented 1 year ago

Since you can not run this snippet, Im pretty confident that the issue is with the dependencies.

I see that you installed rsample and glmnet so it must be something else.
I cant check rn your sessionInfo in greater details right now, but I will take a closer look tonight or tomorrow.

aditisk commented 1 year ago

Sounds good, thank you so much for all your help and patience with debugging this.

amissarova commented 1 year ago

I have a feeling its something to do with potential incompatibility of AUC calculation with the newest yardstick. Can you perhaps downgrade it:

install.packages("yardstick", version = "1.0.0")

And then reinstall miloDE. Let me know if it solves the problem?

aditisk commented 1 year ago

The versions on R installed on our computing cluster are not compatible with this older version of yardstick. I'll try this on my personal computer next week and get back to you.

amissarova commented 1 year ago

Thanks - i know its not ideal, sorry for that Meanwhile I submitted the issue w Augur developers, I'll let you know what they say (possibly it will be fixed on their end if that's really what was causing the problem, and then miloDE will be compatible with new version of yeardstick). Meanwhile, for you I would also suggest to simply proceed wo this part of the algorithm

RussBainer commented 1 year ago

Hi, not sure if this is related or not, but I am also hitting an error when I get to the AUC calculation step:

> milo
class: Milo 
dim: 24018 128301 
metadata(0):
assays(3): counts logcounts logcounts.raw
rownames(24018): AL627309.1 LINC01128 ... AC007325.4 AC007325.2
rowData names(0):
colnames(128301): 68011_AAACCCAGTAATACCC-1 68011_AAACCCAGTACTCGAT-1 ... 69361_TTTGTTGGTCAAGTTC-1
  69361_TTTGTTGTCTCTTGCG-1
colData names(25): orig.ident nCount_RNA ... eGFR_class Risk_Geno
reducedDimNames(1): PCA.integrated
mainExpName: RNA
altExpNames(0):
nhoods dimensions(2): 128301 434
nhoodCounts dimensions(2): 1 1
nhoodDistances dimension(1): 0
graph names(1): graph
nhoodIndex names(1): 434
nhoodExpression dimension(2): 1 1
nhoodReducedDim names(0):
nhoodGraph names(1): nhoodGraph
nhoodAdjacency dimension(2): 434 434
>   stat_auc <- suppressWarnings(calc_AUC_per_neighbourhood(milo, n_threads = 2,
+                                                           sample_id = 'ID',
+                                                           condition_id = "Risk_Geno", 
+                                                           min_n_cells_per_sample = 1, 
+                                                           BPPARAM = SerialParam()))
Error: BiocParallel errors
  1 remote errors, element index: 1
  433 unevaluated and other errors
  first remote error:
Error in calculate_auc(logcounts(current.sce), meta, cell_type_col = "celltype.dummy", : input must be Seurat, monocle, sparse matrix, numeric matrix, or numeric data frame

Any ideas? Please let me know if I should migrate this to another issue.

RussBainer commented 1 year ago

Following up above- this appears to be an error on my end, please disregard.

The issue was that a preprocessing step had converted the logcounts slot of milo to a ResidualMatrix, which does not appear to be supported by the AUC architecture. Leaving this here in case others encounter a similar error, otherwise please consider this resolved.

Thanks so much for your work on developing this great package!

MarioniLab / miloDE

Error while running the calc_AUC_per_neighbourhood() function #26

rlang::last_error() <error/dplyr:::mutate_error> Error in `mutate()`: ! Problem while computing `metrics = pred %>% map(metric_fun)`. Caused by error in `metric_set()`: ! Failed to compute `roc_auc()`. Caused by error in `parse()`: ! :1:7: unexpected symbol 1: Using an ^

With parallelization

Without parallelization

assign condition and sample ids

filter out for relevant cells

delete neighbourhoods that contain 0 cells (possible due to the upstream filtering)

paste here your neighbourhood id

MarioniLab / miloDE

Error while running the calc_AUC_per_neighbourhood() function #26

rlang::last_error() <error/dplyr:::mutate_error> Error in mutate(): ! Problem while computing metrics = pred %>% map(metric_fun). Caused by error in metric_set(): ! Failed to compute roc_auc(). Caused by error in parse(): ! :1:7: unexpected symbol 1: Using an ^

With parallelization

Without parallelization

assign condition and sample ids

filter out for relevant cells

delete neighbourhoods that contain 0 cells (possible due to the upstream filtering)

paste here your neighbourhood id

rlang::last_error() <error/dplyr:::mutate_error> Error in `mutate()`: ! Problem while computing `metrics = pred %>% map(metric_fun)`. Caused by error in `metric_set()`: ! Failed to compute `roc_auc()`. Caused by error in `parse()`: ! :1:7: unexpected symbol 1: Using an ^