fzhaouf / scaDA

differential accessibility using scARA-seq
Other
1 stars 0 forks source link

0 (non-NA) cases #1

Closed Puriney closed 7 months ago

Puriney commented 7 months ago

Hello,

I find this error in my data:

start initial parameter estiamte
Error in lm.wfit(X[Y1, , drop = FALSE], log(Y[Y1]) - offsetx[Y1], weights[Y1]) : 
  0 (non-NA) cases

I find the reason is that if a group of cells have all zero counts in a peak, the test will fail at this particular peak. My question to your team is that how to statistically accommodate this case? My concern is that a peak having all zeros in cell group A but not in group B may biologically mean this peak is closed in group A and accessible in group B. A test should not fail at this peak.

This is the minimal reproducible example to repeat this issue I reported. What I did was to replace all fragment counts of the first peak in all the microglia cells by zero. The rest remains the same as the tutorial. This snippet fails immediately.

library(scaDA)
data("HumanBrain", package = "scaDA") # load human brain dataset
counts = HumanBrain@assays$ATAC@counts
coldata = HumanBrain@meta.data$celltype
idx_is_microglia <- coldata == 'microglia' #!!!
counts[1, idx_is_microglia] <- 0 # !!!
scaDA.obj <- scaDAdatasetFromMatrix(count = as.matrix(counts), colData = data.frame(coldata))
scaDA.obj <- estParams(scaDA.obj, group.1 = "microglia", group.2 = "astrocytes")

Another comment is that: If the test at a peak fails, the for loop inside estParams immediately fails and quits without caching the test results successfully done in other peaks. I was wondering if it is statistically reasonable to still use these intermediate results. If so, I would suggest redesigning the function using something like try.

fzhaouf commented 7 months ago

Hi Puriney thank you for your insightful questions. zinb model does not work well when the data is filled with all zeros or with all non-zeros. we are currently working on applying some changes to the code and implementing more validity checks of the scaDAobj to address these issues. Our analysis of numerous real datasets has shown that the number of peaks falling into these extreme categories is extremely small compared to the overall number of peaks. We are planning to implement a Wilcoxon test/nonparametric method for handling these outliers. Regarding your second comment, in the current version, if the pipeline encounters an error, the results are not saved in the obj, and therefore the final output does not accurately reflect our method. This is definitely an area for improvement. will look into your suggestion regarding try.

Thank you! Fengdi

Puriney commented 7 months ago

zinb model does not work well when the data is filled with all zeros or with all non-zeros

I agree with you. Zinb is a zero-inflated therefore it probably works very well with all non-zeros. Peaks with all zeros will be removed during the preprocessing step so it won't bother. However, to clarify all (non-)zeros, what I mean is that a peak has all zeros in one cell group only. For example, an epithelial population has 500 cells and a T cell population has 500 cells, a peak around RUNX3 (a T cell lineage transcription factor) has all zeros in epithelial and many non-zeros in T (as expected). In this example, the test should call this peak out.

fzhaouf commented 7 months ago

the package is based on a two-group design so far, cannot handle all zeros/ all non-zeros in either condition. For the mentioned case, using wilcoxon test to call peaks would be most appropriate, and wilcoxon is the default method in many DA pipelines. we can make changes fairly easily to handle such cases in the current version. later, if we expand the method to a glm framework, it could handle your mentioned case, in glm we can directly make inference based on the pool data across different conditions together. hope this helps

Puriney commented 7 months ago

Great. Thanks a lot. Looking forward to updates. Yes, there are a bunch of DA methods out there. Some worked and some did not work well. As your method addresses some specific statistical properties of scATAC and is fairly new, I am giving a try.