Closed: fbrundu closed this issue 4 years ago
I am running into the same problem. Many genes with an NA logFC also have a small FDR, and they could be interesting to study. Any help would be appreciated!
NA is missing, NaN is "Not a Number". They are not the same. You have inestimable coefficients for some genes because there's not enough data to fit the proposed model. Do you filter genes that have low expression frequency across cells? Have you plotted your data vs the predictors for one of these genes and compared it to a gene that has all coefficients properly estimated? Maybe post your test data set, including genes that do and don't show this problem.
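To make the NA vs NaN distinction concrete, here is a minimal base R sketch. The vector `x` is a made-up example and not MAST output; the point is only that `is.na()` is TRUE for both kinds of value while `is.nan()` is TRUE only for NaN, so the two are easy to confuse once results are filtered or exported.

```r
# Toy vector (hypothetical values): one real number, one missing value,
# and two NaN values (0/0 evaluates to NaN in R).
x <- c(1.5, NA, NaN, 0 / 0)

# is.na() flags both NA and NaN.
na_flags <- is.na(x)

# is.nan() flags only NaN, not NA.
nan_flags <- is.nan(x)

print(data.frame(x = x, is_na = na_flags, is_nan = nan_flags))
```

Checking a results column with `is.nan()` before writing it to file is one way to tell which of the two you actually have.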
Hi @gfinak, thanks for your reply.
The NaN for the coefficient (i.e. coef) gets translated to NA when saving the results to file, which is why it appears as NA here. In the original results it is NaN, hence the title.
Regarding the low expression, how do you define the threshold for expression frequency across cells? That is, what threshold does MAST need in order to estimate DE correctly?
In the previous example, COMT (correctly estimated) is expressed in 12 cells, whereas CHCHD2 (NaN coef) is detected in 22 cells.
I am not sure how to plot the data against the predictors. Is that covered in the tutorial?
I can send you a minimal example dataset privately. Is there an email address I can use to send the link to the rds file?
Thanks
I may have spotted the issue. Assuming testing by condition, I computed the number of counts for each condition, for each gene (both correctly estimated and not).
The genes not estimated have one condition with zero counts, while the others have positive counts on both conditions. That's possibly why the coefficient cannot be estimated.
Yes, that's precisely why.
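The check described above can be sketched in base R. Everything here is a toy assumption: the matrix `M`, the gene names `geneA`/`geneB`, and the `condition` labels are invented for illustration. The second half uses `lm.fit()` as a simple stand-in for a Gaussian fit on expressing cells only; it is not MAST's own code, but it shows why a condition with no expressing cells leaves the condition coefficient inestimable.

```r
# Toy counts (hypothetical): genes in rows, cells in columns.
M <- rbind(
  geneA = c(4, 0, 2, 3, 0, 5),  # detected in both conditions
  geneB = c(0, 0, 0, 6, 2, 1)   # no counts at all in "Case"
)
colnames(M) <- paste0("cell", 1:6)
condition <- factor(c("Case", "Case", "Case", "Control", "Control", "Control"),
                    levels = c("Control", "Case"))

# Total counts per gene and condition (rowsum() sums rows by group,
# so transpose to cells x genes first, then back to genes x conditions).
counts_by_condition <- t(rowsum(t(M), group = condition))
print(counts_by_condition)

# Genes with zero counts in at least one condition.
flagged <- rownames(counts_by_condition)[apply(counts_by_condition == 0, 1, any)]
print(flagged)

# Stand-in for the continuous part: fit only on cells expressing geneB.
expressed <- M["geneB", ] > 0
X <- model.matrix(~ condition, data.frame(condition = condition[expressed]))
fit <- lm.fit(X, log2(M["geneB", expressed] + 1))

# The "conditionCase" column is all zeros, so the design is rank deficient
# and that coefficient comes back as NA.
design_rank <- qr(X)$rank
print(fit$coefficients)
```

Running the same per-condition tally on the real count matrix should list exactly the genes whose coefficient could not be estimated, if the explanation above holds for your data.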
It is recommended to do some pre-filtering by removing genes that are expressed in fewer than 10% of cells, i.e. if M is your matrix of counts, with genes as rows and cells as columns, keep the genes where rowMeans(M>0) > 0.1.
How many total cells are in your experiment?
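As a minimal sketch of the pre-filtering step recommended above, assuming a toy count matrix `M` with made-up gene names (genes in rows, cells in columns):

```r
# Toy counts (hypothetical): 3 genes x 20 cells.
M <- rbind(
  geneA = c(rep(5, 10), rep(0, 10)),  # expressed in 50% of cells
  geneB = c(1, rep(0, 19)),           # expressed in 5% of cells
  geneC = c(2, 3, 1, rep(0, 17))      # expressed in 15% of cells
)

# Fraction of cells in which each gene is detected.
expr_freq <- rowMeans(M > 0)

# Keep genes detected in more than 10% of cells.
keep <- expr_freq > 0.1
M_filtered <- M[keep, , drop = FALSE]

print(expr_freq)
print(rownames(M_filtered))
```

The `drop = FALSE` keeps the result a matrix even if only one gene passes the filter. When testing within a cell subtype, the same expression would be applied to the columns of that subtype only.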
Ok, thanks! Just asking: do you know of a way to define the threshold more precisely? I imagine that if one population accounts for less than 10% of the total number of cells and its markers are highly specific, we might lose those markers even if we have enough cells to compute DE. The experiment has a total of 30k cells; however, I evaluate the condition within each cell subtype (in this case n=132), usually ranging from 100 to 5k cells. I use a very loose threshold on the genes I test (a minimum of n=3 cells) because I would rather not filter too much beforehand without a clear rationale.
Think about statistical power. How much power do you have to detect a difference with a sample size of 3? Especially after you adjust for multiple testing. We chose 10% because it's an empirical lower limit for the discrete part of the test.
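A rough way to see the power problem is a best-case calculation. The sketch below is an assumption-laden illustration, not MAST's actual test: it uses Fisher's exact test as a simple stand-in for a test on detection rates, and it assumes the 132 cells split evenly into 66 Case and 66 Control cells (the real split is not given in this thread). It takes the most extreme table possible for a gene detected in only 3 cells, with all 3 in one group.

```r
# Assumed group sizes (hypothetical even split of n = 132).
n_case <- 66
n_control <- 66

# Most extreme outcome for a gene detected in 3 cells:
# all 3 expressing cells fall in the Case group.
tab <- matrix(
  c(3, n_case - 3,   # Case:    expressing, not expressing
    0, n_control),   # Control: expressing, not expressing
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Case", "Control"), c("expressed", "not_expressed"))
)

# Two-sided Fisher's exact test on the detection table.
best_case_p <- fisher.test(tab)$p.value
print(best_case_p)
```

Under these assumptions, even the most favourable configuration does not reach p < 0.05 before any multiple-testing adjustment, which is the point about sample size made above.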
Ok thanks, it is clear.
For future reference:
If I read it correctly, genes for which the continuous component cannot be estimated should be dropped.
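A minimal sketch of dropping such genes from a results table, assuming a hypothetical data frame `res` with made-up columns `primerid`, `coef` and `fdr` (the real column names depend on how the results were assembled):

```r
# Toy results table (hypothetical names and values).
res <- data.frame(
  primerid = c("geneA", "geneB", "geneC"),
  coef     = c(0.8, NaN, -1.2),
  fdr      = c(0.200, 0.004, 0.010)
)

# is.na() is TRUE for both NA and NaN, so this works whether the table
# still holds NaN or was re-read from a file where it became NA.
not_estimable <- is.na(res$coef)
res_estimable <- res[!not_estimable, ]

print(res_estimable)
```

Keeping the dropped rows in a separate object, e.g. `res[not_estimable, ]`, makes it easy to report how many genes were excluded for this reason.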
Hi! Thanks for your work on this tool. I have an issue computing differential gene expression in a model with several covariates. The model is the following:
~ group + n_genes + pair + percent_mito + percent_ribo_p
where group is the condition that I want to test (i.e. Case or Control), n_genes is the number of detected genes, pair is a categorical label that indicates the batch, and percent_mito and percent_ribo_p are the percentages of mitochondrial RNA and of ribosomal protein RNA in each cell, respectively.
When I analyze the DEG results, I notice that some genes have NA in place of the coefficient but still have FDR < 1. For example:
I am not sure how to interpret genes with FDR < 0.01 (for example) but no coefficient. Is this an issue, or how can it be interpreted? I was also reading https://github.com/RGLab/MAST/issues/98 but I'm not sure how to adapt the reply to that issue to my data.
I created a small dataset of cells (n=132) in which this behavior appears, which I can share privately if necessary. Please let me know if you need any other information.
Thanks