RGLab / MAST

Tools and methods for analysis of single cell assay data in R
224 stars 57 forks source link

NaN in coefficient when using model with several covariates (but FDR < 1) #135

Closed fbrundu closed 4 years ago

fbrundu commented 4 years ago

Hi! Thanks for your work on this tool. I have an issue computing differential gene expression in a model with several covariates. The model is the following:

~ group + n_genes + pair + percent_mito + percent_ribo_p

where group is the condition that I want to test (i.e. Case or Control). n_genes is the number of detected genes, pair is a categorical label that indicates the batch, percent_mito and percent_ribo_p are respectively the percentages of mitochondrial RNA and of the ribosomal proteins RNA in each cell.

When I analyze the DEG results, I notice that some genes have NAs in place of coefficient, but with FDR < 1. For example:

primerid component contrast Pr(>Chisq) ci.hi ci.lo coef z
CHCHD2 C groupControl 1 NA NA NA NA
CHCHD2 C (Intercept) NA 13.2139294612287 7.27487862856727 10.244404044898 6.76157303138578
CHCHD2 C n_genes NA 0.00026628978864847 -0.000415653246692049 -7.46817290217894E-05 -0.42928365449996
CHCHD2 C pair NA 0.603598745441239 -1.44191779536421 -0.419159524961484 -0.803256836415448
CHCHD2 C percent_mito NA 26.8917686666073 -24.2948918010666 1.29843843277039 0.0994357725673384
CHCHD2 C percent_ribo_p NA 7.78446426292038 -2.99466694859656 2.39489865716191 0.870926426731957
CHCHD2 D groupControl 2.187508828583E-11 6.92093514687845 2.50218975637947 4.71156245162896 4.17968988933359
CHCHD2 D (Intercept) NA -5.45262320626328 -12.9538395476205 -9.20323137694191 -4.80935390191168
CHCHD2 D n_genes NA 0.000605309648438176 -0.000322803454311596 0.00014125309706329 0.596588890143987
CHCHD2 D pair NA 2.63157523052559 0.813623683086202 1.72259945680589 3.7143266000497
CHCHD2 D percent_mito NA 6.69954331431088 -48.5358111642693 -20.9181339249792 -1.4845125736504
CHCHD2 D percent_ribo_p NA 11.8916501312325 -5.06608244072294 3.41278384525477 0.788894788302249
CHCHD2 H groupControl 2.187508828583E-11 NA NA NA NA
CHCHD2 S groupControl NA NA NA NA NA
CHCHD2 S (Intercept) NA NA NA NA 1.38042738481323
CHCHD2 S n_genes NA NA NA NA 0.118302666651904
CHCHD2 S pair NA NA NA NA 2.0584371703729
CHCHD2 S percent_mito NA NA NA NA -0.979397198510006
CHCHD2 S percent_ribo_p NA NA NA NA 1.17367083670798
CHCHD2 logFC groupControl NA NA NA NA NA
CHCHD2 logFC n_genes NA 7.28238715372072E-07 -4.51844845269732E-07 1.3819693505117E-07 0.459053959410812
CHCHD2 logFC pair NA 0.0192958459598146 -0.0102844516875691 0.00450569713612275 0.597086900024951
CHCHD2 logFC percent_mito NA 0.00289076102603723 -0.00495405141142126 -0.00103164519269202 -0.515496689977027
CHCHD2 logFC percent_ribo_p NA 0.323601607126519 -0.248633415386968 0.0374840958697755 0.256773790513932
On the other hand, one gene with FDR < 1 but coefficient different from NA is the following: primerid component contrast Pr(>Chisq) ci.hi ci.lo coef z
COMT C groupControl 0.85483797977702 2.91033719012339 -1.98702041381175 0.461658388155819 0.369519192643462
COMT C (Intercept) NA 10.7859291790623 3.55357927507236 7.16975422706733 3.88600116133854
COMT C n_genes NA 0.000338013386848622 -0.000638047565651715 -0.000150017089401546 -0.602478956953161
COMT C pair NA 0.763112385689135 -0.618137024899985 0.0724876803945752 0.20571700057504
COMT C percent_mito NA 39.1117813733424 -8.40630622779144 15.3527375727755 1.26649931534773
COMT C percent_ribo_p NA 6.32602519941846 -11.8236927561382 -2.74883377835986 -0.593685832282923
COMT D groupControl 8.72957003792791E-05 4.89103889665209 1.16019280877743 3.02561585271476 3.17895617385414
COMT D (Intercept) NA -4.01790206062373 -10.9537976822741 -7.48584987144891 -4.23074306248644
COMT D n_genes NA 0.00127192998088608 0.000307072973360384 0.000789501477123231 3.20751043695249
COMT D pair NA 0.816996259468842 -0.831152113252747 -0.00707792689195255 -0.0168340205566892
COMT D percent_mito NA 56.7523833573714 4.82638755934986 30.7893854583606 2.32431119238381
COMT D percent_ribo_p NA 10.1403022854827 -8.9881881848373 0.576057050322684 0.118049155360658
COMT H groupControl 0.000446798779218395 NA NA NA NA
COMT S groupControl NA NA NA NA 2.50915099452388
COMT S (Intercept) NA NA NA NA -0.243769336060817
COMT S n_genes NA NA NA NA 1.84203542471195
COMT S pair NA NA NA NA 0.133560436021699
COMT S percent_mito NA NA NA NA 2.53908645997288
COMT S percent_ribo_p NA NA NA NA -0.336325919632768
COMT logFC groupControl NA 0.358433760158453 -0.19205639243217 0.0831886838631414 0.592369631048775
COMT logFC n_genes NA 1.30767850726043E-05 -6.89901568399462E-06 3.08888469430484E-06 0.606143686253434
COMT logFC pair NA 0.00351857412776664 -0.00349453670414979 1.2018711808429E-05 0.00671777271161418
COMT logFC percent_mito NA 45.7679197278209 -0.730975601504273 22.5184720631583 1.89834162373432
COMT logFC percent_ribo_p NA 0.0549897837974593 -0.0542142072425823 0.000387788277438516 0.0139198402946213

I am not sure how to interpret genes with FDR < 0.01 (for example) but no coefficient. Is this an issue, or how can it be interpreted? I was also reading https://github.com/RGLab/MAST/issues/98 but I'm not sure how to adapt the reply to that issue to my data.

I created a small dataset (n=132) of cells in which this behavior appears, that I can share privately if necessary. Please let me know if you need other information.

Thanks

lijiaxiaoxiong commented 4 years ago

I also encounter the same problem. I found many genes with NA logFC are also with small FDR, and they could be interesting study objects. Any help will be appreciated!

gfinak commented 4 years ago

NA is missing, NaN is "Not a Number". They are not the same. You have inestimable coefficients for some genes because there's not enough data to fit the proposed model. Do you filter genes that have low expression frequency across cells? Have you plotted your data vs the predictors for one of these genes and compared it to a gene that has all coefficients properly estimated? Maybe post your test data set, include genes that do and don't show this problem.

fbrundu commented 4 years ago

Hi @gfinak, thanks for your reply. The NaN for the coefficient (i.e. coef) gets translated to NA when saving the results to file, that's why it appears as NA here. But in the original results is NaN, hence the title. Regarding the low expression, how do you define the threshold for expression frequency across cells? I.e. which threshold is necessary for MAST to correctly estimate DE? In the previous example, COMT - correctly estimated - is expressed in 12 cells, CHCHD2 (NaN coef) is instead detected in 22 cells. I am not sure how to plot the data against the predictors, is it included in the tutorial? I can send you a minimal example dataset privately, is there an email address I can use to send the link to the rds file?

Thanks

fbrundu commented 4 years ago

I possibly spotted the issue. Assuming testing by condition, I computed the number of counts for each condition, for each gene (correctly estimated and not). The genes not estimated have one condition with zero counts, while the others have positive counts on both conditions. That's possibly why the coefficient cannot be estimated.

gfinak commented 4 years ago

Yes, that's precisely why. It is recommended to do some pre-filtering by removing genes that are expressed in fewer than 10% of cells i.e. if M is your matrix of counts and rows are genes and columns are cells, keeps genes where rowMeans(M>0) > 0.1 How many total cells are in your experiment?

fbrundu commented 4 years ago

Ok, thanks! Just asking: is there a way that you know to define the threshold more accurately? I imagine that if one population contributes to less than 10% of total number of cells and the markers are highly specific, we might lose those markers, even if we have enough cells to compute DE. The experiment has a total of 30k cells, however, I evaluate the condition in each cell subtype (in this case n=132), usually ranging from 100 to 5k cells. I use a very loose threshold on the genes I test (minimum of n=3 cells) because I would like not to filter too much beforehand without a clear rationale.

gfinak commented 4 years ago

Think about statistical power. How much power do you have to detect a difference with a sample size of 3? Especially after you adjust for multiple testing. We chose 10% because it's an empirical lower limit for the discrete part of the test.

fbrundu commented 4 years ago

Ok thanks, it is clear.

fbrundu commented 4 years ago

For future reference:

If I read it correctly, such genes where the continuous component cannot be estimated should be dropped.