NaN in coefficient when using model with several covariates (but FDR < 1)

fbrundu commented 4 years ago

Hi! Thanks for your work on this tool. I have an issue computing differential gene expression in a model with several covariates. The model is the following:

~ group + n_genes + pair + percent_mito + percent_ribo_p

where group is the condition that I want to test (i.e. Case or Control). n_genes is the number of detected genes, pair is a categorical label that indicates the batch, percent_mito and percent_ribo_p are respectively the percentages of mitochondrial RNA and of the ribosomal proteins RNA in each cell.

When I analyze the DEG results, I notice that some genes have NAs in place of coefficient, but with FDR < 1. For example:

primerid	component	contrast	Pr(>Chisq)	ci.hi	ci.lo	coef	z
CHCHD2	C	groupControl	1	NA	NA	NA	NA
CHCHD2	C	(Intercept)	NA	13.2139294612287	7.27487862856727	10.244404044898	6.76157303138578
CHCHD2	C	n_genes	NA	0.00026628978864847	-0.000415653246692049	-7.46817290217894E-05	-0.42928365449996
CHCHD2	C	pair	NA	0.603598745441239	-1.44191779536421	-0.419159524961484	-0.803256836415448
CHCHD2	C	percent_mito	NA	26.8917686666073	-24.2948918010666	1.29843843277039	0.0994357725673384
CHCHD2	C	percent_ribo_p	NA	7.78446426292038	-2.99466694859656	2.39489865716191	0.870926426731957
CHCHD2	D	groupControl	2.187508828583E-11	6.92093514687845	2.50218975637947	4.71156245162896	4.17968988933359
CHCHD2	D	(Intercept)	NA	-5.45262320626328	-12.9538395476205	-9.20323137694191	-4.80935390191168
CHCHD2	D	n_genes	NA	0.000605309648438176	-0.000322803454311596	0.00014125309706329	0.596588890143987
CHCHD2	D	pair	NA	2.63157523052559	0.813623683086202	1.72259945680589	3.7143266000497
CHCHD2	D	percent_mito	NA	6.69954331431088	-48.5358111642693	-20.9181339249792	-1.4845125736504
CHCHD2	D	percent_ribo_p	NA	11.8916501312325	-5.06608244072294	3.41278384525477	0.788894788302249
CHCHD2	H	groupControl	2.187508828583E-11	NA	NA	NA	NA
CHCHD2	S	groupControl	NA	NA	NA	NA	NA
CHCHD2	S	(Intercept)	NA	NA	NA	NA	1.38042738481323
CHCHD2	S	n_genes	NA	NA	NA	NA	0.118302666651904
CHCHD2	S	pair	NA	NA	NA	NA	2.0584371703729
CHCHD2	S	percent_mito	NA	NA	NA	NA	-0.979397198510006
CHCHD2	S	percent_ribo_p	NA	NA	NA	NA	1.17367083670798
CHCHD2	logFC	groupControl	NA	NA	NA	NA	NA
CHCHD2	logFC	n_genes	NA	7.28238715372072E-07	-4.51844845269732E-07	1.3819693505117E-07	0.459053959410812
CHCHD2	logFC	pair	NA	0.0192958459598146	-0.0102844516875691	0.00450569713612275	0.597086900024951
CHCHD2	logFC	percent_mito	NA	0.00289076102603723	-0.00495405141142126	-0.00103164519269202	-0.515496689977027
CHCHD2	logFC	percent_ribo_p	NA	0.323601607126519	-0.248633415386968	0.0374840958697755	0.256773790513932

On the other hand, one gene with FDR < 1 but coefficient different from NA is the following: primerid	component	contrast	Pr(>Chisq)	ci.hi	ci.lo	coef	z
COMT	C	groupControl	0.85483797977702	2.91033719012339	-1.98702041381175	0.461658388155819	0.369519192643462
COMT	C	(Intercept)	NA	10.7859291790623	3.55357927507236	7.16975422706733	3.88600116133854
COMT	C	n_genes	NA	0.000338013386848622	-0.000638047565651715	-0.000150017089401546	-0.602478956953161
COMT	C	pair	NA	0.763112385689135	-0.618137024899985	0.0724876803945752	0.20571700057504
COMT	C	percent_mito	NA	39.1117813733424	-8.40630622779144	15.3527375727755	1.26649931534773
COMT	C	percent_ribo_p	NA	6.32602519941846	-11.8236927561382	-2.74883377835986	-0.593685832282923
COMT	D	groupControl	8.72957003792791E-05	4.89103889665209	1.16019280877743	3.02561585271476	3.17895617385414
COMT	D	(Intercept)	NA	-4.01790206062373	-10.9537976822741	-7.48584987144891	-4.23074306248644
COMT	D	n_genes	NA	0.00127192998088608	0.000307072973360384	0.000789501477123231	3.20751043695249
COMT	D	pair	NA	0.816996259468842	-0.831152113252747	-0.00707792689195255	-0.0168340205566892
COMT	D	percent_mito	NA	56.7523833573714	4.82638755934986	30.7893854583606	2.32431119238381
COMT	D	percent_ribo_p	NA	10.1403022854827	-8.9881881848373	0.576057050322684	0.118049155360658
COMT	H	groupControl	0.000446798779218395	NA	NA	NA	NA
COMT	S	groupControl	NA	NA	NA	NA	2.50915099452388
COMT	S	(Intercept)	NA	NA	NA	NA	-0.243769336060817
COMT	S	n_genes	NA	NA	NA	NA	1.84203542471195
COMT	S	pair	NA	NA	NA	NA	0.133560436021699
COMT	S	percent_mito	NA	NA	NA	NA	2.53908645997288
COMT	S	percent_ribo_p	NA	NA	NA	NA	-0.336325919632768
COMT	logFC	groupControl	NA	0.358433760158453	-0.19205639243217	0.0831886838631414	0.592369631048775
COMT	logFC	n_genes	NA	1.30767850726043E-05	-6.89901568399462E-06	3.08888469430484E-06	0.606143686253434
COMT	logFC	pair	NA	0.00351857412776664	-0.00349453670414979	1.2018711808429E-05	0.00671777271161418
COMT	logFC	percent_mito	NA	45.7679197278209	-0.730975601504273	22.5184720631583	1.89834162373432
COMT	logFC	percent_ribo_p	NA	0.0549897837974593	-0.0542142072425823	0.000387788277438516	0.0139198402946213

I am not sure how to interpret genes with FDR < 0.01 (for example) but no coefficient. Is this an issue, or how can it be interpreted? I was also reading https://github.com/RGLab/MAST/issues/98 but I'm not sure how to adapt the reply to that issue to my data.

I created a small dataset (n=132) of cells in which this behavior appears, that I can share privately if necessary. Please let me know if you need other information.

Thanks

lijiaxiaoxiong commented 4 years ago

I also encounter the same problem. I found many genes with NA logFC are also with small FDR, and they could be interesting study objects. Any help will be appreciated!

gfinak commented 4 years ago

NA is missing, NaN is "Not a Number". They are not the same. You have inestimable coefficients for some genes because there's not enough data to fit the proposed model. Do you filter genes that have low expression frequency across cells? Have you plotted your data vs the predictors for one of these genes and compared it to a gene that has all coefficients properly estimated? Maybe post your test data set, include genes that do and don't show this problem.

fbrundu commented 4 years ago

Hi @gfinak, thanks for your reply. The NaN for the coefficient (i.e. coef) gets translated to NA when saving the results to file, that's why it appears as NA here. But in the original results is NaN, hence the title. Regarding the low expression, how do you define the threshold for expression frequency across cells? I.e. which threshold is necessary for MAST to correctly estimate DE? In the previous example, COMT - correctly estimated - is expressed in 12 cells, CHCHD2 (NaN coef) is instead detected in 22 cells. I am not sure how to plot the data against the predictors, is it included in the tutorial? I can send you a minimal example dataset privately, is there an email address I can use to send the link to the rds file?

Thanks

fbrundu commented 4 years ago

I possibly spotted the issue. Assuming testing by condition, I computed the number of counts for each condition, for each gene (correctly estimated and not). The genes not estimated have one condition with zero counts, while the others have positive counts on both conditions. That's possibly why the coefficient cannot be estimated.

gfinak commented 4 years ago

Yes, that's precisely why. It is recommended to do some pre-filtering by removing genes that are expressed in fewer than 10% of cells i.e. if M is your matrix of counts and rows are genes and columns are cells, keeps genes where rowMeans(M>0) > 0.1 How many total cells are in your experiment?

fbrundu commented 4 years ago

Ok, thanks! Just asking: is there a way that you know to define the threshold more accurately? I imagine that if one population contributes to less than 10% of total number of cells and the markers are highly specific, we might lose those markers, even if we have enough cells to compute DE. The experiment has a total of 30k cells, however, I evaluate the condition in each cell subtype (in this case n=132), usually ranging from 100 to 5k cells. I use a very loose threshold on the genes I test (minimum of n=3 cells) because I would like not to filter too much beforehand without a clear rationale.

gfinak commented 4 years ago

Think about statistical power. How much power do you have to detect a difference with a sample size of 3? Especially after you adjust for multiple testing. We chose 10% because it's an empirical lower limit for the discrete part of the test.

fbrundu commented 4 years ago

Ok thanks, it is clear.

fbrundu commented 4 years ago

For future reference:

If I read it correctly, such genes where the continuous component cannot be estimated should be dropped.

RGLab / MAST

NaN in coefficient when using model with several covariates (but FDR < 1) #135