RfastOfficial / Rfast2

A collection of Rfast2 functions for data analysis. Note 1: The vast majority of the functions accept matrices only, not data.frames. Note 2: Do not have matrices or vectors with have missing data (i.e NAs). We do no check about them and C++ internally transforms them into zeros (0), so you may get wrong results. Note 3: In general, make sure you give the correct input, in order to get the correct output. We do no checks and this is one of the many reasons we are fast.
38 stars 4 forks source link

Make multinom.reg accept a matrix with outcome counts & optionally also output variance-covariance matrix #19

Open tomwenseleers opened 1 year ago

tomwenseleers commented 1 year ago

I was wondering if it would be possible to change the multinom.reg function so that it would accept a matrix with outcome counts Y (with different outcome levels in different columns), similar to the input format expected by the MGLMregfunction in library(MGLM). Now it seems multinom.reg expects each individual count in long format, but this will of course become unfeasible & become very slow when one is dealing with large counts.

It could also be nice if there was an option to output the variance covariance matrix (input argument vcov=T or F). Here I posted a bit of Rcpp code to calculate that for a multinomial fit using Kronecker products: https://stackoverflow.com/questions/73811835/faster-way-to-calculate-the-hessian-fisher-information-matrix-of-a-nnetmulti

Very efficient multinomial regression algorithms with L2 norm (ridge) or L1 norm (LASSO) regularisation have also been described by https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1424458. I haven't yet seen those ported to R, though I saw a Matlab, C & C# implementation here : https://github.com/slavomir-sidor/princeton-mvpa-toolbox/blob/master/core/learn/smlr.m https://github.com/slavomir-sidor/princeton-mvpa-toolbox/blob/master/core/learn/smlr_mex.c https://github.com/inzxx/adastra/blob/master/src/Accord24/Accord.Statistics/Models/Regression/Nonlinear/Fitting/LowerBoundNewtonRaphson.cs

There would be a closed-form formula for the variance-covariance matrix only for the ones with a ridge penalty though I believe...

statlink commented 1 year ago

Hi Tom,

The function accepts a long format, but actually it accepts a vector. A categorical variable. We do not do what you are suggesting. I understand your suggestion, but that would require to do the mathematics again and change the codes, etc.

About the covariance matrix of the coefficients I did not want to show it because I think it is not consistent and if one wants to perform a Wald test that would not be size correct. I think, I am not 100% sure, that even with a sandwich estimator the Wald test would not be size correct either. For this reason, we give the log-likelihood so one can perform a log-likelihood ratio test.

As for the Lasso multinomial regression, what is wrong with Tibshirani's implementation in gmlnet? The ones you suggest are faster?

Michail

tomwenseleers commented 1 year ago

Hi Michael, Yes I understand it would require quite a big change in the code - it was just that I was dealing with datasets with 15 million counts in total, which would make working with such a vectorized categorical variable unfeasible, whereas working with counts, incorporated as weights, should still work perfectly well. As for the vcov matrix - my interest in that would not just be to calculate confidence intervals or p levels on my coefficients, but I would also need it to calculate confidence intervals on the predicted values. I think in models with 1000s of coefficients, like the ones I am fitting, log likelihood ratio tests would become unfeasible, no? For regularised multinomial models glmnet does indeed work - it's just that it lacks a lot of the essentials that one would need for inference, e.g. the variance-covariance matrix etc... And it fits multinomial regressiins as a Poisson GLM using the Poisson trick, which is not very efficient when you have many outcome levels...

cheers, Tom