IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Multinomial distribution in R-Instat #1976

Open rdstern opened 7 years ago

rdstern commented 7 years ago

This first message explains what the multinomial distribution is, and why it could be important for R-Instat. The first "reply" gets down to more detail, even though I claim it is relatively low priority now. This issue is more for reference at this stage (and to prevent us spending time on piecemeal parts of the solution).

It arises in R-Instat because data often include factor columns and we make a big deal of them. There are two types of factor column, namely ordered and ordinary. We do need urgently to include ordered factors as a data type and I gather this is easy to do.

In data sets, if we have a column that corresponds to a "Yes"/"No" question then a suitable model for a single response is called Bernoulli. If we ask about the number of "Yes" responses in a survey with 20 respondents, then we get to the binomial distribution. In teaching probability ideas, this corresponds to spinning a coin - heads/tails.

We may have questions where there are more than 2 alternative answers. For example: "Do you like R-Instat? Please give your answer on a 1 to 5 scale where 5 is 'Very much' and 1 is 'Not at all'." We now have 5 probabilities, i.e. for 1, 2, 3, 4, 5. The basic distribution is sometimes called categorical (Bernoulli is therefore a special case with just 2 choices). When we have 20 responses and want to know the distribution of the number of 1's, 2's, etc., then this is the multinomial distribution. In teaching probability we swap our coin for a die. It is therefore the natural distribution when we want to analyse factor columns.

The structure of R-Instat seems OK for what may eventually be needed:

a) We will be doing lots on tabulation, when Danny and David get down to it. Tables are indexed by factors - so that will be a good start.

b) There has been a lot of recent work on visualising the resulting data - some using ggplot. So there is stuff to add to our graphics capabilities.

c) In the model menu it should be easy(?) to add the multinomial distribution to the probability tables we can generate using the Model > Probability Distributions > Show Model dialogue. The big difference is that we would be generating multiple columns of probabilities, because it is a multivariate distribution.

d) In the Model > One Variable dialogues I think it could be added reasonably easily.

e) In the more general modelling dialogues we can already cope with the situation where the x-variable is a factor. Here the y-variable is the factor being modelled. This needs to be added - see below.

Roger
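
As a rough illustration of the distributions described in this message, here is a minimal base R sketch. The five probabilities, the 0.6 chance of "Yes" and the particular counts are made-up values for illustration only; `dbinom`, `dmultinom` and `rmultinom` are standard functions in R's stats package. It is not a proposal for how R-Instat itself should implement this.

```r
# Hypothetical probabilities for answers 1..5 on the "Do you like R-Instat" scale
probs  <- c(0.05, 0.10, 0.20, 0.35, 0.30)
n_resp <- 20   # number of respondents, as in the message

# Binomial: probability of exactly 12 "Yes" answers out of 20
# (P(Yes) = 0.6 is an assumed value)
p_12_yes <- dbinom(12, size = n_resp, prob = 0.6)

# The binomial is the two-category special case of the multinomial
p_12_yes_mn <- dmultinom(c(12, 8), size = n_resp, prob = c(0.6, 0.4))

# Multinomial: probability of one particular set of counts for the 5 answers
counts   <- c(1, 2, 4, 7, 6)   # counts of 1's, 2's, ..., 5's; they sum to n_resp
p_counts <- dmultinom(counts, size = n_resp, prob = probs)

# Simulate 1000 surveys. rmultinom() returns one column per survey,
# so transpose to get one row per survey and one column per answer.
set.seed(1)
sim_table <- t(rmultinom(1000, size = n_resp, prob = probs))
colnames(sim_table) <- paste0("score_", 1:5)

head(sim_table)
colMeans(sim_table)   # should be close to n_resp * probs
```

The simulated table has several columns of counts rather than one, which is the "multiple columns" point made in c) above.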

rdstern commented 7 years ago

Let me start from the modelling, because David and Steve have already looked at this area. There is a good example here: http://www.ats.ucla.edu/stat/r/dae/ologit.htm . This is to model ordered factors, which is what we would use more commonly. For un-ordered factors see for example: http://www.ats.ucla.edu/stat/r/dae/mlogit.htm . That is where it describes "Multinomial Logistic Regression".

The common commands for the modelling seem to be multinom (from the nnet package) for the general case and polr (from the MASS package that we use already) for the ordinal case.
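
For reference, the two commands can be tried side by side on the `housing` data that ships with MASS. This is only a sketch under the assumption that this data set is a reasonable stand-in; it is not one of the UCLA examples, and the choice of explanatory variables simply follows the data set's own columns. `Sat` is an ordered factor and `Freq` holds the cell counts.

```r
library(MASS)   # polr() and the housing data
library(nnet)   # multinom()

# Sat is an ordered factor: Low < Medium < High
str(housing$Sat)

# Ordinal case: proportional-odds logistic regression
fit_ord <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
                data = housing, Hess = TRUE)
summary(fit_ord)

# General (un-ordered) case: multinomial logistic regression
fit_nom <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                    data = housing, trace = FALSE)
summary(fit_nom)

# Fitted probabilities: one column per level of the y-variable
probs_ord <- predict(fit_ord, type = "probs")
probs_nom <- predict(fit_nom, type = "probs")
head(probs_ord)
head(probs_nom)
```

Both fits return a matrix of probabilities with one column per factor level, so any dialogue built on them would have to handle several fitted columns rather than one.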

However Yee criticises this as being piecemeal and argues that his VGAM package for categorical data analysis provides a proper unified approach - a "down from the mountains" argument.
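
To give a feel for what the unified approach looks like, here is a small sketch using VGAM's own `pneumo` data. It assumes the VGAM package is installed; the data set and the `let` variable are choices made for this illustration and do not come from this thread. The same `vglm()` call fits either model, and only the `family` argument changes.

```r
library(VGAM)

# pneumo has counts in three ordered categories (normal, mild, severe)
# for groups of miners with different exposure times
pneumo <- transform(pneumo, let = log(exposure.time))

# Ordinal case: cumulative logit model with proportional odds
fit_po <- vglm(cbind(normal, mild, severe) ~ let,
               family = cumulative(parallel = TRUE), data = pneumo)

# Un-ordered case: multinomial logit model, same function and formula
fit_mn <- vglm(cbind(normal, mild, severe) ~ let,
               family = multinomial, data = pneumo)

coef(fit_po, matrix = TRUE)
coef(fit_mn, matrix = TRUE)
```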

My conclusion is that this area will be a great topic for an MSc student project. And they might as well look at all aspects, so we leave the whole thing for then?

There also seem to be exciting developments on the graphical side. David knows Antony Unwin (Augsburg) who has a 2015 book called "Graphical Data Analysis with R". I suggest we do get this book - Unwin is a ggplot user. There are also packages, with extracat in particular.

This looks to be a great area. But, having got this far, I realise it could also be very time-consuming. Hence I suggest we have other more important priorities, with climatic in particular. So we should resist getting distracted by this area. It will be great fun, but it can wait!