JuliaStats / MixedModels.jl

A Julia package for fitting (statistical) mixed-effects models
http://juliastats.org/MixedModels.jl/stable
MIT License
407 stars 48 forks source link

Finding your way around for R users #1

Open dmbates opened 11 years ago

dmbates commented 11 years ago

In private email Martin Maechler asked me several questions about using Julia and the MixedModels package. Because the answers to which may be of more general interest, I created this issue for them so that others may see the answers and comment on them.

I am assuming that a recent version of Julia 0.2 devel is installed. The easiest way to do that for Ubuntu users is using the julianightlies PPA and the julia-deps PPA. As of today, the "nightlies" version produces

$ /usr/bin/julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0+nightly20130502.9a72fab
 _/ |\__'_|_|_|\__'_|  |  -Linux-x86_64 (2013-05-03 19:14:05)
|__/                   |

on startup. There have been too many changes since the 0.1.2 release to try using that. Please ensure that your version string starts with 0.2.0

The system for attaching packages has changed in version 0.2.0. To install this package use

Pkg.add("MixedModels")

After that, attaching this and other packages is done with the using directive.

using MixedModels

More than one package can be listed. For testing I often use

using RDatasets, MixedModels

This directive doesn't return instantaneously but the delay is due to the size and complexity of the DataFrames package upon which both of these depend.

As in R, attaching a package also attaches its dependencies but there are distinctions between those packages that are visible from the REPL and those that are visible only inside the package listed in using. The equivalent of R's ls() is whos() and it can take an argument which is the name of a Module. (Most packages are organized as a Module - which is a namespace - but you can have Modules that are not separate packages).

julia> using MixedModels

julia> whos()
Base                          Module
Core                          Module
DataFrames                    Module
Distributions                 Module
GZip                          Module
Main                          Module
MixedModels                   Module
NLopt                         Module
OptionsMod                    Module
Stats                         Module

julia> whos(MixedModels)
LMMsimple                     DataType
LMMsplit                      DataType
MixedModel                    DataType
MixedModels                   Module
SparseMatrixRSC               DataType
VarCorr                       Function
fixef                         Function
ranef                         Function
reml                          Function

It happens that MixedModels attaches the DataFrames and Distributions namespaces in the REPL as well as its own namespace. However is does not the NLopt namespace in the REPL. I am not sure how to discover this other than by reading the source code or experimenting with names. The directive using MixedModels evaluates the source file

joinpath(Pkg.dir(), "MixedModels", "src", "MixedModels.jl")

which, in turn, has several calls to include

include("RSC.jl")           # regular sparse column-oriented matrices
include("utils.jl")         # utilities to deal with the model formula
include("linearmixedmodel.jl")
include("LMM.jl")

The fact that this driver file has the directive

using DataFrames, Distributions

both before and after the directive

module MixedModels

means that these namespaces will be visible in the REPL (because of the first using) and withing the module itself, because of the second using.

You can always try evaluating a name in the namespace to see if it is visible in the REPL.

julia> whos(NLopt)
ForcedStop                    DataType
NLOPT_VERSION                 VersionNumber
NLopt                         Module
Opt                           DataType
algorithm                     Function
algorithm_name                Function
default_initial_step!         Function
equality_constraint!          Function
force_stop                    Function
force_stop!                   Function
ftol_abs                      Function
ftol_abs!                     Function
ftol_rel                      Function
ftol_rel!                     Function
inequality_constraint!        Function
initial_step                  Function
initial_step!                 Function
local_optimizer!              Function
lower_bounds                  Function
lower_bounds!                 Function
max_objective!                Function
maxeval                       Function
maxeval!                      Function
maxtime                       Function
maxtime!                      Function
min_objective!                Function
optimize                      Function
optimize!                     Function
population                    Function
population!                   Function
remove_constraints!           Function
stopval                       Function
stopval!                      Function
upper_bounds                  Function
upper_bounds!                 Function
vector_storage                Function
vector_storage!               Function
xtol_abs                      Function
xtol_abs!                     Function
xtol_rel                      Function
xtol_rel!                     Function

julia> algorithm
ERROR: algorithm not defined

The fully qualified name will be accessible

julia> NLopt.algorithm
# methods for generic function algorithm
algorithm(o::Opt) at /home/bates/.julia/NLopt/src/NLopt.jl:143

Martin asked about the equivalent of R's installed.packages() which is

julia> Pkg.installed()
["GLM"=>"aefc6c1c616f784ddec920ea07ee6e3ab712bb50","Tk"=>v"0.0.1","MathProg"=>v"0.0.0","MAT"=>v"0.2.0","TextWrap"=>v"0.0.0","NLopt"=>v"0.0.0","Color"=>v"0.2.0","Mustache"=>v"0.0.0","Benchmark"=>v"0.0.0","Zlib"=>v"0.0.0","Debug"=>v"0.0.0","PyCall"=>v"0.0.0","Graphs"=>v"0.2.2","Gadfly"=>v"0.0.0","Compose"=>v"0.0.0","Metis"=>v"0.0.0","Cubature"=>v"0.1.0","CoinMP"=>v"0.0.1","GetC"=>v"0.0.0","Winston"=>v"0.0.0","RDatasets"=>"9000825fff7b059afc1973201d48e46521990a69","Profile"=>v"0.2.0","DataFrames"=>v"0.2.1","GZip"=>v"0.2.0","StrPack"=>v"0.0.0","IniFile"=>v"0.0.0","JSON"=>v"0.1.0","BinDeps"=>v"0.0.1","OpenGL"=>v"0.0.0","Rif"=>v"0.0.0","Iterators"=>v"0.0.0","Distributions"=>"1ba699146607c6c0a41e9b2c2266aee4c99e0141","HDF5"=>"8b7344bee440f3467adb2be6afe478677b25f224","Stats"=>v"0.2.0","Codecs"=>v"0.0.0","Clp"=>v"0.0.1","SemidefiniteProgramming"=>v"0.0.0","Clang"=>"6e995712fc89700190d7f79aed09e94e10752188","Options"=>v"0.2.0","ArgParse"=>v"0.2.1","Optim"=>"1dd33924ef113d960d8ffe981519ea78558f8681","DataStructures"=>v"0.2.3","Cairo"=>v"0.0.1","MathProgBase"=>v"0.0.0","Calculus"=>v"0.1.1"]

which is a hash or Dict

julia> typeof(Pkg.installed())
Dict{String,Union(String,VersionNumber)}

A more readable display is returned by

julia> Pkg.status()
ArgParse         0.2.1
Benchmark        0.0.0
BinDeps          0.0.1
Cairo            0.0.1
Clang            master
Clp              0.0.1
Codecs           0.0.0
CoinMP           0.0.1
Color            0.2.0
Compose          0.0.0
Cubature         0.1.0
Debug            0.0.0
Distributions    master
GLM              master
GZip             0.2.0
Gadfly           0.0.0
GetC             0.0.0
Graphs           master
HDF5             master
Iterators        0.0.0
JSON             0.1.0
MAT              0.2.0
MathProg         0.0.0
Metis            0.0.0
Mustache         0.0.0
NLopt            0.0.0
OpenGL           0.0.0
Optim            master
Options          0.2.0
Profile          master
PyCall           0.0.0
RDatasets        master (dirty)
Rif              0.0.0
SemidefiniteProgramming 0.0.0
Stats            0.2.0
StrPack          0.0.0
TextWrap         0.0.0
Tk               0.0.1
Winston          0.0.0
DataFrames       master (dirty)
DataStructures   0.2.3
Zlib             0.0.0
MathProgBase     0.0.0
Calculus         master
IniFile          0.0.0

I will be adding an lmer() function soon. Right now the package has two user-defined types for simple linear mixed models, LMMsimple and LMMsplit because I was experimenting to see if the complexity of handling the fixed-effects part separately from the random-effects part of the model was warranted (and apparently it is). The LMMsimple class throws both into a single sparse matrix and updatemu() for that class is only four lines.

## Update L, solve for ubeta and evaluate mu
function updatemu(m::LMMsimple)
    m.A.nzval[:] = m.anzv[:]            # restore A to ZXt*ZXt', update A and L
    chm_factorize!(m.L, pluseye!(chm_scale!(m.A, m.lambda, 3), 1:m.Zt.q))
    m.ubeta[:] = (m.L \ (m.lambda .* m.Zty))[:]
    m.mu[:] = m.Zt'*(m.lambda .* m.ubeta)
end

For LMMsplit the random-effects part of the system matrix is stored as a sparse symmetric matrix and the fixed-effects part is a dense matrix.

The formula/data interface is slightly more verbose than in R. A formula must be enclosed in :( ) to retain it as an expression. Thus

julia> using RDatasets

julia> ds = data("lme4", "Dyestuff")
Written by version 2.15.3
Minimal R version: 2.3.0
WARNING: has(d,k) is deprecated, use haskey(d,k) instead.
 in make_unique at /home/bates/.julia/DataFrames/src/utils.jl:47
30x2 DataFrame:
         Batch  Yield
[1,]       "A" 1545.0
[2,]       "A" 1440.0
[3,]       "A" 1440.0
[4,]       "A" 1520.0
[5,]       "A" 1580.0
[6,]       "B" 1540.0
[7,]       "B" 1555.0
[8,]       "B" 1490.0
[9,]       "B" 1560.0
[10,]      "B" 1495.0
[11,]      "C" 1595.0
[12,]      "C" 1550.0
[13,]      "C" 1605.0
[14,]      "C" 1510.0
[15,]      "C" 1560.0
[16,]      "D" 1445.0
[17,]      "D" 1440.0
[18,]      "D" 1595.0
[19,]      "D" 1465.0
[20,]      "D" 1545.0
[21,]      "E" 1595.0
[22,]      "E" 1630.0
[23,]      "E" 1515.0
[24,]      "E" 1635.0
[25,]      "E" 1625.0
[26,]      "F" 1520.0
[27,]      "F" 1455.0
[28,]      "F" 1450.0
[29,]      "F" 1480.0
[30,]      "F" 1445.0

julia> fm1 = LMMsplit(:(Yield ~ 1|Batch), ds)
Linear mixed model fit by maximum likelihood
 logLik: -163.66352994061094, deviance: 327.3270598812219

  Variance components: [1388.34, 2451.25]
  Number of obs: 30; levels of grouping factors: [6]
  Fixed-effects parameters: [1527.5]

The first time you do this the response will seem slow, which is because all the methods are being encountered for the first time and need to be compiled. Subsequent calls to will be much faster.

mmaechler commented 11 years ago

That was really very useful. whos() is now a "must know" for me.

Interesting that my julia version of April 26, was "way too old" and e.g. couldn't get data(.,.) to work... Using the Ubuntu PPAs instead indeed, was sufficient.... and much faster than the 'git pull ; make' ... aah it fails re-building 'make realclean; make' ... dance which would usually take many hours..

Thank you, Doug!

dmbates commented 8 years ago

In Julia multiple values are returned from a function as a tuple. You can write

return (a, b)

and extract the values a and b with indexes. A common idiom is to assign the values of a returned function to a tuple of identifiers

n, p = size(X)

On Wed, Apr 20, 2016 at 12:13 AM Daijiang Li notifications@github.com wrote:

How do you return multiple named values so you can extract part of them later? In R, we can use return(list(a = a, b = b)). Thanks!

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/dmbates/MixedModels.jl/issues/1#issuecomment-212259714

daijiang commented 8 years ago

Thanks Prof. Bates! I ended with return a Dict so I can extract values by name later. I am converting Tony Ives' R code to do phylogenetic mixed models (mixed model with specific phylogenetic var-cov matrix) into Julia. As the R code is too slow and it is unlikely to be included into the lme4 package.

I am new to Julia, even though I attended your Julia workshop two years ago... The code is here. Even now, it is >10 times faster than R. But I wonder can you have a look at them and let me know how to improve/profile it?

Is it possible to let your mixed model to do such kind of analyses?

Thank you for your help! Daijiang

dmbates commented 8 years ago

First I would suggest creating a package. Secondly don't pass around a huge number of arguments. If they are related to the same model then create a type for the model.

On Thu, Apr 21, 2016 at 10:31 AM Daijiang Li notifications@github.com wrote:

Thanks Prof. Bates! I ended with return a Dict so I can extract values by name later. I am converting Tony Ives' R code to do phylogenetic mixed models (mixed model with specific phylogenetic var-cov matrix) into Julia. As the R code is too slow and it is unlikely to be included into the lme4 package.

I am new to Julia, even though I attended your Julia workshop two years ago... The code is here https://github.com/daijiang/pglmm.jl. Even now, it is >10 times faster than R. But I wonder can you have a look at them and let me know how to improve/profile it?

Is it possible to let your mixed model to do such kind of analyses?

Thank you for your help!

Daijiang

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/dmbates/MixedModels.jl/issues/1#issuecomment-212975338