Issues from benchmarking mclust and MPlus

jrosen48 commented 5 years ago

I've tried to show the results of benchmarking mclust and MPlus here:

https://data-edu.github.io/tidyLPA/articles/benchmarking-mclust-and-mplus.html

We can make some changes to how numbers are rounded, as they're presently quite different between the two packages.
Perhaps more importantly, even with a relatively simple data set, as models become more complex, it becomes more likely that they do not reach the same exact solution. I wonder if one way we can head this off is by demonstrating how the estimation can be controlled, i.e., by increasing the number iterations or changing the convergence criterion. Presently, we state that additional arguments can be passed to estimate_profiles() as follows: `

... Additional arguments are passed to the estimating function; i.e., Mclust, or mplusModeler.

Can we show an example of how to add these additional arguments for both estimating functions?

cjvanlissa commented 5 years ago

Point 1: We could change the standard digits argument to 2. Mplus always reports 3 digits, Mclust as many as R allows. Point 2: The main problem is that Mplus uses many sets of random starts because mixture models notoriously have convergence issues, and Mclust just uses one set of random starts. You can pass additional starts to Mplus through ..., but not to Mclust.

jrosen48 commented 5 years ago

Re point 1, that sounds good. Could you make this change when you have the chance? Re point 2... yeah. We do have some options that can be passed to mclust::Mclust() to control the estimation (i.e., the number of iterations allowed for the EM algorithm and the convergence tolerance: https://rdrr.io/cran/mclust/man/emControl.html). And, there are different ways that EM can be initialized: https://www.rdocumentation.org/packages/mclust/versions/5.4.1/topics/Mclust. But, I agree that these don't (easily) let us reproduce what MPlus does... and that it's probably not the best use of our time to go down that road. Nevertheless, I'd like to at least show how to pass additional starts (and possibly control other aspects of the estimation) for MPlus. Does that sound aight?

jrosen48 commented 5 years ago

Still to do: show how to pass additional arguments through to mclust/mplus

jrosen48 commented 5 years ago

assigned myself

jrosen48 commented 5 years ago

@cjvanlissa coming back to this; I'm trying to pass additional arguments on to MPlus, but the ... argument (or what I'm entering for it) isn't behaving as I expected, e.g.:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyLPA)
#> tidyLPA has received a major update, with a much easier workflow and improved functionality. However, you might have to update old syntax to account for the new workflow. See vignette('introduction-to-major-changes') for details!
#> 
#> Mplus is not installed. Use only package = 'mclust' when calling estimate_profiles().
iris_sample <- iris[c(1:4, 51:54, 101:104), ] # to make example run more quickly

iris_sample %>%
    subset(select = c("Sepal.Length", "Sepal.Width",
                      "Petal.Length")) %>%
    estimate_profiles(3, package = "mplus", "ANALYSIS: starts = 100,20;")
#> The 'variances'/'covariances' arguments were ignored in favor of the 'models' argument.
#> Warning in Ops.factor(mn, 2): '%%' not meaningful for factors
#> Warning in Ops.factor(mn, 2): '/' not meaningful for factors
#> Warning in runModels(target = modelout, Mplus_command = Mplus_command, logFile = NULL): Mplus returned error code: 1, for model: model_ANALYSIS: starts = 100,20;_class_3.inp
#> Warning: 
#> One or more analyses resulted in warnings! Examine these analyses carefully: model_ANALYSIS: starts = 100,20;_class_3
#> tidyLPA analysis using mplus: 
#> 
#>  Model Classes AIC BIC Entropy prob_min prob_max n_min n_max BLRT_p
#>  1     3

^{Created on 2019-10-10 by the reprex package (v0.3.0)}

Can you advise re: passing along additional arguments (such as starts = 100, 20)?

cjvanlissa commented 5 years ago

Have you tried taking the argument name out of quotation marks?

Gootjes commented 5 years ago

I am guessing the issue can be traced to here. No ... are passed to mplusModeler, although the documentation says it does.

cjvanlissa commented 5 years ago

@Gootjes thank you, fixed now!

Gootjes commented 5 years ago

Sorry for not checking my own solution properly before suggesting it. There is another line that triggers an error. The ... are stored in the Args variable, and this variable is passed to mplusObject before being passed on to mplusModeler later on. However, that function does not accept the same arguments as mplusModeler, leading to Mplus_command = "X:/Software/MPlus8/mplus.exe" being rejected:

> tidyLPA::estimate_profiles(data = d, package = "mplus", n_profiles = 3, models = 1, Mplus_command = "X:/Software/MPlus8/mplus.exe")

Error in (function (TITLE = NULL, DATA = NULL, VARIABLE = NULL, DEFINE = NULL,  : 
  unused argument (Mplus_command = "X:/Software/MPlus8/mplus.exe")

I do not know of an easy fix, except explicitly filtering valid arguments from ..., which should do the job.

cjvanlissa commented 5 years ago

OK, it will be a while before I get to it. If you want to make a pull request to fix it, that would be much appreciated!

jrosen48 commented 4 years ago

@Gootjes thank you for working on this issue. I tried to pass the argument "ANALYSIS: starts = 100,20;" but seem to be running into problems. I installed the latest development version (therefore with #133 merged).

library(tidyLPA)
#> tidyLPA is intended for academic use. We do not make any money on this and only ask that you please cite this in publications when you use the results. You can use the function citation('tidyLPA') to create a citation.Mplus is not installed. Use only package = 'mclust' when calling estimate_profiles().
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris_sample <- iris[c(1:4, 51:54, 101:104), ] # to make example run more quickly

# No arguments passed, works fine
iris_sample %>%
    subset(select = c("Sepal.Length", "Sepal.Width",
                      "Petal.Length")) %>%
    estimate_profiles(3, 
                      package = "mplus")
#> tidyLPA analysis using mplus: 
#> 
#>  Model Classes AIC   BIC   Entropy prob_min prob_max n_min n_max BLRT_p
#>  1     3       62.15 68.94 1.00    1.00     1.00     0.17  0.50  0.03

# With an argument passed on to mplusModeler, returns a warning
iris_sample %>%
    subset(select = c("Sepal.Length", "Sepal.Width",
                      "Petal.Length")) %>%
    estimate_profiles(3, 
                      package = "mplus", 
                      "ANALYSIS: starts = 100,20;")
#> The 'variances'/'covariances' arguments were ignored in favor of the 'models' argument.
#> Warning in Ops.factor(mn, 2): '%%' not meaningful for factors
#> Warning in Ops.factor(mn, 2): '/' not meaningful for factors
#> Warning in runModels(target = modelout, Mplus_command = Mplus_command, logFile = NULL): Mplus returned error code: 1, for model: model_ANALYSIS: starts = 100,20;_class_3.inp
#> Warning: 
#> One or more analyses resulted in warnings! Examine these analyses carefully: model_ANALYSIS: starts = 100,20;_class_3
#> tidyLPA analysis using mplus: 
#> 
#>  Model Classes AIC BIC Entropy prob_min prob_max n_min n_max BLRT_p
#>  1     3

^{Created on 2019-11-24 by the reprex package (v0.3.0)}

Thank you for your work on this!

cjvanlissa commented 4 years ago

You are passing an unnamed string as an argument. The formal argument of mplusModeler() is called ANALYSIS!

jrosen48 commented 4 years ago

Sorry, I'm slow *in general and just waking up (and am not seeing this documented in mplusAutomation::mplusModeler() or mplusAutomation::prepareMplusdata(); just focusing on this last line how could I pass an additional argument such as "ANALYSIS: starts = 100,20;" along to mplusModeler()?

cjvanlissa commented 4 years ago

The argument is called ANALYSIS, so you have to pass

ANALYSIS = "starts = 100, 20"

Instead of

"ANALYSIS: starts = 100,20;"

jrosen48 commented 4 years ago

What am I missing/doing incorrectly here?

library(tidyLPA)
#> tidyLPA is intended for academic use. We do not make any money on this and only ask that you please cite this in publications when you use the results. You can use the function citation('tidyLPA') to create a citation.Mplus is not installed. Use only package = 'mclust' when calling estimate_profiles().
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris_sample <- iris[c(1:4, 51:54, 101:104), ] # to make example run more quickly

# With an argument passed on to mplusModeler, returns a warning
iris_sample %>%
    subset(select = c("Sepal.Length", "Sepal.Width",
                      "Petal.Length")) %>%
    estimate_profiles(3, 
                      package = "mplus", 
                      ANALYSIS = "starts = 100,20;")
#> Error in prepareMplusData(df = data[i, , drop = FALSE], keepCols = object$usevariables, : unused argument (ANALYSIS = "starts = 100,20;")

^{Created on 2019-11-25 by the reprex package (v0.3.0)}

cjvanlissa commented 4 years ago

That syntax is correct, and if it's not working, then we have a bug! Does this fail on both the CRAN and development version?

jrosen48 commented 4 years ago

ah hah, yes. I know @Gootjes worked on a fix for this - wondering if it's specific to this argument? That fix (#133) was after we sent the last version to CRAN (2019-08-19), and so wouldn't reflect those changes... but the CRAN version (I installed via install.packages() and restarted) seems to fail too:

library(tidyLPA)
#> tidyLPA is intended for academic use. We do not make any money on this and only ask that you please cite this in publications when you use the results. You can use the function citation('tidyLPA') to create a citation.Mplus is not installed. Use only package = 'mclust' when calling estimate_profiles().
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris_sample <- iris[c(1:4, 51:54, 101:104), ] # to make example run more quickly

# With an argument passed on to mplusModeler, returns a warning
iris_sample %>%
    subset(select = c("Sepal.Length", "Sepal.Width",
                      "Petal.Length")) %>%
    estimate_profiles(3, 
                      package = "mplus", 
                      ANALYSIS = "starts = 100, 20;")
#> Warning in system2(Mplus_command, args = c(shQuote(inputSplit$filename)), :
#> error in running command
#> Warning in runModels(target = modelout, Mplus_command = Mplus_command, logFile = NULL): Mplus returned error code: 127, for model: model_1_class_3.inp
#> Error in getOutFileList(target, recursive, filefilter): Specified target does not exist.
#>   Target: model_1_class_3.out

^{Created on 2019-11-25 by the reprex package (v0.3.0)}

cjvanlissa commented 4 years ago

Let's discuss tomorrow! Should be an easy fix.

jrosen48 commented 4 years ago

Why I've been working on this - I want to demonstrate how to do this in our tidyLPA paper and also to demonstrate how to pass additional args to mplusModeler() or mclust(). Maybe cold comfort, but it seems to work for mclust(), as this demonstrates in the context of the prior argument:

library(tidyLPA)
#> tidyLPA is intended for academic use. We do not make any money on this and only ask that you please cite this in publications when you use the results. You can use the function citation('tidyLPA') to create a citation.Mplus is not installed. Use only package = 'mclust' when calling estimate_profiles().
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(mclust)
#> Package 'mclust' version 5.4.5
#> Type 'citation("mclust")' for citing this R package in publications.

# Passing arguments to mclust
iris[, -5] %>% estimate_profiles(2:4, model = 6, prior = priorControl())
#> Warning: You are using a deprecated argument when calling 'estimate_profiles.default'. Check the documentation. Deprecated arguments are:
#>   model: Instead, use the argument 'models', or 'variances' and 'covariances'.
#> The 'variances'/'covariances' arguments were ignored in favor of the 'models' argument.
#> tidyLPA analysis using mclust: 
#> 
#>  Model Classes AIC    BIC    Entropy prob_min prob_max n_min n_max BLRT_p
#>  6     2       505.20 592.51 1.00    1.00     1.00     0.33  0.67  0.01  
#>  6     3       473.40 605.87 0.96    0.97     1.00     0.32  0.35  0.01  
#>  6     4       487.30 664.93 0.96    0.98     0.99     0.04  0.36  0.66
summary(mclustBIC(iris[,-5], modelNames = "VVV"))
#> Best BIC values:
#>              VVV,2       VVV,3      VVV,4
#> BIC      -574.0178 -580.839630 -630.59996
#> BIC diff    0.0000   -6.821798  -56.58213

iris[, -5] %>% estimate_profiles(2:4, model = 6, prior = priorControl(shrinkage = 0))
#> Warning: You are using a deprecated argument when calling 'estimate_profiles.default'. Check the documentation. Deprecated arguments are:
#>   model: Instead, use the argument 'models', or 'variances' and 'covariances'.
#> The 'variances'/'covariances' arguments were ignored in favor of the 'models' argument.
#> tidyLPA analysis using mclust: 
#> 
#>  Model Classes AIC    BIC    Entropy prob_min prob_max n_min n_max BLRT_p
#>  6     2       504.76 592.07 1.00    1.00     1.00     0.33  0.67  0.01  
#>  6     3       472.93 605.40 0.96    0.97     1.00     0.32  0.35  0.01  
#>  6     4       486.66 664.29 0.96    0.98     0.99     0.04  0.36  0.73
summary(mclustBIC(iris[,-5], modelNames = "VVV", prior = priorControl(shrinkage = 0)))
#> Best BIC values:
#>              VVV,2      VVV,3      VVV,4
#> BIC      -592.0713 -605.39648 -664.28888
#> BIC diff    0.0000  -13.32516  -72.21757

^{Created on 2019-11-25 by the reprex package (v0.3.0)}

Gootjes commented 4 years ago

Hopefully fixed with #134.

When I wrote the previous fix, I assumed wrongly that all arguments from ... could be passed to mplusModeler. Removing/excluding arguments from ... that have been used in mplusObject (such as the ANALYSIS statement) when calling mplusModeler should fix it, see pull request #134. Maybe you can check that out to see if it fixes your issue?

The second issue you reported (the one with the CRAN version) originates in a different function call, which is the exact error I get when I misspecified the Mplus_command argument that is used down the line in mplusModeler, as my mplus installation is in an odd directory. Is that the case for you as well? If not, we have another bug, and my original fix does not fix your issue. Let me know!

cjvanlissa commented 4 years ago

Thanks Gootjes, I figured as much... appreciate your work on this, I'm gonna check the code after december 5th!

jrosen48 commented 4 years ago

@Gootjes thanks for all of your work on this. I merged the PR and ran again - that same issue is coming up:

library(tidyLPA)
#> tidyLPA is intended for academic use. We do not make any money on this and only ask that you please cite this in publications when you use the results. You can use the function citation('tidyLPA') to create a citation.Mplus is not installed. Use only package = 'mclust' when calling estimate_profiles().
library(dplyr, warn.conflicts = FALSE)

pisaUSA15[1:100, ] %>%
    select(broad_interest, enjoyment, self_efficacy) %>%
    single_imputation() %>%
    estimate_profiles(3, package = "MplusAutomation", 
                      ANALYSIS = "starts = 100, 20;")
#> Error in prepareMplusData(df = data[i, , drop = FALSE], keepCols = object$usevariables, : unused argument (ANALYSIS = "starts = 100, 20;")

^{Created on 2019-11-30 by the reprex package (v0.3.0)}

Gootjes commented 4 years ago

Using the current master branch, I can no longer reproduce the issue. Perhaps your code is accidentally using an older version? These version checksums should be the same as yours:

# This displays your currently installed version (installed from github):
remotes:::local_sha("tidyLPA")
#> [1] "f3e47446b118dc4581123f262996b30c9df0d07c"

# This displays the latest version that is on github.
remotes:::remote_sha(remotes:::github_remote("data-edu/tidyLPA"))
#> [1] "f3e47446b118dc4581123f262996b30c9df0d07c"

^{Created on 2019-12-01 by the reprex package (v0.3.0)}

library(tidyLPA)
#> tidyLPA is intended for academic use. We do not make any money on this and only ask that you please cite this in publications when you use the results. You can use the function citation('tidyLPA') to create a citation.Mplus is installed; you can use package = 'MplusAutomation' when calling estimate_profiles().
library(dplyr, warn.conflicts = FALSE)

pisaUSA15[1:100, ] %>%
    select(broad_interest, enjoyment, self_efficacy) %>%
    single_imputation() %>%
    estimate_profiles(3, package = "MplusAutomation", 
                      ANALYSIS = "starts = 100, 20;")
#> tidyLPA analysis using mplus: 
#> 
#>  Model Classes AIC    BIC    Entropy prob_min prob_max n_min n_max BLRT_p
#>  1     3       626.07 662.54 0.81    0.84     0.95     0.03  0.67  0.00

^{Created on 2019-12-01 by the reprex package (v0.3.0)}

cjvanlissa commented 4 years ago

@jrosen48 your code works for me, too. I'm developing unit tests to check that the additional arguments are also being correctly evaluated.

cjvanlissa commented 4 years ago

We can make some changes to how numbers are rounded, as they're presently quite different between the two packages.

Actually, I think every print function correctly rounds the numbers. The raw data themselves are (of course) not rounded. Your vignette for comparing Mclust and Mplus shows different decimals, because you're using as.data.frame() on the tibble returned by get_fit(). This triggers the print.data.frame() function instead of print.tibble().

I recommend that, for this vignette, you print using knitr::kable() or even DT::datatable().

cjvanlissa commented 4 years ago

PS: I'm closing this thread because it is very lengthy. Let's reopen new, more specific issues for benchmarking, with one issue per thread!

data-edu / tidyLPA

Issues from benchmarking mclust and MPlus #123