IALSA / ialsa-2015-portland-stencil

Applying lessons from the Feb 2015 Portland workshop to a newer architecture
GNU General Public License v2.0

developing Syntax Creator #16

Open andkov opened 8 years ago

andkov commented 8 years ago

Objective flow: a template Mplus script is fed to R, which produces modified Mplus scripts designed for optimized parsing. The parser will reference this Syntax Creator to get instructions on what to look for in the text output.

wibeasley commented 8 years ago

@andkov, I don't think it solves some of the issues we discussed yesterday, but it may help to use MplusAutomation::prepareMplusData anyway

library(MplusAutomation)
prepareMplusData(mtcars, "test02.dat", keepCols = c("mpg", "hp"))

If you run this snippet, make sure you eventually delete the "test02.dat" file written to the working directory.

andkov commented 8 years ago

We need to replace

Names are
 %names_are%
    ;

by the content of the variable_names.txt file:

fu_year
age_at_visit
cts_bname

so that the Mplus output reads

Names are
fu_year
age_at_visit
cts_bname
    ;

and not

Names are
fu_year  age_at_visit  cts_bname
    ;

I am trying to accomplish this by reading in the content of the text file and replacing the catcher:

      names_are <- read.csv(pathVarnames,header = F, stringsAsFactors = F)[ ,1]
      proto_input <- gsub(pattern = "%names_are%", replacement = names_are, x = proto_input)

(lines 54-55 of ./sandbox/syntax-creator/functions_to_generate_Mplus_scripts.R)

However, only the first line of the file is placed where the catcher was:

VARIABLE:
Names are
 fu_year
    ;

@wibeasley, it seems like an easy fix, but I'm blanking on it. Could you please take a look?
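A likely cause is that gsub() uses only the first element of a multi-element replacement and warns about the rest. The sketch below is one possible fix, not necessarily what ended up in the repository: it collapses the names into a single newline-separated string before substituting. The names_are and proto_input values are hypothetical stand-ins for what the real script reads from variable_names.txt and from the template file.

```r
# Hypothetical stand-ins: the real script reads these from variable_names.txt
# and from the template .inp file.
names_are   <- c("fu_year", "age_at_visit", "cts_bname")
proto_input <- c("VARIABLE:", "Names are", " %names_are%", "    ;")

# gsub() uses only the first element of a multi-element `replacement`
# (with a warning), so collapse the names into one string first.
names_block <- paste(names_are, collapse = "\n")

# fixed = TRUE treats "%names_are%" as a literal string, not a regex.
proto_output <- gsub("%names_are%", names_block, proto_input, fixed = TRUE)

# Each name now lands on its own line of the generated script.
cat(proto_output, sep = "\n")
```

The space that precedes the catcher in the template is carried over in front of the first name; remove it from the template if the output should be flush left.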

andkov commented 8 years ago

I would like to compute a new variable in a long data frame with respect to time.

df <- data.frame(id=c(111,111,111,222,222,333),
                 wave=c(0,1,2,0,1,0),
                 age_at_visit=c(10,11,12,10,12,10))
> df
   id wave age_at_visit
1 111    0           10
2 111    1           11
3 111    2           12
4 222    0           10
5 222    1           12
6 333    0           10

I need to compute a new variable, time, that measures the amount of time that has passed since the last visit. The results should look like

> new_df <- cbind(df, time=c(0,1,2,0,2,0))
> new_df
   id wave age_at_visit time
1 111    0           10    0
2 111    1           11    1
3 111    2           12    2
4 222    0           10    0
5 222    1           12    2
6 333    0           10    0

I thought that there was an elegant solution to this type of task that does not involve for loops. @wibeasley, could you please take a look?

wibeasley commented 8 years ago

How's this, @andkov? Some might argue that the first row of each id should be NA, not zero. Tell me if it needs to be zero, and we can wrap something in an ifelse block with dplyr::row_number()==1

library(magrittr)

ds <- data.frame(
  id           = c(111,111,111,222,222,333),
  wave         = c(0,1,2,0,1,0),
  age_at_visit = c(10,11,12,10,12,10)
  )
ds <- ds %>% 
  dplyr::group_by(id) %>% 
  dplyr::arrange(wave) %>% 
  dplyr::mutate(
    time_difference = (age_at_visit - lag(age_at_visit))
  ) %>% 
  dplyr::ungroup()

wibeasley commented 8 years ago

@andkov, here's a quick edit that uses dplyr::lag() instead of stats::lag(). It's more natural in these situations.

> library(magrittr)

> ds <- data.frame(
+   id           = c(111,111,111,222,222,333),
+   wave         = c(0,1,2,0,1,0),
+   age_at_visit = c(10,11,12,10,12,10)
+ )

> ds <- ds %>% 
+   dplyr::group_by(id) %>% 
+   dplyr::arrange(wave) %>% 
+   dplyr::mutate(
+     time_difference = (age_at_visit - dplyr::lag(age_at_visit, 1))
+   ) %>% 
+   dplyr::ungroup()

Source: local data frame [6 x 4]

     id  wave age_at_visit time_difference
  (dbl) (dbl)        (dbl)           (dbl)
1   111     0           10              NA
2   111     1           11               1
3   111     2           12               1
4   222     0           10              NA
5   222     1           12               2
6   333     0           10              NA

andkov commented 8 years ago

This is perfect, thanks @wibeasley. Thanks for clarifying the dplyr::lag function. But yes, it has to be a zero, not an NA.

wibeasley commented 8 years ago

ds <- ds %>% 
  dplyr::group_by(id) %>% 
  dplyr::arrange(wave) %>% 
  dplyr::mutate(
    time_difference = ifelse(
      seq_len(length(id))==1L,                    # Examine the row order within the subject id.
      0,                                          # Return zero for subject's first row.
      age_at_visit - dplyr::lag(age_at_visit, 1)  # Otherwise return the difference.
    )
  ) %>% 
  dplyr::ungroup()

output:

Source: local data frame [6 x 4]

     id  wave age_at_visit time_difference
  (dbl) (dbl)        (dbl)           (dbl)
1   111     0           10               0
2   111     1           11               1
3   111     2           12               1
4   222     0           10               0
5   222     1           12               2
6   333     0           10               0

andkov commented 8 years ago

@wibeasley, thanks a lot! This really helps, and I'm glad I can improve my dplyr syntax, although in retrospect it seems quite intuitive.
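The lag-based code above returns the gap since the previous visit (0, 1, 1 for id 111), whereas the expected output in the original question (0, 1, 2 for id 111) corresponds to the time elapsed since the first visit. If that second quantity is the one needed, a minimal base R sketch is shown below. It is an alternative illustration using ave(), not the approach adopted in the thread.

```r
df <- data.frame(
  id           = c(111, 111, 111, 222, 222, 333),
  wave         = c(0, 1, 2, 0, 1, 0),
  age_at_visit = c(10, 11, 12, 10, 12, 10)
)

# Sort so that the first row within each id is the earliest wave.
df <- df[order(df$id, df$wave), ]

# Within each id, subtract the age at the first visit from every age.
df$time <- ave(df$age_at_visit, df$id, FUN = function(x) x - x[1])

print(df)
```

The first row of every id is zero by construction, so no ifelse() guard is needed for this version.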

andkov commented 8 years ago

It seems I've reached a bottleneck. In the MODEL statement, when I'm specifying the random intercept and slope of the two processes:

MODEL:
    !first-level equation
    ! process A
ia sa | %process_a_timepoints% AT %estimated_timepoints%;
    ! process B
ib sb | %process_b_timepoints% AT %estimated_timepoints%;

the elements past the | seem to be required to be on the same line. Syntax that splits them across multiple lines:

MODEL:
    !first-level equation
    ! process A
ia sa | a1
a2
a3
a4
a5 AT time1
time2
time3
time4
time5;

cannot be processed by Mplus. However, a line like this can:

MODEL:
    !first-level equation
    ! process A
ia sa | a1 a2 a3 a4 a5 AT time1 time2 time3  time4 time5;

We need either a different way to express the model specification or some other workaround. @ampiccinin, @annierobi, @GracielaMuniz, can you suggest a way to reformulate the script so that the line

ia sa | a1 a2 a3 a4 a5 AT time1 time2 time3  time4 time5;

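can be distributed over individual lines? Here's the complete output file (https://github.com/IALSA/ialsa-2015-portland-stencil/blob/master/sandbox/syntax-creator/outputs/grip-numbercomp/male_5.out) for context. Anticipating the obvious solution: no, I can't use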

ia sa | a1-a5 AT time1-time5;

because the script needs to be explicit with respect to waves, so that non-consecutive wave sequences such as a1 a3 a6 are possible too.

wibeasley commented 8 years ago

@andkov, I think we can get it on one Mplus line. Does this work for your needs? If I understand correctly, only the first line of code needs to be dynamic. The other three respond to whatever waves are passed.

waves <- c(1, 3, 6)

a <- paste(paste0("a", waves), collapse=" ")
times <-  paste(paste0("time", waves), collapse=" ")
paste("ia sa |", a, "AT", times, collapse=" ")

Output:

[1] "ia sa | a1 a3 a6 AT time1 time3 time6"

If the first line is changed to waves <- 1:5, the output is

[1] "ia sa | a1 a2 a3 a4 a5 AT time1 time2 time3 time4 time5"

andkov commented 8 years ago

@wibeasley, yes. I forgot another caveat: the line cannot be longer than 90 characters. So when we have 16 waves, I'm not sure how to represent that many without resorting to the time1-time16 expression.

GracielaMuniz commented 8 years ago

Have you tried using a1-a5 AT time1-time5 ? G

annierobi commented 8 years ago

I have put it on separate lines before in Mplus and it worked; maybe I was using fixed time scores and that's why it worked. Usually you should be able to spread a command over multiple lines. Have you tried:

ia sa | a1 a2 a3 a4 a5 AT
time1 time2 time3 time4 time5;

I can't try it on my computer now as I have another model running but can try later.

Annie

wibeasley commented 8 years ago

Thanks @annierobi, that experience is helpful.

@andkov, my interpretation of the following links is consistent with @annierobi's advice

http://www.inside-r.org/packages/cran/stringr/docs/str_wrap
http://www.statmodel.com/discussion/messages/9/9765.html?1409244035
http://www.ats.ucla.edu/stat/mplus/faq/faqs.htm (see "Can a command span over more than one line?")

requireNamespace("stringr")

waves <- c(1:13, 16)

a <- paste(paste0("a", waves), collapse=" ")
times <-  paste(paste0("time", waves), collapse=" ")
model_long <- paste("ia sa |", a, "AT", times, ";", collapse=" ")
model_wrap <- stringr::str_wrap(
  string = model_long,   # The argument is named `string`; `str` only works through partial matching.
  width  = 35,           # Probably increase this to 85 in the real code
  exdent = 4
)

The string looks like this:

> model_wrap
[1] "ia sa | a1 a2 a3 a4 a5 a6 a7 a8 a9\n    a10 a11 a12 a13 a16 AT time1 time2\n    time3 time4 time5 time6 time7 time8\n    time9 time10 time11 time12 time13\n    time16 ;"

While the rendered text in the .inp file should look like this:

> cat(model_wrap)
ia sa | a1 a2 a3 a4 a5 a6 a7 a8 a9
    a10 a11 a12 a13 a16 AT time1 time2
    time3 time4 time5 time6 time7 time8
    time9 time10 time11 time12 time13
    time16 ;

If you wanted to make it a little more readable/debuggable, put the pipe and "AT" on their own line. This call to gsub() requires several escape sequences.

> model_wrap_pretty <- gsub("(AT|\\|)", "\n    \\1\n   ", model_wrap)
> cat(model_wrap_pretty)
ia sa 
    |
    a1 a2 a3 a4 a5 a6 a7 a8 a9
    a10 a11 a12 a13 a16 
    AT
    time1 time2
    time3 time4 time5 time6 time7 time8
    time9 time10 time11 time12 time13
    time16 ;
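
The 90-character limit mentioned earlier in the thread can be checked in the generating script itself. The sketch below is an illustration only: it uses base R's strwrap() as a stand-in for stringr::str_wrap(), and max_width is an assumed name for the limit quoted in the thread. It says nothing about whether Mplus accepts the wrapped statement.

```r
waves <- c(1:13, 16)

a          <- paste(paste0("a", waves), collapse = " ")
times      <- paste(paste0("time", waves), collapse = " ")
model_long <- paste("ia sa |", a, "AT", times, ";")

max_width <- 90L  # assumed Mplus line limit, as quoted in the thread

# strwrap() returns one element per output line;
# continuation lines are indented by `exdent` spaces.
model_lines <- strwrap(model_long, width = max_width, exdent = 4)

# Fail early if any generated line would exceed the limit.
stopifnot(all(nchar(model_lines) <= max_width))

model_wrap <- paste(model_lines, collapse = "\n")
cat(model_wrap, "\n")
```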

andkov commented 8 years ago

Thanks, @annierobi, when I express the model in the way you suggest

MODEL:
    !first-level equation
    ! process A
ia sa | a1 a2 a3 a4 a5 AT 
time1 time2 time3 time4 time5;

it returns the error

*** ERROR in MODEL command
  The number of fixed time scores is not sufficient for model identification
  in the following growth process:   IA SA

If it worked in your configuration, I wonder if that is because I explicitly assign timepoints in the define statement?

DEFINE:
    ! assign variables to the process p
a1=gripavg_1;
a2=gripavg_2;
a3=gripavg_3;
a4=gripavg_4;
a5=gripavg_5;
    !assign variables to the process c
b1=cts_nccrtd_1;
b2=cts_nccrtd_2;
b3=cts_nccrtd_3;
b4=cts_nccrtd_4;
b5=cts_nccrtd_5;
    !assign variables to time points
time1=time_since_bl_1;
time2=time_since_bl_2;
time3=time_since_bl_3;
time4=time_since_bl_4;
time5=time_since_bl_5;

andkov commented 8 years ago

@wibeasley , the wrapper you've developed

requireNamespace("stringr")

waves <- c(1:13, 16)

a <- paste(paste0("a", waves), collapse=" ")
times <-  paste(paste0("time", waves), collapse=" ")
model_long <- paste("ia sa |", a, "AT", times, ";", collapse=" ")
model_wrap <- stringr::str_wrap(
  string = model_long,   # The argument is named `string`; `str` only works through partial matching.
  width  = 35,           # Probably increase this to 85 in the real code
  exdent = 4
)

is going to be very useful and opens up nice opportunities for me to keep the script easier to examine for a human.

I have had no trouble splitting commands across lines before. However, this specific expression

ia sa | a1 a2 a3 a4 a5 AT time1 time2 time3 time4 time5;

cannot be broken in any way, because Mplus then produces this error:

*** ERROR in MODEL command
  The number of fixed time scores is not sufficient for model identification
  in the following growth process:   IA SA

If we want to stay consistent in passing an explicit list of modelled waves instead of writing time1-time5, then this limitation is an obstacle: with many waves the statement will run out of space on the line. So far, I haven't found an example of this type of statement spread over multiple lines.

annierobi commented 8 years ago

I would email Mplus for support at support@statmodel.com. As mentioned, I know it works with fixed time points but it doesn't seem to work with random times. The Muthens specifically say that you can use multiple lines but I did come across someone who said it didn't work with random times on the message board. Linda's response was to email support. See below.

When I use this syntax:

USEVARIABLES ARE contsym1 contsym2 contsym3 contsym4 contsym5 aget1 aget2 aget3 aget4 aget5;
TSCORES ARE aget1 aget2 aget3 aget4 aget5 ;
ANALYSIS: type = random missing ;
MODEL: i s| contsym1- contsym5 AT aget1 - aget5;

where contsym1 - contsym5 are continuous variables and aget1-aget5 are the individual ages for the people in the sample - this seems to work. However, this syntax:

USEVARIABLES ARE contsym1 contsym2 contsym3 contsym4 contsym5 aget1 aget2 aget3 aget4 aget5;
TSCORES ARE aget1 aget2 aget3 aget4 aget5 ;
ANALYSIS: type = random missing ;
MODEL: i s | contsym1 contsym2 contsym3 contsym4 contsym5 at aget1 aget2 aget3 aget4 aget5;
OUTPUT: sampstat;

Generates this error:

*** ERROR in Model command
  The number of fixed time scores is not sufficient for model identification
  in the following growth process: I S

I thought that those two syntaxes were identical and don't know what is going wrong.

I am also having trouble installing the patch to upgrade to version 3.12. I wasn't sure if I had the base + combination add-on, but if I try the others, I get an error that I have an invalid Mplus.exe in my path environment. When I try the combination, it starts to go through installing, but then says I need to have Mplus Base 3.0 +Combination installed.

Thanks for the help,

Jennie

Linda K. Muthen (support@statmodel.com) posted on Saturday, June 18, 2005

andkov commented 8 years ago

Thanks, @annierobi. This is progress. It might be the limitation we'll have to work with. I thought Linda was support. The date of June 18, 2005 isn't very promising either. To be specific, when you say fixed time points vs. free time points, do you mean the assignment in the DEFINE statements?

    !assign variables to time points
time1=time_since_bl_1;
time2=time_since_bl_2;
time3=time_since_bl_3;
time4=time_since_bl_4;
time5=time_since_bl_5;

That is, this assignment makes the time point random?

annierobi commented 8 years ago

Yes, Linda is support :). Sorry I wasn't very clear about my fixed time. This is what I meant that I tried and that it worked:

MODEL: i s q| y1@0 y2@1 y3@2 y4@3 y5@4 y6@5 y7@6 y8@7 y9@8 y10@9 y11@10;

andkov commented 8 years ago

This might work. The fixed time points can be inserted programmatically. So if this could be spelled out over multiple lines, we have our solution. It might not be too human-friendly, but it's a solution. Thanks, @annierobi!

UPDATE: no, it didn't work. Expressing it like

MODEL:
ia sa | a1@1 a2@2 a3@3 a4@4 a5@5;
ib sb | b1@1 b2@2 b3@3 b4@4 b5@5;

produced the following error

*** ERROR in MODEL command
  One or more time score variables were not used.

And I'm not sure it could work. On the other hand, I just thought of another solution: changing the definitions of the time points in the DEFINE statement, so that consecutively numbered time variables point at the selected waves:

DEFINE:
time1=time_since_bl_1;
time2=time_since_bl_3;
time3=time_since_bl_5;
time4=time_since_bl_7;
time5=time_since_bl_9;
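
The remapping above could be generated from the list of selected waves. The sketch below is a hypothetical illustration. The stems gripavg_ and time_since_bl_ are taken from the DEFINE block earlier in the thread; remapping the outcome variables (a1 to a5) the same way is an assumption, and whether Mplus accepts the resulting script was not confirmed in the thread.

```r
waves <- c(1, 3, 5, 7, 9)   # waves to include, as in the DEFINE block above
k     <- seq_along(waves)   # consecutive positions 1..5

# Map consecutively numbered names onto the selected waves.
# Remapping the a* outcomes is an assumption, not shown in the thread.
define_lines <- c(
  "DEFINE:",
  paste0("a", k, "=gripavg_", waves, ";"),
  paste0("time", k, "=time_since_bl_", waves, ";")
)

# With consecutive names, the range form fits on one short line.
n          <- length(waves)
model_line <- sprintf("ia sa | a1-a%d AT time1-time%d;", n, n)

cat(c(define_lines, "MODEL:", model_line), sep = "\n")
```

The MODEL line keeps the same length regardless of how many waves are selected, which sidesteps the 90-character limit.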

andkov commented 8 years ago

I have made considerable progress on the syntax-creator script elsewhere, in wave-inclusion, when we were thinking of going the Mplus route with that project. Now I will transfer the script developments from wave-inclusion to ./sandbox/syntax-creator