Developing "coupling graphs"

andkov commented 7 years ago

@ampiccinin @knighttime Objective:

andkov commented 7 years ago

See the script ./sandbox/coupling-graphs/coupling-graphs-1.R for the implementation of the graphs.

Graph 1

The graph show 48 random individuals (males) from MAP study. Red indicates process A (physical), black represents process B (cognitive). If a person has less then two data points no lines are drawn.

on more similar scale than original metric

not really sure what this means. Could you please elaborate?

Graph 2

The graph shows a random 48 ids from MAP study.

would it make sense to connect the dots?

here's a version of this graph with connected dots.

You can adjust or further develop graph in the ./sandbox/coupling-graphs/coupling-graphs-1.R script. I have prepared the functions to extract the data. Let me know if you have any questions!

ampiccinin commented 7 years ago

@andkov - wow! That was fast! Thanks!

Graph 1 is exactly what I meant. By more similar metric I meant that both would hover around zero, whereas their original raw metrics might be 0-44 versus 300-600.

Too bad they don't all look like person 198!!

I'm surprised that cognitive is black. I would have expected the physical measurements to be more variable.

For Graph 2, I wondered about a single summary graph that contained all individuals' data. That was why I suggested joining points within person - to have a sense of which points went together. Looking at person 198 again, though, shouldn't all their points fall between -1 and 1?

andkov commented 7 years ago

By more similar metric I meant that both would hover around zero, whereas their original raw metrics might be 0-44 versus 300-600.

ah. of course. this explains why the black (cognitive, in this case boston story immediate) is all over the place, while red (physical, in this case fev) is relatively stable. Here's the descriptives for the observed measures:

> summary(ds_wide$observed_A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.420   1.720   2.145   2.134   2.560   3.930     749 
> summary(ds_wide$observed_B)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   8.000  10.000   9.609  11.000  12.000     395

Redisualizing helps reducing the span, but it still remain substantially different:

> summary(ds_wide$residual_A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-1.0970 -0.0893  0.0033 -0.0001  0.1033  0.6330     749 
> summary(ds_wide$residual_B)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-7.3840 -0.9522  0.0680 -0.0001  0.9649  3.6590     395

And of course with other measures, this would be even greater. I turn these residual scores into z-scores, which should help this issue

ampiccinin commented 7 years ago

Thanks @andkov - I'm not sure we want z-scores. Our model is not standardized, and seeing how much a variable jumps around should be informative. When I said "more similar metric" I was indicating that it would be easy to plot with the same axis (both would have scores around 0).

My surprise about the cognitive residuals seeming larger than the physical was based on my memory of the observed raw trajectories. My sense had been that the physical measures jumped around quite a bit.

Are you able to resolve the apparent inconsistencies between Graphs 1 & 2 for person 198? Is it just my misinterpretation that is confusing me? Thanks!!

ampiccinin commented 7 years ago

@andkov ...are you saying that the physical looks like smaller variability because the original range is wider?

andkov commented 7 years ago

a slight diversion.

I have been troubled by such patterns as 153 (see graph 1), so I started looking at the individual level data. Here's the data for this person:

> ds_wide %>% 
+   dplyr::filter(id == 153) %>% 
+   dplyr::select(id,observed_A, observed_B, predicted_A, predicted_B, residual_A, residual_B)
   id observed_A observed_B predicted_A predicted_B residual_A residual_B
1 153         NA          8          NA    9.556000         NA -1.1525397
2 153         NA         10          NA    9.466136         NA  0.3954367
3 153         NA          8          NA    9.370914         NA -1.0154452
4 153         NA         12          NA    9.282460         NA  2.0129001
5 153       2.29          9    2.255416    9.184888  0.2008373 -0.1369478

Even though there is only one observation in A, the model still estimates. I wonder if the peculiar patterns we see in correlations may be due to the fact that many factor scores are coming from cases like these. I've looked at this study (MAP) to get these counts:

count_miss <- ds_wide %>% 
  dplyr::group_by(id) %>% 
  dplyr::mutate(
    n_nonmiss_a = length(observed_A) - sum(is.na(observed_A)),
    n_nonmiss_b = length(observed_B) - sum(is.na(observed_B))
  ) %>% 
  dplyr::ungroup() %>% 
  dplyr::distinct(id,n_nonmiss_a, n_nonmiss_b) 

count_miss %>% 
  dplyr::group_by(n_nonmiss_a) %>% 
  dplyr::count() %>% 
  dplyr::mutate(
    cumulative = cumsum(n)
  )
# A tibble: 6 × 3
  n_nonmiss_a     n cumulative
        <int> <int>      <int>
1           0    26         26
2           1    74        100
3           2    54        154
4           3    53        207
5           4    55        262
6           5    59        321

as you see out of 321 individuals, there were 26 that had no observation on process A and 100 that had up to 0 or 1 observations. For process B (cogntive) the story is similar:

> count_miss %>% 
+   dplyr::group_by(n_nonmiss_b) %>% 
+   dplyr::count() %>% 
+   dplyr::mutate(
+     cumulative = cumsum(n)
+   )
# A tibble: 5 × 3
  n_nonmiss_b     n cumulative
        <int> <int>      <int>
1           1    42         42
2           2    39         81
3           3    34        115
4           4    42        157
5           5   164        321

In fact, there are even cased in which we have 0 observed point on A, only 1 observed point in B and we still get an estimate for the factor score!

> count_miss %>% 
+   dplyr::group_by(n_nonmiss_a, n_nonmiss_b) %>% 
+   dplyr::count() %>% 
+   print(n=100)
Source: local data frame [21 x 3]
Groups: n_nonmiss_a [?]

   n_nonmiss_a n_nonmiss_b     n
         <int>       <int> <int>
1            0           1     9
2            0           2     6
3            0           3     5
4            0           4     3
5            0           5     3
6            1           1    33
7            1           2    19
8            1           3     4
9            1           4     3
10           1           5    15
11           2           2    14
12           2           3    14
13           2           4    10
14           2           5    16
15           3           3    11
16           3           4    16
17           3           5    26
18           4           4     9
19           4           5    46
20           5           4     1
21           5           5    58

Which is, I guess, because of the FIML option? To clarify, i'm using the observation THAT HAS BEEN USED BY MPLUS during the estimation, recovering it from the output.

ampiccinin commented 7 years ago

@andkov - damn. I'm sure there are people who would be fine with this. I am not one of them. Thanks for checking it!!! In theory (at least with MAR), the S-S correlation should be the same with and without these people who have data on only one or the other variable. Might be worth checking, though. What happens if you restrict to only folk who have at least 2 of each variable?

andkov commented 7 years ago

@ampiccinin , I can certainly do that. It will bring down the sample size, but not catastrophically.

andkov commented 7 years ago

Looking at person 198 again, though, shouldn't all their points fall between -1 and 1?

I wonder if it's just hard seeing it on this graph. Here's the data for person 198. There are almost between -1 and 1. Does that resolve your the inconsistency you thought there was in this case?

> ds_wide %>% 
+   dplyr::filter(id == 198) %>% 
+   dplyr::select(id,observed_A, observed_B, predicted_A, predicted_B, residual_A, residual_B)
   id observed_A observed_B predicted_A predicted_B residual_A residual_B
1 198      2.480         12    2.319000    11.09000   0.161000   0.910000
2 198      2.420         12    2.237832    11.01248   0.182168   0.987520
3 198      1.550         11    2.147230    10.92595  -0.597230   0.074050
4 198      2.285         12    2.060188    10.84282   0.224812   1.157180
5 198         NA         11          NA    10.75417         NA   0.245835

andkov commented 7 years ago

...are you saying that the physical looks like smaller variability because the original range is wider?

yes, exactly.

I'm not sure we want z-scores. Our model is not standardized, and seeing how much a variable jumps around should be informative. When I said "more similar metric" I was indicating that it would be easy to plot with the same axis (both would have scores around 0).

IMHO, i don't see a problem with linear transformation of the residuals (dividing by respective sd) for the sake of residual diagnostics. No to do any computation on them, but just adjust to VIEW them better. I agree, the fluctuation on the raw scale should be informative, but if our objective is simply to visually compare residuals, it seems that such standardization is permissible.

andkov commented 7 years ago

@ampiccinin , here are the modifications based on your comments:

Graph 1

Here, residual scores have been divided by respective sd of the process

Graph 2

Here, all individuals' residuals have been placed on the same XY canvass. Drawing the lines throw the points is, of course, possible, but i'm afraid not practical. After 10+ individuals it becomes a mess. Here's the same 48 persons, with color mapped onto id.

A more illuminating display would be, IMHO, to get rid of the lines, and map the value of a covariate onto the color, like we did in enhanced anatomy plots (e.g.). Here's a sketch of such a draft using all individuals and mapping the binary SMOKE covariate:

andkov commented 7 years ago

@ampiccinin

That was fast!

The graphing itself is relatively easy, once the data is extracted and shaped to the needed form. It's the data wrangling that takes 80-90% of the time in a normal graphing project. In this case, however, I've spent so much time learning these structures and developing custom functions to process them (and failing most of the times) that it became possible to turn it around so fast.

Portland is metaphor for the industrial age coming to data science: massive initial investment, but cheap, fast, and reliable production once the machinery is set up and calibrated (and engineers are trained!). You are starting to see that initial investment (giving me time and bringing @wibeasley along) paying off. I wish it were faster than that, but it seems that none of us had a clear idea of the magnitude of the engineering challenge the coordinated analysis presents on the scale you've envisioned. Well, retrospectively, I'd glad I was overly optimistic: not sure i'd have a heart to jump into something like that with a full disclosure, given the time lag in publications that it brings. :)

The bottom line: here's to your patience and my perseverance! 👍

ampiccinin commented 7 years ago

@andkov :)

you are absolutely right -

198 is fine. My confusion was due to failing eyesight...I thought the -0.6 was -6.0!! Might make sense to force the same axis for x & y?
the lines in graph 2 are a mess.

...Presumably the example we are looking at does not show (statistically significant) coupling (right?) and that we would see something more of an elliptical cloud if there were. I find it interesting that there seem to be cases with a positive association, and others with negative (and a bunch with no association or [mainly] not enough data on both variables). I suspect we would be going down a rabbit hole to chase whether there was something different about the + and - coupling people, but maybe when the Portland papers are submitted, if we are bored and looking for something to do, we could play around with this (with a strict time limit so we don't end up in too long a tea party with the Mad Hatter).

ampiccinin commented 7 years ago

@knighttime @jru7sh - what do you think of the latest "Graph2" to represent the coupling?

the only thing that bothers me ( @andkov ) is having multiple points per person in a single graph that does not differentiate among people...but I can live with this.

Can we/would it be useful to - use this for Jamie's BSIT project? Is it easy enough to modify the syntax for her variables?

andkov commented 7 years ago

Might make sense to force the same axis for x & y?

yes. good point. will adopt in the next iteration.

Is it easy enough to modify the syntax for her variables?

yes, in terms of graphing, these functions are transferable (i'm assuming @knighttime is dealing with bivariate growth models). However, if there are multiple studies + multiple models it might take some time to set up the production line (i'm talking 2-3 hours on my part + some time by the analyst to learn the set up)

andkov commented 7 years ago

@ampiccinin

Presumably the example we are looking at does not show (statistically significant) coupling (right?) and that we would see something more of an elliptical cloud if there were.

If my understanding of "coupling" is correct (association between the residuals of two processes), then that's exactly my interpretation of these patterns ( latest version of graph 2)

ampiccinin commented 7 years ago

@andkov : For now, @knighttime 's purpose is plots for a single study.

I hope we are then talking 2-3 minutes, rather than hours...

ampiccinin commented 7 years ago

@andkov - we could plot a significant association, such as OCTO female pulmonary block or mmse.

...but don't let me take you too far away from what you were otherwise planning to do today!! this can wait.

andkov commented 7 years ago

@ampiccinin

I hope we are then talking 2-3 minutes, rather than hours...

That's not an unreasonable estimate. However, it all boils down to the options @knighttime used for modeling, whether it permits easy extraction. It's easy to remedy, but 2-3 minutes is the estimate if everything is set up as needed.

knighttime commented 7 years ago

@andkov Thanks Andrey. This looks great!

My coupling model is using MAP.

I understand the code for the graph but it is pulling the model info from the helper functions - those I can't figure out.

Just looking at the code for the graph though, it looks like I can just substitute my own residual for cognition for "residual A" and residual for BSIT for "residual B" then it should work. is there a shorthand code to get that? resid(ds$cogn_ep) maybe? geom_line(aes(y=residual_A), color="red")+ `

I will try it out tomorrow and see how it goes

ampiccinin commented 7 years ago

@knighttime @andkov - possibly important detail: Jamie's models are in R. Presumably she can save model implied residuals and use them in your syntax that grabs them from Mplus? (The fact that you are running the models in different software just struck me now)

andkov commented 7 years ago

@ampiccinin @knighttime Yes, different model objects (R vs Mplus) require different plumbing. But that shouldn't affect the graphing function.

@knighttime, I think you are on the right track. But I doubt it would be as simple as that because the model shape is different (is it?). Best strategy is to study the data structure that goes into the graph and try to replicate that. Let me know if you get stuck.

I'm pulling the models from Mplus outputs, you can disregard what happens before graphing

IALSA / IALSA-2015-Portland