Using betta to test differences the delta estimates of times-points

mestaki commented 4 years ago

Hi breakway team,

I have a question regarding extending the use of betta

example data: 2 groups of patients (healthy, disease), that have been sampled at 2 time points (1 and 2) I am wondering if its possible to:

a) test whether a change in richness has occurred in each group separately compared to 0. So, delta richness=richness at time point 2 - richness at time point 1; and compare each groups' average to 0 a.k.a no change.

b) Whether the magnitude of change is different between groups, so again using delta richness values but this time compare across groups instead of zero.

Currently I just use the estimates and run tests on them without accounting for their errors.

Is there a sensible way of doing this using betta so that the errors are utilized?

Thanks again for this awesome package!

adw96 commented 4 years ago

Oh the embarassment of not seeing this until now.... So sorry, Bod! @mestaki

Yes, this is absolutely possible and actually pretty easy. Here's an example of how you can do it

library(tidyverse)
library(breakaway)
set.seed(1)
example_richness <- rnorm(20, mean = 1000, sd = 100)
example_ses <- rnorm(20, mean = 100, sd = 15)
tb <- tibble("cs" = example_richness, 
       "ses" = example_ses, 
       "time" = rep(c(0,1), each = 10), 
       "disease" = rep(c(0,1), 10))

betta(chats = example_richness,
      ses = example_ses,
      X = model.matrix(~time*disease, data = tb))$table

Let me know if you have any questions! and sorry again for the delay eeeeek! (The summer was a total wash for software development)

mestaki commented 4 years ago

Hi @adw96, No worries at all, appreciate you getting around to it eventually though! The dataset that made me ask this question is already published now, but I am still very curious about this, so allow me to expand, and I apologize in advance if this is more related to understanding linear models in general vs betta specifically.

So here's a case example building from yours:

library(tidyverse)
set.seed(10)
example_richness <- c(rnorm(8, mean = 100, sd = 40), #healthy, week1
                      rnorm(8, mean = 100, sd = 40), #disease, week1
                      rnorm(8, mean = 150, sd = 40), #healthy, week2 (increased)
                      rnorm(8, mean = 100, sd = 40)) #disease, week2 (no increase)
example_ses <- c(rnorm(8, mean = 100, sd = 15), #all the same
                 rnorm(8, mean = 100, sd = 15),
                 rnorm(8, mean = 100, sd = 15),
                 rnorm(8, mean = 100, sd = 15))
tb <- tibble("subject"=rep(letters[seq(1:16)], 2),
             "cs" = example_richness, 
             "ses" = example_ses, 
             "time" = rep(c("week1","week2"), each = 16), 
             "disease" = rep(c("healthy", "disease"), each = 8, len = 32))
tb %>% head

# A tibble: 6 x 5
  subject    cs   ses time  disease
  <chr>   <dbl> <dbl> <chr> <chr>  
1 a       143.   75.9 week1 healthy
2 b       149.  113.  week1 healthy
3 c       129.  102.  week1 healthy
4 d        80.8 118.  week1 healthy
5 e       123.   88.6 week1 healthy
6 f        50.1 106.  week1 healthy

So in a typical lm model what you suggested would be equal to:

(mod1 <- lm(cs~time*disease, data=tb) %>% anova)

Analysis of Variance Table

Response: cs
             Df Sum Sq Mean Sq F value    Pr(>F)    
time          1  12853 12852.9  8.8727 0.0059208 ** 
disease       1  21160 21160.2 14.6075 0.0006759 ***
time:disease  1   8343  8342.7  5.7592 0.0232980 *  
Residuals    28  40560  1448.6                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Which takes into account both the effect of treatment, time, and their interactions. But we could also frame the question in another way by looking only at the difference in cs between time points:

tb2 <- tb %>% 
  pivot_wider(., id_cols=c(subject,disease), names_from=time, values_from=c("cs","ses")) %>% 
  mutate(delta_cs = cs_week2-cs_week1) #calculating difference between week 2 and week 1

tb2 %>% head

# A tibble: 6 x 7
  subject disease cs_week1 cs_week2 ses_week1 ses_week2 delta_cs
  <chr>   <chr>      <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
1 a       healthy    143.      174.      75.9      93.0    31.1 
2 b       healthy    149.      141.     113.       95.3    -7.55
3 c       healthy    129.      179.     102.      114.     49.1 
4 d       healthy     80.8     179.     118.      101.     97.9 
5 e       healthy    123.      168.      88.6     116.     45.1 
6 f       healthy     50.1     156.     106.      111.    106.  

(mod2 <- lm(delta_cs ~ disease, data=tb2) %>% anova)

Analysis of Variance Table

Response: delta_cs
          Df Sum Sq Mean Sq F value  Pr(>F)  
disease    1  16685 16685.4  4.0947 0.06254 .
Residuals 14  57049  4074.9                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now, I get that paired tests do actually work very similarly to this in that the difference in time is calculated but do they answer the same question exactly? Because I see the interpretations of results between these two approaches could almost be opposite, right? In mod1 we would interpret that there was an effect of time, disease, and their interactions. But in mod2 we would conclude that there was no change in cs between groups across time.

So, going back to betta, it is easy enough to implement mod1 -as you mentioned- with :

(betta1 <- breakaway::betta(chats = example_richness,
      ses = example_ses,
      X = model.matrix(~time*disease, data = tb))$table)

But, if I wanted to go the mod2 approach, is there a way to get a delta_ses? I imagine just subtracting ses_time2 from ses_time1 wouldn't really make sense, right? This is where my question stems from, hope this makes sense!

adw96 commented 3 years ago

Hi Bod! Good question! In general, for two random variables X1 and X2, the variance in (X1-X2) is Var(X1) + Var(X2) - 2 * Cov(X1, X2) In the case where you are taking the difference between two estimators, you can get reasonable estimates of VarX1 and VarX2, but the challenge is in estimating the covariance. Unfortunately this calculation is going to be complicated in most scenarios, because if your data is autocorrelated then it's likely to induce correlation in the estimators. Since here your estimators of species richness take a complicated form, this is going to be hard to do!

If you really want to go about this, you could make the assumption that the covariance in the estimators is positively correlated (which I think is very likely), and then you can construct an upper bound on the variance. Depending on precisely what claims you want to make, I would think you could have a hypothesis testing procedure that controls Type 1 error, even if it is conservative.

I'm sure you already know about this, but I usually recommend betta_random for folks who have longitudinal data.

I hope that helps! I'm going to close this issue since it is not a breakaway, but will do my best to answer more questions if you have them.

Cheers,

Amy

mestaki commented 3 years ago

Thanks for taking the time to explain this, even though it was mainly outside of betta's domain. I think I understand your approach, and as you said that does sound a lot more complicated than its worth, given that a much simpler approach is right in front of us. Also, I tried re-running the code I posted with a few other random seeds and it seems like I may have just gotten unlucky with that particular random subset, because in almost all the other variations the results of the two models perfectly agree! So I feel better about that now too. No further questions for now :)

adw96 / breakaway

Using betta to test differences the delta estimates of times-points #90