Repeated measures function

LI-Yixuan commented 3 years ago

Repeated measures function implemented: by default, the effect size is computed by baseline subtraction.
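To make "baseline subtraction" concrete, here is a minimal plain-Python sketch of the arithmetic behind the two repeated-measures flavours discussed in this thread. It is an illustration only, not the code in this PR; the helper name `paired_mean_diffs` and the toy numbers are made up for the example.

```python
from statistics import mean

def paired_mean_diffs(columns, mode="baseline"):
    """Mean within-subject difference for each comparison.

    columns: list of equal-length lists, one per timepoint; row i must
             hold the same subject in every list.
    mode:    "baseline"   -> every timepoint minus the first one
             "sequential" -> every timepoint minus the one before it
    """
    if mode not in ("baseline", "sequential"):
        raise ValueError("mode must be 'baseline' or 'sequential'")
    diffs = []
    for k in range(1, len(columns)):
        ref = columns[0] if mode == "baseline" else columns[k - 1]
        # Subtract subject by subject, then average the differences.
        diffs.append(mean([t - r for t, r in zip(columns[k], ref)]))
    return diffs

# Hypothetical toy data: three subjects measured on three days.
day0 = [1.0, 2.0, 3.0]
day1 = [2.0, 3.0, 4.0]
day2 = [4.0, 5.0, 6.0]

print(paired_mean_diffs([day0, day1, day2], "baseline"))    # [1.0, 3.0]
print(paired_mean_diffs([day0, day1, day2], "sequential"))  # [1.0, 2.0]
```

With only two timepoints the two modes give the same single difference.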

josesho commented 3 years ago

Thanks for this.

Before we merge this, a few things to be done:

You've added an is_preceding keyword. Since this is a repeated measures plot, can we change it to repeated_measure instead? This is more informative than is_preceding.
After you've done that, could you attach here a minimal reproducible code example to demonstrate what the expected use case and outcome is?
Then, you will want to add a unit test for this new functionality. Ping back here if you need help!
Finally, we'll need to document this inline, as well as in the official docs. Again if you need any assistance, ping back here!

Looking forward to your additions to this PR!

Joses

josesho commented 3 years ago

Hi @LI-Yixuan ,

could you include a copy-pastable example of the new functionality and how to use it?

eg:

import dabest
dabest.load(iris)
# continue code...

LI-Yixuan commented 3 years ago

Hi! Here is an example:

import numpy as np
import pandas as pd
import dabest
import pylab
from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population

# Create samples
d0 = norm.rvs(loc=3, scale=0.4, size=Ns)
d1 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d2 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d3 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
d4 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
d5 = norm.rvs(loc=3, scale=0.75, size=Ns)
d6 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d7 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d8 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer count; np.repeat rejects floats
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Day0' : d0,     'Day1' : d1,
                     'Day2' : d2,     'Day3' : d3,
                     'Day4' : d4,     'Day5' : d5,
                     'Day6' : d6,     'Day7' : d7,
                    'Day8' : d8,
                     'Gender': gender, 'ID'  : id_col
                    })
#################### test on repeated_measures = "baseline" example
baseline = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
                                       ("Day2", "Day3","Day4"),
                                       ("Day5", "Day6","Day7", "Day8")), repeated_measures = "baseline"
                                     )
baseline.cohens_d.plot(color_col="Gender")
pylab.show()

#################### test on repeated_measures = "sequential" example
sequential = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
                                       ("Day2", "Day3","Day4"),
                                       ("Day5", "Day6","Day7", "Day8")), repeated_measures = "sequential"
                                     )
print(sequential.median_diff)
sequential.median_diff.plot(color_col="Gender")
pylab.show()

[Figure: sequential]

josesho commented 3 years ago

Hi @LI-Yixuan ,

Thanks,

Can you demonstrate the syntax (and plots) in the case where the datasets are not paired?

Joses

adamcc commented 3 years ago

Looking good!

Shouldn't both baseline and sequential use the slopegraph for the observed values?

LI-Yixuan commented 3 years ago

Hi! The 'paired' parameter is used in the same way as the current package. Here is an example for unpaired data:

import numpy as np
import pandas as pd
import dabest
import pylab

from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population

# Create samples
d0 = norm.rvs(loc=3, scale=0.4, size=Ns)
d1 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d2 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d3 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
d4 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
d5 = norm.rvs(loc=3, scale=0.75, size=Ns)
d6 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d7 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d8 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer count; np.repeat rejects floats
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Day0' : d0,     'Day1' : d1,
                     'Day2' : d2,     'Day3' : d3,
                     'Day4' : d4,     'Day5' : d5,
                     'Day6' : d6,     'Day7' : d7,
                    'Day8' : d8,
                     'Gender': gender, 'ID'  : id_col
                    })

#################### test on unpaired example (paired not set)
unpaired = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
                                       ("Day2", "Day3"),("Day4","Day5", "Day6")))
unpaired.cohens_d.plot(color_col="Gender")
pylab.show()

[Figure: unpaired]

And here is an example for paired but not repeated measures:

#################### test on paired example (paired = True)
paired = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
                                       ("Day2", "Day3"),("Day4","Day5")), paired = True)

paired.hedges_g.plot(color_col="Gender")
pylab.show()

[Figure: paired]

LI-Yixuan commented 3 years ago

Looking good!

Shouldn't both baseline and sequential use the slopegraph for the observed values?

Hi Prof, thanks for pointing it out! I will change this in my code :)

josesho commented 3 years ago

Looking good! Shouldn't both baseline and sequential use the slopegraph for the observed values?

@adamcc not for the baseline plots. You'd either have to draw "Day1-Day2", "Day1-Day3", ie a slope graph for each pair, or use a hub-and-spoke type layout, which would make the slope graphs impossible.

adamcc commented 3 years ago

In paired data, the slopes have two functions: they indicate the individual ∆s between timepoints, and they also show which data come from which subjects. This is a critical feature, so even if the group ∆s are calculated as Day2-Day1, Day3-Day1, and so on, I think the slopegraphs should be retained for both flavors of repeated measures plots.
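The two roles of the slopes can be separated in a small sketch: one structure traces each subject across timepoints (what the slopegraph lines would draw), while the group ∆s are still computed against the first timepoint. This is a plain-Python illustration under assumed toy data; `subject_traces` and `baseline_deltas` are hypothetical names, not functions from DABEST.

```python
def subject_traces(ids, columns):
    """Map each subject ID to its polyline [(x, y), ...] across timepoints."""
    return {
        sid: [(x, col[i]) for x, col in enumerate(columns)]
        for i, sid in enumerate(ids)
    }

def baseline_deltas(columns):
    """Per-subject differences of each later timepoint from the first."""
    base = columns[0]
    return [[v - b for v, b in zip(col, base)] for col in columns[1:]]

# Hypothetical toy data: two subjects, three days.
ids = ["fly1", "fly2"]
day1 = [3.0, 2.5]
day2 = [3.5, 2.0]
day3 = [4.0, 3.5]

traces = subject_traces(ids, [day1, day2, day3])
deltas = baseline_deltas([day1, day2, day3])

print(traces["fly1"])  # [(0, 3.0), (1, 3.5), (2, 4.0)]
print(deltas)          # [[0.5, -0.5], [1.0, 1.0]]
```

The traces identify subjects regardless of how the deltas are defined, which is the point being argued here.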

LI-Yixuan commented 3 years ago

Here is the demo code for the 7th commit:

import numpy as np
import pandas as pd
import dabest
import pylab
import matplotlib.pyplot as plt

from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
Ns = 20 # The number of samples taken from each population

c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

females = np.repeat('Female', Ns // 2).tolist()  # integer count; np.repeat rejects floats
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control 1' : c1,     'Test 1' : t1,
                     'Control 2' : c2,     'Test 2' : t2,
                     'Control 3' : c3,     'Test 3' : t3,
                     'Test 4'    : t4,     'Test 5' : t5, 'Test 6' : t6,
                     'Gender'    : gender, 'ID'  : id_col
                    })

example of pair plot

pair = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1"),
                                           ("Control 2", "Test 2"),
                                           ("Control 3", "Test 3"))
                                     , paired = True)

pair.median_diff.plot(color_col="Gender");
pylab.show()

[Figure: paired]

example of sequential repeated measures

sequential = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ), repeated_measures = "sequential")

sequential.mean_diff.plot(color_col="Gender");
pylab.show()

[Figure: sequential]

example of baseline repeated measures

baseline = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ), repeated_measures = "baseline")

baseline.cohens_d.plot(color_col="Gender");
pylab.show()

[Figure: baseline]

example of shared-control plot

shared = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ))

shared.hedges_g.plot(color_col="Gender");
pylab.show()

[Figure: shared-control]

adamcc commented 3 years ago

Looking good. But I'm curious about the shared control (sc) versus the baseline repeated measures (rm). I think the rm ∆ curves should be smaller than the sc curves, but since all the effect sizes are different I can't be sure what is going on. Can you please post a version with just the mean difference in all cases?

LI-Yixuan commented 3 years ago

Looking good. But I'm curious about the shared control (sc) versus the baseline repeated measures (rm). I think the rm ∆ curves should be smaller than the sc curves, but since all the effect sizes are different I can't be sure what is going on. Can you please post a version with just the mean difference in all cases?

Hi Prof, here are the 4 versions of the plot, all using mean_diff:

Example of Paired data

import numpy as np
import pandas as pd
import dabest
import pylab
import matplotlib.pyplot as plt

from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population

# Create samples
c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer count; np.repeat rejects floats
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control 1' : c1,     'Test 1' : t1,
                     'Control 2' : c2,     'Test 2' : t2,
                     'Control 3' : c3,     'Test 3' : t3,
                     'Test 4'    : t4,     'Test 5' : t5, 'Test 6' : t6,
                     'Gender'    : gender, 'ID'  : id_col
                    })

# example of pair plot
pair = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1"),
                                           ("Control 2", "Test 2"),
                                           ("Control 3", "Test 3"))
                                     , paired = True)

pair.mean_diff.plot(color_col="Gender");
pylab.show()

[Figure: paired]

Example of sequential repeated measures

sequential = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ), repeated_measures = "sequential")

sequential.mean_diff.plot(color_col="Gender");
pylab.show()

[Figure: sequential]

Example of baseline repeated measures

baseline = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ), repeated_measures = "baseline")

baseline.mean_diff.plot(color_col="Gender");
pylab.show()

[Figure: baseline]

Example of shared control experiment

shared = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ))

shared.mean_diff.plot(color_col="Gender");
pylab.show()

[Figure: shared-control]

The numerical results for rm(baseline) and sc are attached here for reference:

sc

shared.mean_diff

rm

baseline.mean_diff

adamcc commented 3 years ago

Thanks, looking great. I think the error bars aren't shrinking from sc to rm because the data are synthetic. We should do some tests with real data.
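One way to see why synthetic data would behave this way: when each column is drawn independently, there is no within-subject correlation for the paired analysis to remove, because Var(after - before) = Var(after) + Var(before) - 2 Cov(before, after). The simulation below is a rough sketch under assumed parameters (subject offsets with SD 1.0, measurement noise with SD 0.2); the function name is hypothetical and nothing here comes from DABEST itself.

```python
import random
from statistics import stdev

def sd_of_paired_diff(n, within_subject_link, seed=0):
    """SD of per-subject (after - before) differences in simulated data.

    within_subject_link=True : both timepoints share one subject offset,
                               as repeated measurements of one animal would.
    within_subject_link=False: each column gets its own independent offset,
                               like independently drawn synthetic columns.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n):
        offset_before = rng.gauss(0.0, 1.0)
        # Reuse the offset for linked data; draw a fresh one otherwise.
        offset_after = offset_before if within_subject_link else rng.gauss(0.0, 1.0)
        before = 3.0 + offset_before + rng.gauss(0.0, 0.2)
        after = 3.5 + offset_after + rng.gauss(0.0, 0.2)
        diffs.append(after - before)
    return stdev(diffs)

linked = sd_of_paired_diff(2000, within_subject_link=True)
independent = sd_of_paired_diff(2000, within_subject_link=False)
print(f"linked columns:      SD of differences = {linked:.2f}")
print(f"independent columns: SD of differences = {independent:.2f}")
```

Under these assumed parameters the linked SD should land near sqrt(0.08) ≈ 0.28 and the independent one near sqrt(2.08) ≈ 1.44, so only data with a real subject effect would show narrower paired intervals.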

LI-Yixuan commented 3 years ago

Ok noted! I will look for proper real data and test on these plots again :)

LI-Yixuan commented 3 years ago

Hi Prof, here is the plot using the real data "0to2_beforeduringafter.csv":

import numpy as np
import pandas as pd
import dabest
import pylab
import matplotlib.pyplot as plt

data = pd.read_csv("0to2_beforeduringafter.csv").dropna()
data = data.rename(columns={'120beforeFeedSpeed_mm/s_Mean':"before",
                            'duringFeedSpeed_mm/s_Mean':"during",
                            '120afterFeedSpeed_mm/s_Mean':"after"})

Example of Sequential rm plot:

sequential = dabest.load(data, id_col = 'ChamberID', idx=("before", "during", "after"),
                                        repeated_measures = "sequential")

sequential.mean_diff.plot(color_col="Sex");
pylab.show()

[Figure: sequential_real]

Example of Baseline rm plot

baseline = dabest.load(data, id_col = 'ChamberID', idx=("before", "during", "after"),
                                        repeated_measures = "baseline")

baseline.mean_diff.plot(color_col="Sex");
pylab.show()

[Figure: baseline_real]

Example of Shared control plot

# example of shared-control plot
shared = dabest.load(data, id_col = 'ChamberID', idx=("before", "during", "after"))

shared.mean_diff.plot(color_col="Sex", raw_marker_size=1.7);
pylab.show()

[Figure: shared_real]

adamcc commented 3 years ago

OK great, that's behaving the way I expected. It seems ready.

LI-Yixuan commented 3 years ago

OK great, that's behaving the way I expected. It seems ready.

Ok noted! While waiting for implementation suggestions from Joses, I will start writing the unit tests and tutorial docs he mentioned, as a wrap-up to this function.

josesho commented 3 years ago

Hi @LI-Yixuan and @adamcc ,

Repeated measures plots are a special version of paired plots. I think (conceptually) treating them as different plots is a bad design strategy that makes things more complicated. The current 2-group paired plots are basically 2-group baseline-corrected plots.

So rather than introducing a new keyword repeated_measures, we should introduce more options to the current paired keyword. As of now it takes True or False; we can instead change the options to None, sequential, and baseline.
I think the baseline plot should actually look like a multi-paired plot rather than a sequential plot. ie: except that the current DABEST doesn't accept repeated columns in idx; @LI-Yixuan, could you consider relaxing this restriction?
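As a rough sketch of the first suggestion, a single paired option could decide which columns are compared. The helper below is hypothetical (the name `comparison_pairs` and its signature are assumptions for illustration, not DABEST's API); it only lists the (control, test) pairs each option would imply for one idx tuple.

```python
def comparison_pairs(idx, paired=None):
    """List the (control, test) column pairs implied by one idx tuple.

    paired=None or "baseline": every later group against the first group.
    paired="sequential":       every group against the one just before it.
    """
    if paired not in (None, "baseline", "sequential"):
        raise ValueError("paired must be None, 'baseline' or 'sequential'")
    if paired == "sequential":
        return list(zip(idx[:-1], idx[1:]))
    return [(idx[0], test) for test in idx[1:]]

idx = ("before", "during", "after")
print(comparison_pairs(idx, "baseline"))
# [('before', 'during'), ('before', 'after')]
print(comparison_pairs(idx, "sequential"))
# [('before', 'during'), ('during', 'after')]
```

In this sketch None and "baseline" select the same pairs; they would differ only in whether the differences are taken within subjects. With a two-group idx, all three options yield the same single pair, which matches the remark that the current 2-group paired plots are already baseline-corrected.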

Other than that, great work! I will test more intensively nearer the end of this week, fingers crossed!

LI-Yixuan commented 3 years ago

Hi @josesho Thank you so much for your suggestions!

Regarding the first point you mentioned:

I also feel the keyword paired may be confusing to users, because paired currently means "the comparison involves only 2 groups", the same as before, while the computation of effect sizes for repeated measures experiments is conceptually paired as well. From this point of view, I think perhaps we could remove it.
If paired is removed, then the implementation will be almost the same as #114. Would that be a preferable design? (#114 does not have is_paired either; I use repeated_measures to replace is_paired.)
I am not very familiar with experimental designs and therefore not sure when people make use of the paired plot. I am thinking about the case where there are only 2 groups: control and test. I am not sure whether users might find it a bit strange to specify repeated_measures as "sequential"/"baseline" to get a paired plot... Will this be a bit unnatural to users? Or shall we add another value, "pairwise", to repeated_measures?

Regarding the second point:

Just a quick clarification on the baseline rm plot: instead of a sequential plot, a better plot would be like this (I photoshopped the column names):
```
baseline = dabest.load(data, id_col = 'ID', idx=("Control_1", "test_1", "test_2"),
                                    repeated_measures = "baseline")
```
Do I understand correctly?

And yes no worries! Just take your time and have a look at the code when it is convenient for you! Many thanks to you!

josesho commented 3 years ago

Closing for now as #114 supersedes it

ACCLAB / DABEST-python