ACCLAB / DABEST-python

Data Analysis with Bootstrapped ESTimation
https://acclab.github.io/DABEST-python/
Apache License 2.0

Repeated measures function #112

Closed LI-Yixuan closed 3 years ago

LI-Yixuan commented 3 years ago

Implemented the repeated measures function: by default, the effect size is computed by subtracting the baseline.

josesho commented 3 years ago

Thanks for this.

Before we merge this, a few things to be done:

  1. You've added an is_preceding keyword. Since this is a repeated measures plot, can we change it to repeated_measure instead? This is more informative than is_preceding.
  2. After you've done that, could you attach here a minimal reproducible code example to demonstrate what the expected use case and outcome is?
  3. Then, you will want to add a unit test for this new functionality. Ping back here if you need help!
  4. Finally, we'll need to document this inline, as well as in the official docs. Again if you need any assistance, ping back here!

Looking forward to your additions to this PR!

Joses

josesho commented 3 years ago

Hi @LI-Yixuan ,

could you include a copy-pastable example showing what new functionality is included, and how to use it?

eg:

import dabest
dabest.load(iris)
# continue code...

LI-Yixuan commented 3 years ago

Hi! Here is an example:

import numpy as np
import pandas as pd
import dabest
import pylab
from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population

# Create samples
d0 = norm.rvs(loc=3, scale=0.4, size=Ns)
d1 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d2 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d3 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
d4 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
d5 = norm.rvs(loc=3, scale=0.75, size=Ns)
d6 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d7 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d8 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer repeat count
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Day0' : d0,     'Day1' : d1,
                     'Day2' : d2,     'Day3' : d3,
                     'Day4' : d4,     'Day5' : d5,
                     'Day6' : d6,     'Day7' : d7,
                    'Day8' : d8,
                     'Gender': gender, 'ID'  : id_col
                    })
#################### test on repeated_measure = baseline example
baseline = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
                                       ("Day2", "Day3","Day4"),
                                       ("Day5", "Day6","Day7", "Day8")), repeated_measures = "baseline"
                                     )
baseline.cohens_d.plot(color_col="Gender")
pylab.show()

image

#####################repeated_measure = sequential example
sequential = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
                                       ("Day2", "Day3","Day4"),
                                       ("Day5", "Day6","Day7", "Day8")), repeated_measures = "sequential"
                                     )
print(sequential.median_diff)
sequential.median_diff.plot(color_col="Gender")
pylab.show()

sequential
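
As a rough sketch of how the two modes differ (not taken from the PR code), the helper below lists which columns are compared within one `idx` group. The baseline rule follows the "Day2-Day1, Day3-Day1" description later in this thread; the sequential rule (each column against the one before it) is an assumption. `comparison_pairs` and `mode` are hypothetical names, not part of the dabest API.

```python
def comparison_pairs(group, mode):
    """Return the (reference, test) column pairs compared within one idx group.

    Hypothetical helper for illustration only; not part of dabest.
    """
    if mode == "baseline":
        # Every later column is compared against the first column.
        return [(group[0], col) for col in group[1:]]
    if mode == "sequential":
        # Each column is compared against the one immediately before it.
        return list(zip(group[:-1], group[1:]))
    raise ValueError(f"unknown mode: {mode!r}")


group = ("Day2", "Day3", "Day4")
print(comparison_pairs(group, "baseline"))    # [('Day2', 'Day3'), ('Day2', 'Day4')]
print(comparison_pairs(group, "sequential"))  # [('Day2', 'Day3'), ('Day3', 'Day4')]
```

For a two-column group such as ("Day0", "Day1") both modes give the same single pair.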

josesho commented 3 years ago

Hi @LI-Yixuan ,

Thanks,

Can you demonstrate the syntax (and plots) in the case where the datasets are not paired?

Joses

adamcc commented 3 years ago

Looking good!

Shouldn't both baseline and sequential use the slopegraph for the observed values?

LI-Yixuan commented 3 years ago

> Hi @LI-Yixuan ,
>
> Thanks,
>
> Can you demonstrate the syntax (and plots) in the case where the datasets are not paired?
>
> Joses

Hi! The 'paired' parameter is used in the same way as in the current package. Here is an example for unpaired data:

import numpy as np
import pandas as pd
import dabest
import pylab

from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population

# Create samples
d0 = norm.rvs(loc=3, scale=0.4, size=Ns)
d1 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d2 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d3 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
d4 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
d5 = norm.rvs(loc=3, scale=0.75, size=Ns)
d6 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d7 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d8 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer repeat count
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Day0' : d0,     'Day1' : d1,
                     'Day2' : d2,     'Day3' : d3,
                     'Day4' : d4,     'Day5' : d5,
                     'Day6' : d6,     'Day7' : d7,
                    'Day8' : d8,
                     'Gender': gender, 'ID'  : id_col
                    })

#################### test on unpaired example (paired not set)
unpaired = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
                                       ("Day2", "Day3"),("Day4","Day5", "Day6")))
unpaired.cohens_d.plot(color_col="Gender")
pylab.show()

unpaired

And here is an example for paired but not repeated measures:

#################### test on paired example (paired = True)
paired = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
                                       ("Day2", "Day3"),("Day4","Day5")), paired = True)

paired.hedges_g.plot(color_col="Gender")
pylab.show()

paired

LI-Yixuan commented 3 years ago

> Looking good!
>
> Shouldn't both baseline and sequential use the slopegraph for the observed values?

Hi Prof, thanks for pointing it out! I will change this in my code :)

josesho commented 3 years ago

> Looking good! Shouldn't both baseline and sequential use the slopegraph for the observed values?

@adamcc not for the baseline plots. You'd either have to do "Day1-Day2", "Day1-Day3", i.e. a slope graph plot for each pair, or you could still have a hub-and-spoke type layout, which would render the slope graphs impossible.

adamcc commented 3 years ago

In paired data, the slopes have two functions: yes, they indicate the individual ∆s between timepoints, but they also show which data come from which subjects. This is a critical feature, so even if the group ∆s are calculated as Day2-Day1, Day3-Day1, and so on, I think the slopegraphs should be retained for both flavors of repeated measures plots.
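
To make the two flavors concrete, here is a minimal NumPy sketch (an illustration with made-up data, not the PR's implementation). Each row is one subject, which is the same subject-to-data link that a slopegraph line conveys. The array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects = 5

# Hypothetical repeated measurements: one row per subject, columns Day1..Day3.
days = rng.normal(loc=3.0, scale=0.5, size=(n_subjects, 3))

# Baseline flavor: every later day minus Day1, per subject.
baseline_deltas = days[:, 1:] - days[:, [0]]

# Sequential flavor: each day minus the previous day, per subject.
sequential_deltas = np.diff(days, axis=1)

# The first comparison (Day2 - Day1) is identical in both flavors.
assert np.allclose(baseline_deltas[:, 0], sequential_deltas[:, 0])

# For each subject, baseline deltas are running sums of the sequential deltas.
assert np.allclose(baseline_deltas, np.cumsum(sequential_deltas, axis=1))

print(baseline_deltas.mean(axis=0))    # group mean differences vs. Day1
print(sequential_deltas.mean(axis=0))  # group mean differences vs. previous day
```

Both flavors are computed from the same per-subject rows, so the pairing information is available in either case.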

LI-Yixuan commented 3 years ago

Here is the demo code for the 7th commit:

import numpy as np
import pandas as pd
import dabest
import pylab
import matplotlib.pyplot as plt

from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
Ns = 20 # The number of samples taken from each population

c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

females = np.repeat('Female', Ns // 2).tolist()  # integer repeat count
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control 1' : c1,     'Test 1' : t1,
                     'Control 2' : c2,     'Test 2' : t2,
                     'Control 3' : c3,     'Test 3' : t3,
                     'Test 4'    : t4,     'Test 5' : t5, 'Test 6' : t6,
                     'Gender'    : gender, 'ID'  : id_col
                    })

# example of pair plot

pair = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1"),
                                           ("Control 2", "Test 2"),
                                           ("Control 3", "Test 3"))
                                     , paired = True)

pair.median_diff.plot(color_col="Gender");
pylab.show()

paired

# example of sequential repeated measures

sequential = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ), repeated_measures = "sequential")

sequential.mean_diff.plot(color_col="Gender");
pylab.show()

sequential

# example of baseline repeated measures

baseline = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ), repeated_measures = "baseline")

baseline.cohens_d.plot(color_col="Gender");
pylab.show()

baseline

# example of shared-control plot

shared = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ))

shared.hedges_g.plot(color_col="Gender");
pylab.show()

shared-control

adamcc commented 3 years ago

Looking good. But I'm curious about the shared control (sc) versus the baseline repeated measures (rm). I think the rm ∆ curves should be smaller than the sc curves, but since all the effect sizes are different I can't be sure what is going on. Can you please post a version with just the mean difference in all cases?

LI-Yixuan commented 3 years ago

> Looking good. But I'm curious about the shared control (sc) versus the baseline repeated measures (rm). I think the rm ∆ curves should be smaller than the sc curves, but since all the effect sizes are different I can't be sure what is going on. Can you please post a version with just the mean difference in all cases?

Hi Prof, here are the 4 versions of the plots, all using mean_diff:

Example of Paired data

import numpy as np
import pandas as pd
import dabest
import pylab
import matplotlib.pyplot as plt

from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population

# Create samples
c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)

# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer repeat count
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males

# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))

# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control 1' : c1,     'Test 1' : t1,
                     'Control 2' : c2,     'Test 2' : t2,
                     'Control 3' : c3,     'Test 3' : t3,
                     'Test 4'    : t4,     'Test 5' : t5, 'Test 6' : t6,
                     'Gender'    : gender, 'ID'  : id_col
                    })

# example of pair plot
pair = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1"),
                                           ("Control 2", "Test 2"),
                                           ("Control 3", "Test 3"))
                                     , paired = True)

pair.mean_diff.plot(color_col="Gender");
pylab.show()

paired

Example of sequential repeated measures

sequential = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ), repeated_measures = "sequential")

sequential.mean_diff.plot(color_col="Gender");
pylab.show()

sequential

Example of baseline repeated measures

baseline = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ), repeated_measures = "baseline")

baseline.mean_diff.plot(color_col="Gender");
pylab.show()

baseline

Example of shared control experiment

shared = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
                                       "Test 2","Test 3"),
                                       ("Control 2", "Test 4","Test 5", "Test 6")
                                     ))

shared.mean_diff.plot(color_col="Gender");
pylab.show()

shared-control

The numerical results for rm(baseline) and sc are attached here for reference:

sc

shared.mean_diff

image

rm

baseline.mean_diff

image

adamcc commented 3 years ago

Thanks, looking great. I think the reason the error bars aren't shrinking from sc to rm is that it is synthetic data. We should do some tests with real data.
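
One way to see why synthetic data can behave this way: the demo columns are drawn independently, so there is no within-subject correlation for a paired analysis to exploit. The sketch below is an illustration under stated assumptions, not dabest's code. It uses a plain percentile bootstrap (a simplification) and a hypothetical shared per-subject offset to create correlated data; `bootstrap_ci_width` is a made-up helper name.

```python
import numpy as np

rng = np.random.default_rng(9999)
n = 20  # number of subjects


def bootstrap_ci_width(control, test, paired, n_boot=2000):
    """Width of a 95% percentile bootstrap CI for the mean difference."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        if paired:
            # Resample subjects, keeping each subject's two values together.
            idx = rng.integers(0, len(control), len(control))
            diffs[i] = np.mean(test[idx] - control[idx])
        else:
            # Resample each group independently of the other.
            c = rng.choice(control, len(control))
            t = rng.choice(test, len(test))
            diffs[i] = t.mean() - c.mean()
    low, high = np.percentile(diffs, [2.5, 97.5])
    return high - low


# Independent columns, as in the synthetic demos above.
control_indep = rng.normal(3.0, 0.5, n)
test_indep = rng.normal(3.5, 0.5, n)

# Correlated columns: a subject-level value shared by both measurements.
subject = rng.normal(3.0, 0.5, n)
control_corr = subject + rng.normal(0.0, 0.1, n)
test_corr = subject + 0.5 + rng.normal(0.0, 0.1, n)

widths = {
    "independent, unpaired": bootstrap_ci_width(control_indep, test_indep, paired=False),
    "independent, paired": bootstrap_ci_width(control_indep, test_indep, paired=True),
    "correlated, unpaired": bootstrap_ci_width(control_corr, test_corr, paired=False),
    "correlated, paired": bootstrap_ci_width(control_corr, test_corr, paired=True),
}
for label, width in widths.items():
    print(f"{label}: {width:.3f}")
```

With the correlated data the paired interval should come out clearly narrower than the unpaired one, whereas with independent columns the two widths should be of similar size.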

LI-Yixuan commented 3 years ago

Ok noted! I will look for proper real data and test on these plots again :)

LI-Yixuan commented 3 years ago

Hi Prof, here is the plot using the real data "0to2_beforeduringafter.csv":

import numpy as np
import pandas as pd
import dabest
import pylab
import matplotlib.pyplot as plt

data = pd.read_csv("0to2_beforeduringafter.csv").dropna()
data = data.rename(columns={'120beforeFeedSpeed_mm/s_Mean':"before",
                            'duringFeedSpeed_mm/s_Mean':"during",
                            '120afterFeedSpeed_mm/s_Mean':"after"})

Example of Sequential rm plot:

sequential = dabest.load(data, id_col = 'ChamberID', idx=("before", "during", "after"),
                                        repeated_measures = "sequential")

sequential.mean_diff.plot(color_col="Sex");
pylab.show()

sequential_real

Example of Baseline rm plot

baseline = dabest.load(data, id_col = 'ChamberID', idx=("before", "during", "after"),
                                        repeated_measures = "baseline")

baseline.mean_diff.plot(color_col="Sex");
pylab.show()

baseline_real

Example of Shared control plot

# example of shared-control plot
shared = dabest.load(data, id_col = 'ChamberID', idx=("before", "during", "after"))

shared.mean_diff.plot(color_col="Sex", raw_marker_size=1.7);
pylab.show()

shared_real

adamcc commented 3 years ago

OK great, that's behaving the way I expected. It seems ready.

LI-Yixuan commented 3 years ago

> OK great, that's behaving the way I expected. It seems ready.

Ok noted! While waiting for implementation suggestions from Joses, I will start writing the unit tests and the tutorial doc he mentioned, as a wrap-up to this function.

josesho commented 3 years ago

Hi @LI-Yixuan and @adamcc ,

Repeated measures plots are a special version of paired plots. I think (conceptually) treating them as different plots is a bad design strategy and makes things more complicated. The current 2-group paired plots are basically 2-group baseline-corrected plots.

  1. So rather than introducing a new keyword repeated_measures, we should introduce more options to the current paired keyword. As of now it takes True or False; we can instead change the options to None, sequential, and baseline.

  2. I think the baseline plot should actually look like a multi-paired plot rather than a sequential plot. ie: except that the current DABEST doesn't accept repeated columns in idx; @LI-Yixuan , you could consider relaxing this restriction?
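
A minimal sketch of what point 1 could look like, assuming the option names listed there. `normalize_paired` is a hypothetical helper, not an existing dabest function, and mapping the old `True` to "baseline" is an assumption based on the remark that 2-group paired plots are baseline-corrected plots.

```python
def normalize_paired(paired):
    """Map old boolean values of `paired` onto the proposed option set.

    Hypothetical helper for illustration; not part of the dabest API.
    """
    if paired is None or paired is False:
        # Unpaired analysis.
        return None
    if paired is True:
        # Assumption: the old paired=True corresponds to baseline correction.
        return "baseline"
    if paired in ("baseline", "sequential"):
        return paired
    raise ValueError(
        f"paired must be None, 'baseline', or 'sequential'; got {paired!r}"
    )


print(normalize_paired(True))          # baseline
print(normalize_paired("sequential"))  # sequential
print(normalize_paired(False))         # None
```

Accepting the old booleans keeps existing calls working while new code uses the string options.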

Other than that, great work! I will test more intensively nearer the end of this week, fingers crossed!

LI-Yixuan commented 3 years ago

Hi @josesho Thank you so much for your suggestions!

Regarding the first point you mentioned:

Regarding the second point:

And yes no worries! Just take your time and have a look at the code when it is convenient for you! Many thanks to you!

josesho commented 3 years ago

Closing for now as #114 supersedes it