Closed LI-Yixuan closed 3 years ago
Thanks for this.
Before we merge this, a few things to be done:
The `is_preceding` keyword. Since this is a repeated measures plot, can we change it to `repeated_measure` instead? This is more informative than `is_preceding`.
Looking forward to your additions to this PR!
Joses
Hi @LI-Yixuan ,
could you include a copy-pastable example of what new functionality is included and how to use it?
eg:
import dabest
dabest.load(iris)
# continue code...
Hi! Here is an example:
import numpy as np
import pandas as pd
import dabest
import pylab
from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population
# Create samples
d0 = norm.rvs(loc=3, scale=0.4, size=Ns)
d1 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d2 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d3 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
d4 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
d5 = norm.rvs(loc=3, scale=0.75, size=Ns)
d6 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d7 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d8 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer count; a float count is rejected by np.repeat
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males
# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))
# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Day0' : d0, 'Day1' : d1,
'Day2' : d2, 'Day3' : d3,
'Day4' : d4, 'Day5' : d5,
'Day6' : d6, 'Day7' : d7,
'Day8' : d8,
'Gender': gender, 'ID' : id_col
})
#################### test on repeated_measures = "baseline" example
baseline = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
("Day2", "Day3","Day4"),
("Day5", "Day6","Day7", "Day8")), repeated_measures = "baseline"
)
baseline.cohens_d.plot(color_col="Gender")
pylab.show()
#################### test on repeated_measures = "sequential" example
sequential = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
("Day2", "Day3","Day4"),
("Day5", "Day6","Day7", "Day8")), repeated_measures = "sequential"
)
print(sequential.median_diff)
sequential.median_diff.plot(color_col="Gender")
pylab.show()
Hi @LI-Yixuan ,
Thanks,
Can you demonstrate the syntax (and plots) in the case where the datasets are not paired?
Joses
Looking good!
Shouldn't both baseline and sequential use the slopegraph for the observed values?
Hi! The 'paired' parameter is used in the same way as in the current package. Here is an example for unpaired data:
import numpy as np
import pandas as pd
import dabest
import pylab
from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population
# Create samples
d0 = norm.rvs(loc=3, scale=0.4, size=Ns)
d1 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d2 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d3 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
d4 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
d5 = norm.rvs(loc=3, scale=0.75, size=Ns)
d6 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
d7 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
d8 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer count; a float count is rejected by np.repeat
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males
# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))
# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Day0' : d0, 'Day1' : d1,
'Day2' : d2, 'Day3' : d3,
'Day4' : d4, 'Day5' : d5,
'Day6' : d6, 'Day7' : d7,
'Day8' : d8,
'Gender': gender, 'ID' : id_col
})
#################### test on unpaired example (paired is not set)
unpaired = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
("Day2", "Day3"),("Day4","Day5", "Day6")))
unpaired.cohens_d.plot(color_col="Gender")
pylab.show()
And here is an example for paired but not repeated measures:
#################### test on paired example (paired = True)
paired = dabest.load(df, id_col = "ID", idx=(("Day0", "Day1"),
("Day2", "Day3"),("Day4","Day5")), paired = True)
paired.hedges_g.plot(color_col="Gender")
pylab.show()
Hi Prof, thanks for pointing it out! I will change this in my code :)
@adamcc not for the baseline plots. You'd have to do "Day1-Day2", "Day1-Day3", i.e. a slopegraph for each pair. You could still have a hub-and-spoke type layout, but that would render the slopegraphs impossible.
In paired data, the slopes have two functions: yes, they indicate the individual ∆s between timepoints, but they also show which data come from which subjects. This is a critical feature, so even if the group ∆s are calculated as Day2-Day1, Day3-Day1, and so on, I think the slopegraphs should be retained for both flavors of repeated measures plots.
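To make the two flavors of ∆ concrete, here is a minimal NumPy sketch of how they could be computed per subject. It is not dabest's implementation: the function names `baseline_deltas` and `sequential_deltas` and the small `measurements` array are made up for illustration. Both versions subtract within a row, which is why knowing which data come from which subject matters in either case.

```python
import numpy as np

def baseline_deltas(data):
    """Per-subject differences of every later timepoint against the first."""
    data = np.asarray(data, dtype=float)
    return data[:, 1:] - data[:, [0]]

def sequential_deltas(data):
    """Per-subject differences between consecutive timepoints."""
    data = np.asarray(data, dtype=float)
    return np.diff(data, axis=1)

# Rows are subjects, columns are timepoints (made-up values).
measurements = np.array([
    [3.0, 3.5, 3.25],
    [2.0, 2.5, 3.0],
    [4.0, 4.5, 5.0],
])

# Group-level mean differences for each flavor.
print(baseline_deltas(measurements).mean(axis=0))
print(sequential_deltas(measurements).mean(axis=0))
```

The first comparison is identical in both flavors; they only diverge from the third timepoint onward.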
Here is the demo code for the 7th commit:
import numpy as np
import pandas as pd
import dabest
import pylab
import matplotlib.pyplot as plt
from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
Ns = 20 # The number of samples taken from each population
c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
females = np.repeat('Female', Ns // 2).tolist()  # integer count; a float count is rejected by np.repeat
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males
id_col = pd.Series(range(1, Ns+1))
# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control 1' : c1, 'Test 1' : t1,
'Control 2' : c2, 'Test 2' : t2,
'Control 3' : c3, 'Test 3' : t3,
'Test 4' : t4, 'Test 5' : t5, 'Test 6' : t6,
'Gender' : gender, 'ID' : id_col
})
pair = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1"),
("Control 2", "Test 2"),
("Control 3", "Test 3"))
, paired = True)
pair.median_diff.plot(color_col="Gender");
pylab.show()
sequential = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
"Test 2","Test 3"),
("Control 2", "Test 4","Test 5", "Test 6")
), repeated_measures = "sequential")
sequential.mean_diff.plot(color_col="Gender");
pylab.show()
baseline = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
"Test 2","Test 3"),
("Control 2", "Test 4","Test 5", "Test 6")
), repeated_measures = "baseline")
baseline.cohens_d.plot(color_col="Gender");
pylab.show()
shared = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
"Test 2","Test 3"),
("Control 2", "Test 4","Test 5", "Test 6")
))
shared.hedges_g.plot(color_col="Gender");
pylab.show()
Looking good. But I'm curious about the shared control (sc) versus the baseline repeated measures (rm). I think the rm ∆ curves should be smaller than the sc curves, but since all the effect sizes are different I can't be sure what is going on. Can you please post a version with just the mean difference in all cases?
Hi Prof, here is the example of the 4 versions of plots of mean_difference:
import numpy as np
import pandas as pd
import dabest
import pylab
import matplotlib.pyplot as plt
from scipy.stats import norm
np.random.seed(9999) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
Ns = 20 # The number of samples taken from each population
# Create samples
c1 = norm.rvs(loc=3, scale=0.4, size=Ns)
c2 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
c3 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t1 = norm.rvs(loc=3.5, scale=0.5, size=Ns)
t2 = norm.rvs(loc=2.5, scale=0.6, size=Ns)
t3 = norm.rvs(loc=3, scale=0.75, size=Ns)
t4 = norm.rvs(loc=3.5, scale=0.75, size=Ns)
t5 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
t6 = norm.rvs(loc=3.25, scale=0.4, size=Ns)
# Add a `gender` column for coloring the data.
females = np.repeat('Female', Ns // 2).tolist()  # integer count; a float count is rejected by np.repeat
males = np.repeat('Male', Ns // 2).tolist()
gender = females + males
# Add an `id` column for paired data plotting.
id_col = pd.Series(range(1, Ns+1))
# Combine samples and gender into a DataFrame.
df = pd.DataFrame({'Control 1' : c1, 'Test 1' : t1,
'Control 2' : c2, 'Test 2' : t2,
'Control 3' : c3, 'Test 3' : t3,
'Test 4' : t4, 'Test 5' : t5, 'Test 6' : t6,
'Gender' : gender, 'ID' : id_col
})
# example of pair plot
pair = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1"),
("Control 2", "Test 2"),
("Control 3", "Test 3"))
, paired = True)
pair.mean_diff.plot(color_col="Gender");
pylab.show()
sequential = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
"Test 2","Test 3"),
("Control 2", "Test 4","Test 5", "Test 6")
), repeated_measures = "sequential")
sequential.mean_diff.plot(color_col="Gender");
pylab.show()
baseline = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
"Test 2","Test 3"),
("Control 2", "Test 4","Test 5", "Test 6")
), repeated_measures = "baseline")
baseline.mean_diff.plot(color_col="Gender");
pylab.show()
shared = dabest.load(df, id_col = "ID", idx=(("Control 1", "Test 1",
"Test 2","Test 3"),
("Control 2", "Test 4","Test 5", "Test 6")
))
shared.mean_diff.plot(color_col="Gender");
pylab.show()
The numerical results for rm(baseline) and sc are attached here for reference:
shared.mean_diff
baseline.mean_diff
Thanks, looking great. I think the reason the error bars aren't shrinking from sc to rm is that it is synthetic data. We should do some tests with real data.
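One way to see why synthetic data would behave this way: the columns in the examples above are drawn independently, so there is no within-subject correlation for a paired analysis to exploit. Since Var(after - before) = Var(after) + Var(before) - 2 Cov(after, before), the spread of the paired differences only shrinks when the covariance is positive. The sketch below is a rough illustration under assumed parameters (the per-subject offset and all variable names are made up, not taken from the PR):

```python
import numpy as np

rng = np.random.default_rng(9999)
n_subjects = 2000

# Case 1: independent columns, like the synthetic examples in this thread.
indep_before = rng.normal(3.0, 0.5, n_subjects)
indep_after = rng.normal(3.5, 0.5, n_subjects)

# Case 2: each subject carries its own offset into both measurements
# (an assumed stand-in for what real repeated measures tend to look like).
subject_offset = rng.normal(0.0, 0.5, n_subjects)
corr_before = 3.0 + subject_offset + rng.normal(0.0, 0.1, n_subjects)
corr_after = 3.5 + subject_offset + rng.normal(0.0, 0.1, n_subjects)

def sd_of_paired_diff(before, after):
    """Standard deviation of the per-subject differences."""
    return np.std(after - before, ddof=1)

def sd_ignoring_pairing(before, after):
    """Spread of the difference if the two groups were treated as independent."""
    return np.sqrt(np.var(before, ddof=1) + np.var(after, ddof=1))

print("independent:", sd_of_paired_diff(indep_before, indep_after),
      sd_ignoring_pairing(indep_before, indep_after))
print("correlated: ", sd_of_paired_diff(corr_before, corr_after),
      sd_ignoring_pairing(corr_before, corr_after))
```

With independent columns the two numbers should come out close to each other; with a shared subject offset the paired spread should be much smaller.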
Ok noted! I will look for proper real data and test on these plots again :)
Hi Prof, here is the plot using the real data "0to2_beforeduringafter.csv":
import numpy as np
import pandas as pd
import dabest
import pylab
import matplotlib.pyplot as plt
data = pd.read_csv("0to2_beforeduringafter.csv").dropna()
data = data.rename(columns={'120beforeFeedSpeed_mm/s_Mean':"before",
'duringFeedSpeed_mm/s_Mean':"during",
'120afterFeedSpeed_mm/s_Mean':"after"})
sequential = dabest.load(data, id_col = 'ChamberID', idx=("before", "during", "after"),
repeated_measures = "sequential")
sequential.mean_diff.plot(color_col="Sex");
pylab.show()
baseline = dabest.load(data, id_col = 'ChamberID', idx=("before", "during", "after"),
repeated_measures = "baseline")
baseline.mean_diff.plot(color_col="Sex");
pylab.show()
# example of shared-control plot
shared = dabest.load(data, id_col = 'ChamberID', idx=("before", "during", "after"))
shared.mean_diff.plot(color_col="Sex", raw_marker_size=1.7);
pylab.show()
OK great, that's behaving the way I expected. It seems ready.
Ok noted! While waiting for implementation suggestions from Joses, I will start writing the unit tests and the tutorial doc he mentioned, as a wrap-up to this function.
Hi @LI-Yixuan and @adamcc ,
Repeated measures plots are a special version of paired plots. I think that (conceptually) treating them as different plots is a bad design strategy and makes things more complicated. The current 2-group paired plots are basically 2-group baseline-corrected plots.
So rather than introducing a new keyword `repeated_measures`, we should introduce more options to the current `paired` keyword. As of now it takes `True` or `False`; we can instead change the options to `None`, `sequential`, and `baseline`.
I think the baseline plot should actually look like a multi-paired plot rather than a sequential plot. ie:
except that the current DABEST doesn't accept repeated columns in `idx`; @LI-Yixuan, could you consider relaxing this restriction?
Other than that, great work! I will test more intensively nearer the end of this week, fingers crossed!
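As a rough sketch of the keyword proposal above, a single `paired` option taking `None`, `"baseline"`, or `"sequential"` could decide which columns within one `idx` group get compared. The helper `comparison_pairs` below is hypothetical and written only to illustrate the idea; it is not part of dabest and makes no claim about how the PR implements it.

```python
def comparison_pairs(idx_group, paired=None):
    """Return (control, test) column-name pairs for one idx group.

    `paired` follows the options proposed in this thread: None,
    "baseline", or "sequential". Hypothetical helper, not dabest's API.
    """
    if len(idx_group) < 2:
        raise ValueError("need at least two columns to compare")
    if paired in (None, "baseline"):
        # Every later column is compared against the first one.
        # (None and "baseline" share the same pairs; they would differ
        # only in whether the differences are taken per subject.)
        return [(idx_group[0], name) for name in idx_group[1:]]
    if paired == "sequential":
        # Each column is compared against the one just before it.
        return list(zip(idx_group[:-1], idx_group[1:]))
    raise ValueError(f"unknown value for paired: {paired!r}")

days = ("Day0", "Day1", "Day2")
print(comparison_pairs(days, paired="baseline"))
print(comparison_pairs(days, paired="sequential"))
```

Keeping the dispatch in one place like this is one way to avoid treating repeated measures as a separate kind of plot.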
Hi @josesho, thank you so much for your suggestions!
`paired` may be confusing to users because the current meaning of `paired` is "the comparison involves only 2 groups", just the same as before. But the computation of effect size for repeated measures experiments is conceptually paired as well. From this point of view, I think perhaps we could remove it.
If `paired` is removed, then the implementation will be almost the same as #114. Would that be a preferable design? (#114 does not have `is_paired` either; I use `repeated_measures` to replace `is_paired`.)
`repeated_measures` with "sequential"/"baseline" to get a paired plot... Will this be a bit unnatural to users? Or shall we add another value, "pairwise", to `repeated_measures`?
baseline = dabest.load(data, id_col = 'ID', idx=("Control_1", "test_1", "test_2"),
repeated_measures = "baseline")
Do I understand correctly?
And yes no worries! Just take your time and have a look at the code when it is convenient for you! Many thanks to you!
Closing for now as #114 supersedes it
Repeated measures function achieved: by default, the effect size is obtained via baseline subtraction.