cgevans / scikits-bootstrap

Python/numpy bootstrap confidence interval estimation.

Error raised with pandas data frame #34

Open Federico2111 opened 2 years ago

Federico2111 commented 2 years ago

Hello,

When the input data is a pandas data frame, an error is raised:

  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scikits/bootstrap/bootstrap.py", line 179, in ci
    lengths = [x.shape[0] for x in tdata]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scikits/bootstrap/bootstrap.py", line 179, in <listcomp>
    lengths = [x.shape[0] for x in tdata]
IndexError: tuple index out of range

A comment in the code explains why:

    # Ensure that the data is actually an array. This isn't nice to pandas,
    # but pandas seems much much slower and the indices become a problem.
    if multi and isinstance(data, Iterable):
        tdata: "Tuple[NDArrayAny, ...]" = tuple(np.array(x) for x in data)
        lengths = [x.shape[0] for x in tdata]

Any suggestion?
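
One plausible reading of the traceback, assuming multi was set when the data frame was passed in: iterating over a pandas data frame yields its column labels rather than its columns, so each np.array(x) in the quoted code becomes a zero-dimensional string array with an empty shape tuple. The sketch below reproduces that mechanism with numpy and pandas only; the column names are made up for illustration and it does not call scikits.bootstrap itself.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data frame; the column names are illustrative only.
df = pd.DataFrame({"subject": [1, 2, 3], "score": [0.5, 1.2, 0.9]})

# Iterating over a DataFrame yields the column labels ("subject", "score"),
# so each element becomes a 0-d string array instead of a column of values.
as_arrays = tuple(np.array(x) for x in df)
shapes = [a.shape for a in as_arrays]  # every shape is the empty tuple ()

# Indexing an empty shape tuple is what raises the IndexError.
try:
    lengths = [a.shape[0] for a in as_arrays]
except IndexError as exc:
    message = str(exc)

print(shapes, message)
```

If this is what happened, passing the columns as separate arrays (rather than the data frame itself) avoids the problem.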

cgevans commented 2 years ago

This doesn't appear to be caused simply by passing a data frame. For example, the following works:

import pandas as pd
import numpy as np
import scikits.bootstrap as boot
boot.ci(pd.DataFrame(np.random.randn(100)))

Federico2111 commented 2 years ago

I am trying to bootstrap the eta squared effect size, calculated with these pingouin ANOVA functions:

https://pingouin-stats.org/generated/pingouin.anova.html
https://pingouin-stats.org/generated/pingouin.rm_anova.html

The input to these functions has to be a pandas data frame, structured as described in their documentation. The pingouin data sets mentioned in the examples show exactly how the data frames have to be structured.

If I pass these pandas data frames to your bootstrap as the input data, with one of these ANOVA functions as the statistic function, the error I shared above is raised.

I solved the problem as follows. I wrote a function that takes the raw data sets as plain arrays rather than as a pandas data frame. Inside the function, the arrays are assembled into a pandas data frame, which is passed to the ANOVA function to calculate eta squared, and my function returns that value. I then called your bootstrap with the raw arrays as data and my function as the statistic function. This way no pandas data frame is ever passed to the bootstrap itself, so the error is avoided. This approach works correctly and I get the bootstrap confidence interval around eta squared.
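
A minimal sketch of such a wrapper, under stated assumptions: the function name eta_squared, the column names, and the simulated data are all hypothetical, and the one-way between-subjects eta squared (SS_between / SS_total) is computed by hand as a stand-in for the pingouin call so that the example runs with numpy and pandas alone.

```python
import numpy as np
import pandas as pd

def eta_squared(subject, group, value):
    """Rebuild a long-format data frame from raw arrays and return eta squared.

    Hypothetical helper: the hand-computed ratio below stands in for the
    ANOVA call used in the thread (one-way, between-subjects design assumed).
    """
    df = pd.DataFrame({"subject": subject, "group": group, "value": value})
    grand_mean = df["value"].mean()
    grouped = df.groupby("group")["value"]
    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((df["value"] - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Simulated raw data: 60 subjects in three groups of 20.
rng = np.random.default_rng(0)
subject = np.arange(60)
group = np.repeat(["a", "b", "c"], 20)
value = rng.normal(size=60) + np.repeat([0.0, 0.5, 1.0], 20)

eta2 = eta_squared(subject, group, value)
print(eta2)

# With scikits-bootstrap installed, the call described above would look
# roughly like this (left commented out so the sketch has no extra dependency):
# import scikits.bootstrap as boot
# low, high = boot.ci((subject, group, value), statfunction=eta_squared, multi="paired")
```

The key design point is that the bootstrap only ever sees a tuple of plain arrays; the data frame exists only inside the statistic function.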

The important point is that, when using your bootstrap this way, multi needs to be set to "paired". With multi="paired", the input arrays are resampled together, so the correspondence among the values at a given index across the arrays is maintained. This is necessary to rebuild a correct pandas data frame inside my function before passing it to the ANOVA function, since the arrays have to match index by index (participant number (subject), measured value (dependent variable), between/within factor). This correspondence is not maintained with multi="independent", where the arrays are resampled separately and have unequal length. In that case it is not possible to rebuild a correct data frame, and an error is also raised because arrays of unequal size are passed to the ANOVA function.
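
The difference between the two modes can be illustrated with plain numpy. This is only a sketch of the idea (one shared set of resampling indices versus a separate set per array), not the library's internal code, and the toy arrays are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy linked arrays: each score belongs to the subject at the same index.
subject = np.array([1, 2, 3, 4, 5])
score = 10 * subject

# Paired resampling: one set of indices is applied to every array,
# so each resampled score still belongs to its resampled subject.
idx = rng.integers(0, len(subject), size=len(subject))
paired_subject, paired_score = subject[idx], score[idx]
paired_ok = bool(np.all(paired_score == 10 * paired_subject))

# Independent resampling: each array gets its own indices,
# so the subject/score correspondence is generally lost.
indep_subject = subject[rng.integers(0, len(subject), size=len(subject))]
indep_score = score[rng.integers(0, len(score), size=len(score))]

print(paired_ok)
print(indep_subject, indep_score)
```

Only the paired version produces rows that can be reassembled into a valid long-format data frame.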