clarkedavida closed this 7 months ago
The code in this toolbox handles quite general functions, including functions that return tuples of ndarrays. That is a nasty combination, because a list of tuples of ndarrays can only be flattened into a single ndarray if the ndarrays within each tuple have equal shape, and the tests contain cases where they do not. A fast version of the jackknife that passes the tests is the following:
def jackknife(f, data, numb_blocks=20, conf_axis=1, nproc=None, args=()):
    if numb_blocks <= 1:  # this is how David decided to specify using blocks of size 1
        # TODO test this better, as the current test using "simple_mean" could pass by accident
        J = [f(data[i]) for i in range(len(data))]
        total = f(data, *args)
        if type(J[0]) is tuple:
            # transpose J
            J = list(zip(*J))  # slow!
            # now compute the mean for each one separately
            return [numb_blocks * np.array(t) - (numb_blocks - 1) * np.mean(j, axis=0) for j, t in zip(J, total)],\
                   [np.std(j, axis=0, ddof=1) for j in J]
        else:
            J = np.array(J)
            return numb_blocks * np.array(total) - (numb_blocks - 1) * np.mean(J, axis=0), np.std(J, axis=0, ddof=1)
        # the following doesn't work for some reason...
        # the difference is in np.std(J, axis=0, ddof=1) != np.std(j, axis=0)*(numb_blocks - 1)**.5
        #numb_blocks = np.shape(data)[conf_axis]
    data = np.asarray(data)
    n = data.shape[conf_axis]
    total = f(data, *args)
    block_id = np.linspace(0, numb_blocks, n, endpoint=False).astype(np.int32)
    J = [f(np.compress((block_id != i), data, axis=conf_axis), *args)
         for i in range(numb_blocks)]
    if type(J[0]) is tuple:
        # in David's code f can produce multiple differently sized outputs
        # -> transpose J
        J = list(zip(*J))
        # now compute the mean for each one separately
        return [numb_blocks * np.array(t) - (numb_blocks - 1) * np.mean(j, axis=0) for j, t in zip(J, total)],\
               [np.std(j, axis=0)*(numb_blocks - 1)**.5 for j in J]
    else:
        J = np.array(J)
        return numb_blocks * np.array(total) - (numb_blocks - 1) * np.mean(J, axis=0), np.std(J, axis=0)*(numb_blocks - 1)**.5
This performs almost as well as the jackknife2 from above, but has all the features. I'm not sure about the behavior for numb_blocks=1, as the test only covered the mean and I didn't test a more complicated function. It could very well be wrong.
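One way to probe the blocks-of-size-1 case is to compare a plain delete-one jackknife against a result that is known analytically. The sketch below is not the toolbox code: delete_one_jackknife is a hypothetical helper, and it assumes 1-D data with f taking a single array. For the mean, the bias-corrected estimate should reproduce the sample mean and the jackknife error should equal the usual standard error std(ddof=1)/sqrt(n).

```python
import numpy as np

def delete_one_jackknife(f, data):
    """Hypothetical helper: delete-one jackknife for 1-D data."""
    data = np.asarray(data)
    n = data.shape[0]
    total = f(data)
    # evaluate f with one sample left out at a time
    J = np.array([f(np.delete(data, i)) for i in range(n)])
    # bias-corrected estimate
    estimate = n * total - (n - 1) * np.mean(J, axis=0)
    # jackknife error: sqrt((n-1)/n * sum((J - mean(J))**2))
    error = np.std(J, axis=0) * (n - 1) ** 0.5
    return estimate, error

rng = np.random.default_rng(0)
x = rng.normal(size=50)

est, err = delete_one_jackknife(np.mean, x)
sem = np.std(x, ddof=1) / np.sqrt(len(x))  # analytic standard error of the mean

# a nonlinear function, where no simple closed form is assumed
est_sq, err_sq = delete_one_jackknife(lambda d: np.mean(d) ** 2, x)
```

For the mean, est should agree with np.mean(x) and err with sem up to rounding. A test using a nonlinear f, as in the last line, would be less likely to pass by accident.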
I will likely not come back to this issue anytime soon, so if you want this in the toolbox, you will have to take it from here. The most important thing for performance is avoiding copies as much as possible, as well as not materializing Python iterators, e.g. with list(range(...)).
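As a small illustration of the two ingredients used in the function above, the block assignment via np.linspace and the transposition for tuple-valued f, here is a minimal sketch. The function mean_and_cov and the array shapes are made up for this example and are not part of the toolbox.

```python
import numpy as np

def mean_and_cov(d):
    # hypothetical f returning two differently shaped ndarrays
    return np.mean(d, axis=1), np.cov(d)

rng = np.random.default_rng(1)
data = rng.normal(size=(3, 10))  # 3 observables, 10 configurations
numb_blocks, conf_axis = 5, 1
n = data.shape[conf_axis]

# assign every configuration to a block in one vectorized step
block_id = np.linspace(0, numb_blocks, n, endpoint=False).astype(np.int32)

# evaluate f with one block left out at a time
J = [mean_and_cov(np.compress(block_id != i, data, axis=conf_axis))
     for i in range(numb_blocks)]

# the outputs have shapes (3,) and (3, 3), so J cannot be stacked into
# one regular ndarray; transpose into one sequence per output instead
J = list(zip(*J))
errors = [np.std(j, axis=0) * (numb_blocks - 1) ** 0.5 for j in J]
```

Here block_id groups consecutive configurations into equally sized blocks, and each entry of errors has the same shape as the corresponding output of f.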
alright thank you again for catching this slowdown and looking into it. i will try to adapt this to the current jackknife
done. thanks henrik
parallelizing doesn't seem to help actually... see for instance this code by @redweasel: