LatticeQCD / AnalysisToolbox

A set of Python tools for statistically analyzing data. This includes aspects of lattice QCD applications related to QCD phenomenology.

jackknife is slow #52

Closed clarkedavida closed 7 months ago

clarkedavida commented 7 months ago

Parallelizing doesn't actually seem to help... see for instance this code by @redweasel:

from latqcdtools.statistics.jackknife import jackknife
import numpy as np
import time
import multiprocessing as mp

# jackknife estimator for a simple estimator f
# f - function that takes a data array and returns its estimated value (can be an ndarray)
# data - The data array with the shape [samples, ...]
# blocks - the number of jackknife blocks
def jackknife2(f, data, blocks=20):
    data = np.asarray(data)
    block_id = np.linspace(0, blocks, len(data),
                    endpoint=False).astype(np.int32)
    J = [f(data[block_id != i]) for i in range(blocks)]
    # slower version:
    #J = [f(data[np.concatenate([np.arange(len(data)*i//blocks),
    #                 np.arange(len(data)*(i+1)//blocks, len(data))])]) for i in range(blocks)]
    return blocks*f(data) - (blocks-1)*np.mean(J, axis=0),\
           np.std(J, axis=0)*(blocks-1)**.5

# jackknife estimator for a simple estimator f
# f - function that takes a data array and returns its estimated value (can be an ndarray)
# data - The data array with the shape [samples, ...]
# blocks - the number of jackknife blocks
def jackknife3(f, data, blocks=20):
    data = np.asarray(data)
    block_id = np.linspace(0, blocks, len(data),
                    endpoint=False).astype(np.int32)
    # Sending the data to the worker processes (pickling and copying it) is expensive and
    # takes about as long as the function call itself, so this is only worth it for very
    # slow functions f (i.e. if f is O(n^2)).
    # Note: with the 'spawn' start method (Windows/macOS) the calling code also needs an
    # if __name__ == '__main__': guard.
    with mp.Pool(mp.cpu_count()) as pool:
        J = pool.map(f, (data[block_id != i] for i in range(blocks)))
    return blocks*f(data) - (blocks-1)*np.mean(J, axis=0),\
           np.std(J, axis=0)*(blocks-1)**.5

data = np.random.random(10000000)

start = time.time()
print(jackknife2(np.std, data, 20))
print(jackknife2(np.median, data, 20))
print(jackknife2(lambda x: 1 / np.mean(x), data, 20))
print("took", time.time() - start, "s")

start = time.time()
print(jackknife3(np.std, data, 20))
print(jackknife3(np.median, data, 20))
# this version doesn't accept lambdas because of multiprocessing
def inv_mean(x):
    return 1 / np.mean(x)
print(jackknife3(inv_mean, data, 20))
print("took", time.time() - start, "s")

start = time.time()
print(jackknife(np.std, data, 20, conf_axis=0))
print(jackknife(np.median, data, 20, conf_axis=0))
# this version doesn't accept lambdas because of multiprocessing
def inv_mean(x):
    return 1 / np.mean(x)
print(jackknife(inv_mean, data, 20, conf_axis=0))
print("took", time.time() - start, "s")
redweasel commented 7 months ago

The code in this toolbox handles quite general functions: it accepts functions that return tuples of ndarrays. That is a tricky combination, because an array of tuples of ndarrays can only be flattened into a single ndarray if the ndarrays in the tuple all have the same shape, and the tests contain cases where they do not. A fast version of the jackknife that passes the tests is the following:

def jackknife(f, data, numb_blocks=20, conf_axis=1, nproc=None, args=()):
    if numb_blocks <= 1: # this is how David decided to specify using blocks of size 1
        # TODO test this better, as the current test using "simple_mean" could pass by accident
        J = [f(data[i], *args) for i in range(len(data))]
        total = f(data, *args)
        if type(J[0]) is tuple:
            # transpose J
            J = list(zip(*J)) # slow!
            # now compute the mean for each one separately
            return [numb_blocks * np.array(t) - (numb_blocks - 1) * np.mean(j, axis=0) for j, t in zip(J, total)],\
                [np.std(j, axis=0, ddof=1) for j in J]
        else:
            J = np.array(J)
            return numb_blocks * np.array(total) - (numb_blocks - 1) * np.mean(J, axis=0), np.std(J, axis=0, ddof=1)
        # the following doesn't work for some reason...
        # the difference is in np.std(J, axis=0, ddof=1) != np.std(j, axis=0)*(numb_blocks - 1)**.5
        #numb_blocks = np.shape(data)[conf_axis]
    data = np.asarray(data)
    n = data.shape[conf_axis]
    total = f(data, *args)
    block_id = np.linspace(0, numb_blocks, n, endpoint=False).astype(np.int32)
    J = [f(np.compress((block_id != i), data, axis=conf_axis), *args)
           for i in range(numb_blocks)]
    if type(J[0]) is tuple:
        # in David's code f can produce multiple differently sized outputs
        # -> transpose J
        J = list(zip(*J))
        # now compute the mean for each one separately
        return [numb_blocks * np.array(t) - (numb_blocks - 1) * np.mean(j, axis=0) for j, t in zip(J, total)],\
            [np.std(j, axis=0)*(numb_blocks - 1)**.5 for j in J]
    else:
        J = np.array(J)
        return numb_blocks * np.array(total) - (numb_blocks - 1) * np.mean(J, axis=0), np.std(J, axis=0)*(numb_blocks - 1)**.5

This performs almost as well as jackknife2 above, but has all the features. I'm not sure about the behavior for numb_blocks=1, since the test only checked the mean and I didn't try a more complicated function, so it could very well be wrong. I will likely not come back to this issue anytime soon, so if you want this in the toolbox, you will have to take it from here. The most important thing for performance is avoiding copies as much as possible, as well as not materializing Python iterators into lists, e.g. with list(range(...)).
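
As a rough usage sketch of the tuple branch (my own example, with a hypothetical estimator mean_and_acov that is not part of the toolbox tests): an estimator returning two arrays of different shapes cannot be stacked into a single ndarray, which is exactly the case the zip(*J) transpose above handles.

import numpy as np

# Hypothetical estimator returning a tuple of differently shaped results:
# the sample mean (scalar) and the first three autocovariances (length-3 array).
def mean_and_acov(x):
    x = np.asarray(x)
    m = np.mean(x)
    acov = np.array([np.mean((x[:len(x) - k] - m) * (x[k:] - m)) for k in range(3)])
    return m, acov

data = np.random.random(100000)
est, err = jackknife(mean_and_acov, data, numb_blocks=20, conf_axis=0)
# est and err are lists with one entry per element of the returned tuple.
print(est[0], err[0])  # jackknife estimate and error of the mean
print(est[1], err[1])  # jackknife estimates and errors of the autocovariances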

clarkedavida commented 7 months ago

Alright, thank you again for catching this slowdown and looking into it. I will try to adapt this to the current jackknife.

clarkedavida commented 7 months ago

Done. Thanks, Henrik.