choshin84 commented 4 years ago

Tweet summary

On average, 1/e ~ 36.8% of the data will be OOB (out-of-bag) each time the bootstrap is run, so it is recommended to repeat it multiple times (>10) to make sure practically all of the data gets used.

Experiment code

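import statistics
import time

import numpy as np
import numpy.random as rd  # assumed: rd is numpy.random (randint takes low, high, size)

t0 = time.time()
n_data = 10000

#plt.figure(figsize=(16,8))
for n_iter in [2, 3, 4, 5, 7, 10]:
    print("# of repeat:\t", n_iter)
    result = np.zeros((n_iter, n_data), dtype=int)
    OOBs = []      # number of OOB points in each bootstrap run
    sampled = []   # unique values drawn in each run (in-bag, not OOB)
    for i in range(n_iter):
        rd.seed(i)
        # draw n_data values from 1..n_data with replacement
        result[i] = rd.randint(1, n_data+1, n_data)
        # unique values of this run only (no need to recompute earlier rows)
        temp = np.unique(result[i]).tolist()
        OOBs.append(n_data - len(temp))
        sampled.extend(temp)
    print("Average OOB:\t", '{0:.2f}'.format(statistics.mean(OOBs)/n_data*100), "%")
    # points that were never drawn in any of the n_iter runs
    print("Total OOB:\t", '{0:.2f}'.format((n_data - len(set(sampled)))/n_data*100), "%")
    #print(int(time.time() - t0), 'sec')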

choshin84 commented 4 years ago

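When choosing M data points with replacement from a universe of N data points, the chance that x_i is NOT chosen is ((N-1)/N)^M = (1 - 1/N)^M. When M = N, this converges to (1 - 1/N)^N ~ 1/e = 0.3678... as N grows. Repeating the bootstrap k times means M = kN draws in total, so the chance becomes about e^(-k), which is close to zero when M >> N (k = 10 gives about 0.0045%).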
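To check the formula numerically, here is a minimal sketch that evaluates (1 - 1/N)^M directly and compares it with e^(-k) for M = kN. The helper name `prob_never_chosen` and the chosen values of k are only for illustration, not part of the experiment code above.

```python
import math

def prob_never_chosen(n: int, m: int) -> float:
    """Chance that one fixed point is missed in m draws with replacement from n points."""
    return (1.0 - 1.0 / n) ** m

n = 10_000  # same universe size as in the experiment
for k in [1, 2, 3, 5, 10]:
    exact = prob_never_chosen(n, k * n)   # M = k * N total draws
    approx = math.exp(-k)                 # large-N limit
    print(f"k={k:2d}  (1-1/N)^(kN)={exact:.6f}  e^-k={approx:.6f}")
```

For k = 1 both columns should sit near 0.3678, and they shrink quickly as k grows, which matches the "Total OOB" trend in the experiment.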

choshin84 / learning_memo

Bootstrap OOB% #56
