abjer / sds

Social Data Science - a summer school course
https://abjer.github.io/sds

Multiprocessing Pool doesn't work on Windows 10 #28

Closed KarlTjensvoll closed 6 years ago

KarlTjensvoll commented 6 years ago

I can use Pool to parallelize my work on my MacBook, but when I run the same code on my more powerful Windows machine to speed it up, it does not work. None of the cores start doing any work; I let it run for 10 minutes on a sample of 100, but nothing happened.

So I use this code:

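# Imports inferred from the names used below.
import pandas as pd
from datetime import datetime as dt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# X_dev, y_dev, kfolds and im are defined elsewhere (not shown).
def tree_paralel(x):
    # Cross-validate a decision tree with max_depth=x; return one accuracy per fold.
    tree = DecisionTreeClassifier(criterion="gini", max_depth=x, random_state=1)
    accuracy_ = []
    for train_idx, val_idx in kfolds.split(X_dev, y_dev):

        X_train, y_train = X_dev.iloc[train_idx], y_dev.iloc[train_idx]
        X_val, y_val = X_dev.iloc[val_idx], y_dev.iloc[val_idx]

        # Fit the transformer on the training fold only, then apply it to the validation fold.
        X_train = pd.DataFrame(im.fit_transform(X_train), index=X_train.index)
        X_val = pd.DataFrame(im.transform(X_val), index=X_val.index)
        tree.fit(X_train, y_train)
        y_pred = tree.predict(X_val)
        accuracy_.append(accuracy_score(y_val, y_pred))
    print("This was the "+str(x)+" iteration", (dt.now() - start).total_seconds())
    return accuracy_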

and then run:

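from multiprocessing import Pool

start = dt.now()
p = Pool(4)

input_ = range(1, 11)
output_ = []
# imap yields one list of fold accuracies per max_depth, in input order.
for result in p.imap(tree_paralel, input_):
    output_.append(result)
p.close()
p.join()  # wait for the worker processes to exit
# Average over folds and pick the depth with the highest mean accuracy.
temp = pd.DataFrame(output_).mean(axis=1)
temp.index = input_
optimal_t = temp.nlargest(1)
print("Time:", (dt.now() - start).total_seconds())
print("Optimal hyperparameter: " + str(optimal_t.index[0]) + " with accuracy: " + str(optimal_t.values))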

abjer commented 6 years ago

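On Windows, each worker process starts by importing the main script again, so code that creates a Pool must not run on import. One fix is to wrap it in a main guard: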

if __name__ == '__main__': 
    YOUR_SCRIPT()

This fix is explained in detail here. If you want to learn more about the underlying difference between the platforms, you can read about it here.
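As a rough illustration of the guard, here is a minimal, self-contained sketch. It is not the code from the question: `mean_score` is a hypothetical stand-in for `tree_paralel` that returns a dummy score, and `run_search` is an illustrative helper name. The point is only the layout: the worker function lives at module level, and everything that starts the pool sits behind the guard.

```python
import multiprocessing as mp


def mean_score(depth):
    """Hypothetical stand-in for tree_paralel: a dummy score that peaks at depth 4."""
    return 1.0 - abs(depth - 4) / 10


def run_search(depths, processes=4):
    """Score every depth in a pool of workers; return (best depth, all scores)."""
    with mp.Pool(processes) as pool:
        scores = pool.map(mean_score, depths)  # blocks until all results are in
    best_depth = max(zip(scores, depths))[1]
    return best_depth, scores


if __name__ == '__main__':
    # Only the parent process runs this block. Child processes that re-import
    # this file (the "spawn" start method, the only one available on Windows)
    # skip it, so they do not try to create pools of their own.
    print("Start method:", mp.get_start_method())
    best_depth, scores = run_search(range(1, 11))
    print("Optimal hyperparameter:", best_depth)
```

Saved as a script and run with `python`, this should behave the same on Windows and macOS. In a notebook or interactive session the guard alone may not be enough, because the Python documentation notes that Pool needs the `__main__` module to be importable by the children. A common workaround is to move the worker function into a separate `.py` file and import it.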

To get consistent behavior across platforms, I recommend the ipyparallel module, although it is somewhat more difficult to use.