ZFTurbo / KAGGLE_DISTRACTED_DRIVER

Solutions

if driver_id has been shuffled? #6

Open hill-hu opened 7 years ago

hill-hu commented 7 years ago

Thank you for sharing! I read the code in kaggle_distracted_drivers_vgg16.py. When loading the train imgs, train_data, train_target and driver_id are in the same order, but the 'read_and_normalize_train_data' method then shuffles them as follows:

# Shuffle experiment START !!!
perm = permutation(len(train_target))
train_data = train_data[perm]
train_target = train_target[perm]
# Shuffle experiment END !!!

train_data and train_target have been shuffled, but driver_id has not. That means they no longer map to each other as before.

Will this be a problem? Thanks again
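The mismatch the question describes can be shown with a standalone toy sketch (the names mirror the script, but the data here is made up for illustration):

```python
import numpy as np
from numpy.random import permutation

# Hypothetical stand-ins for the script's arrays (sizes are made up).
train_data = np.arange(5) * 10                        # pretend images
train_target = np.arange(5)                           # pretend labels
driver_id = ['p002', 'p012', 'p014', 'p015', 'p016']  # one driver per row

np.random.seed(0)
perm = permutation(len(train_target))
train_data = train_data[perm]
train_target = train_target[perm]
# driver_id is NOT permuted, so driver_id[i] no longer describes row i.

# data and target stay aligned with each other (same permutation applied) ...
aligned = all(train_data[i] == train_target[i] * 10 for i in range(5))
# ... but driver_id no longer lines up with the shuffled rows.
mismatch = list(np.array(driver_id)[perm]) != driver_id
print(aligned, mismatch)
```

So training itself is unaffected, but anything that indexes driver_id by row position after the shuffle is looking at the wrong driver.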

ZFTurbo commented 7 years ago

For some reason, shuffling the drivers leads to a better LB score. After this line, driver_id becomes useless.

eugeneware commented 7 years ago

Thanks @ZFTurbo also for sharing your work! I've been looking at it for several days, trying to replicate the results.

I think the issue is that you're shuffling the training data and target, but not shuffling the driver_id.

In run_cross_validation_create_models() you see this code:

train_data, train_target, train_id, driver_id, unique_drivers = read_and_normalize_train_data()

with the shuffling, driver_id no longer matches up with the shuffled train_data and train_target. So, when we run the code below:

unique_list_train = [unique_drivers[i] for i in train_drivers]
X_train, Y_train, train_index = copy_selected_drivers(train_data, train_target, driver_id, unique_list_train)
unique_list_valid = [unique_drivers[i] for i in test_drivers]
X_valid, Y_valid, test_index = copy_selected_drivers(train_data, train_target, driver_id, unique_list_valid)

It's copying across the wrong data, as driver_id is not correct.

As a result, I think this breaks the cross-validation and leaks drivers from the training data into the validation data, which artificially reduces the training and validation loss.

When I fix it with the code below, I struggle to get validation accuracy over 90% with the model (here are my changes):

# Shuffle experiment START !!!
np.random.seed(42)
perm = permutation(len(train_target))
train_data = train_data[perm]
train_target = train_target[perm]
train_id = [train_id[i] for i in perm]
driver_id = [driver_id[i] for i in perm]
# Shuffle experiment END !!!

In the copy code:

def copy_selected_drivers(train_data, train_target, driver_id, driver_list):
    data = []
    target = []
    index = []
    for i in range(len(driver_id)):
        if driver_id[i] in driver_list:
            data.append(train_data[i])
            target.append(train_target[i])
            index.append(i)
    data = np.array(data)
    target = np.array(target)
    index = np.array(index)
    return data, target, index

we can see that train_data and train_target will stay matched up, as they were shuffled with the same permutation, so training itself still works. But driver_id no longer corresponds to the rows, so this function, which tries to copy the rows belonging to drivers in driver_list, will copy the wrong rows.

Thus the cross-validation guarantee that a driver never appears in both training and validation is broken, which is why the training and validation accuracy come out higher.

Unless I'm misunderstanding the code?
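For reference, a leak-free driver-wise split can be sketched like this (a standalone toy example I made up, not code from the repo):

```python
import numpy as np

# Toy data: one driver label per training row (ids are illustrative).
driver_id = np.array(['p002', 'p002', 'p012', 'p012', 'p014', 'p014'])
train_drivers = {'p002', 'p012'}   # hypothetical fold assignment
valid_drivers = {'p014'}

# Every row of a driver goes entirely to either train or validation.
train_index = np.array([i for i, d in enumerate(driver_id) if d in train_drivers])
valid_index = np.array([i for i, d in enumerate(driver_id) if d in valid_drivers])

# No driver appears on both sides, so validation measures generalisation
# to unseen drivers rather than memorisation of seen ones.
overlap = set(driver_id[train_index]) & set(driver_id[valid_index])
print(overlap)  # set()
```

sklearn's GroupKFold implements the same idea with driver_id as the group labels.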

ZFTurbo commented 7 years ago

Note: It's old and dirty code.

You can remove the shuffling part (everything from # Shuffle experiment START !!! up to # Shuffle experiment END !!!). It intentionally mixes drivers between the train and valid parts, since in this particular problem that increased accuracy on the test set. Validation didn't work well here either way.

If I were writing the same code now, I'd add hard augmentations and rewrite it with the fit_generator function to save memory. Also, there is not enough data here to train from scratch; only pretrained models can reach acceptable accuracy.
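The memory-saving idea behind fit_generator can be sketched with a plain Python generator (my own minimal illustration, not code from the repo; a real Keras generator would also apply augmentation inside the loop):

```python
import numpy as np

def batch_generator(X, y, batch_size):
    """Yield (X_batch, y_batch) pairs forever, reshuffling each epoch.

    Only one batch is copied into memory at a time, instead of
    materialising the whole (augmented) training set up front.
    """
    n = len(X)
    while True:
        perm = np.random.permutation(n)   # fresh order every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            yield X[idx], y[idx]

# Toy usage with made-up data.
X = np.arange(10).reshape(10, 1)
y = np.arange(10)
gen = batch_generator(X, y, batch_size=4)
xb, yb = next(gen)
print(xb.shape, yb.shape)  # (4, 1) (4,)
```

Keras's model.fit_generator (model.fit in modern versions) consumes exactly this kind of endless generator, pulling batches on demand during training.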

eugeneware commented 7 years ago

Thanks @ZFTurbo for the insight. I've been scratching my head about this competition for the past few weeks. A lot of people say not to mix drivers between the train and validation sets, but as you say, you seem to get better results! I've even tried augmentation, Hide-and-Seek, pseudo-labeling, and CAM heat maps, but they all seem to do much better when the validation set is mixed.

Thanks for your reply. Love your work, and thanks for sharing your code too! It was very educational and great for helping people learn about machine learning 👌