hill-hu opened this issue 7 years ago
For some reason, shuffling the drivers leads to a better LB score. After this line, `driver_id` becomes useless.
Thanks @ZFTurbo also for sharing your work! I've been looking at it for several days, trying to replicate the results.
I think the issue is that you're shuffling the training data and target, but not shuffling `driver_id`.
In `run_cross_validation_create_models()` you see this code:

```python
train_data, train_target, train_id, driver_id, unique_drivers = read_and_normalize_train_data()
```
With the shuffling, `driver_id` will no longer match up with `train_data` and `train_target`, which have been shuffled. So, when we run the code below:
```python
unique_list_train = [unique_drivers[i] for i in train_drivers]
X_train, Y_train, train_index = copy_selected_drivers(train_data, train_target, driver_id, unique_list_train)
unique_list_valid = [unique_drivers[i] for i in test_drivers]
X_valid, Y_valid, test_index = copy_selected_drivers(train_data, train_target, driver_id, unique_list_valid)
```
it's copying across the wrong data, as `driver_id` is not correct.
As a result, I think this is breaking the cross-validation: drivers from the training data leak into the validation data, which artificially reduces the training and validation loss.
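To see the misalignment concretely, here is a small self-contained sketch (toy arrays, not the competition data; the names are made up for illustration). A fixed permutation stands in for the random shuffle, and the selection is done with the stale, unshuffled `driver_id` list, the same way `copy_selected_drivers` does:

```python
import numpy as np

# Toy stand-ins: row i of `data` belongs to driver `driver_id[i]`.
data = np.arange(6)
driver_id = ['p002', 'p002', 'p002', 'p012', 'p012', 'p012']

# A fixed permutation playing the role of the random shuffle.
perm = np.array([3, 4, 5, 0, 1, 2])
shuffled_data = data[perm]   # data is shuffled, driver_id is NOT

# Select "driver p002" rows using the stale driver_id list.
selected = [int(shuffled_data[i]) for i in range(6) if driver_id[i] == 'p002']

# The true owner of shuffled row i is driver_id[perm[i]], not driver_id[i].
true_owners = [driver_id[p] for p in perm[:3]]

print(selected)      # [3, 4, 5]
print(true_owners)   # ['p012', 'p012', 'p012'] -- every "p002" row is really p012's
```

So the fold that was supposed to contain only driver p002's images actually contains driver p012's images, which is exactly the leakage described above.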
When I fix it with the code below, I struggle to get validation accuracy over 90% with the model (here are my changes):
```python
import numpy as np
from numpy.random import permutation

# Shuffle experiment START !!!
np.random.seed(42)
perm = permutation(len(train_target))
train_data = train_data[perm]
train_target = train_target[perm]
train_id = [train_id[i] for i in perm]
driver_id = [driver_id[i] for i in perm]  # the fix: shuffle driver_id with the same permutation
# Shuffle experiment END !!!
```
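A quick sanity check (toy data, not the real arrays) that applying one shared permutation keeps data, target, and driver aligned:

```python
import numpy as np
from numpy.random import permutation

# Toy stand-ins: row i has data 10*i, label i, and driver 'p<i>'.
train_data = np.arange(5) * 10
train_target = np.arange(5)
driver_id = ['p%d' % i for i in range(5)]

np.random.seed(42)
perm = permutation(len(train_target))
train_data = train_data[perm]
train_target = train_target[perm]
driver_id = [driver_id[i] for i in perm]

# After the shared shuffle, row i still describes one sample:
# data, label, and driver all agree, whatever the permutation was.
aligned = all(
    train_data[i] == train_target[i] * 10 and driver_id[i] == 'p%d' % train_target[i]
    for i in range(5)
)
print(aligned)  # True
```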
In the copy code:
```python
def copy_selected_drivers(train_data, train_target, driver_id, driver_list):
    data = []
    target = []
    index = []
    for i in range(len(driver_id)):
        if driver_id[i] in driver_list:
            data.append(train_data[i])
            target.append(train_target[i])
            index.append(i)
    data = np.array(data)
    target = np.array(target)
    index = np.array(index)
    return data, target, index
```
we can see that `train_data` and `train_target` are matched up, since they were shuffled with the same permutation, so the model will still train correctly on them. But `driver_id` is not aligned with them. So this function, which tries to copy the rows belonging to the drivers in `driver_list`, will copy the wrong rows, because the driver labels are incorrect.
Thus the cross-validation guarantee that no driver occurs in both training and validation is broken, which is why the training and validation accuracy come out higher.
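For reference, the invariant being violated here is easy to state and check in code. scikit-learn's `GroupKFold` provides driver-disjoint folds out of the box; a minimal hand-rolled sketch of the same idea (toy data, hypothetical helper name) looks like:

```python
import numpy as np

def driver_split(driver_id, valid_drivers):
    """Return (train_idx, valid_idx) so that no driver spans both folds."""
    driver_id = np.asarray(driver_id)
    valid_mask = np.isin(driver_id, list(valid_drivers))
    return np.where(~valid_mask)[0], np.where(valid_mask)[0]

drv = ['p002', 'p002', 'p012', 'p012', 'p014', 'p014']
train_idx, valid_idx = driver_split(drv, {'p014'})

# The cross-validation invariant: the two folds share no drivers.
assert set(np.asarray(drv)[train_idx]).isdisjoint(np.asarray(drv)[valid_idx])
print(train_idx.tolist(), valid_idx.tolist())  # [0, 1, 2, 3] [4, 5]
```

The misaligned `driver_id` in the original code makes this assertion fail in spirit: rows from a "validation" driver end up in the training fold.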
Unless I'm misunderstanding the code?
Note: It's old and dirty code.
You can remove the shuffling part (everything from `# Shuffle experiment START !!!` up to `# Shuffle experiment END !!!`). It intentionally mixes drivers between the train and validation parts, since in this particular problem that increased accuracy on the test set. Validation didn't work well here either way.
If I wrote the same code now, I'd add hard augmentations and rewrite it with a `fit_generator` function to save memory. Also, there is not enough data to train from scratch; only pretrained models can give acceptable accuracy here.
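A memory-saving generator along those lines might look like the sketch below (plain Python/numpy; the old Keras `model.fit_generator(...)` call itself is omitted). Here `X` is an in-memory array for simplicity, but the same loop shape lets you load and augment images lazily per batch instead of holding the full set in RAM:

```python
import numpy as np

def batch_generator(X, y, batch_size=32, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) forever, reshuffling at each epoch."""
    rng = np.random.RandomState(seed)
    n = len(X)
    while True:
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], y[idx]

# Usage: iterate batches without materializing the whole epoch at once.
X = np.arange(10).reshape(10, 1)
y = np.arange(10)
gen = batch_generator(X, y, batch_size=4, shuffle=False)
xb, yb = next(gen)
print(xb.shape, yb.tolist())  # (4, 1) [0, 1, 2, 3]
```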
Thanks @ZFTurbo for the insight. I've been scratching my head about this competition for the past few weeks. A lot of people say not to mix drivers across the validation sets, but as you say, you seem to get better results when you do! I've even tried augmentation, Hide-and-Seek, pseudo-labeling, and CAM heat maps, but they all seem to do much better when the validation set is mixed.
Thanks for your reply. Love your work, and thanks for sharing your code too! It was very educational and great for helping people learn about machine learning 👌
Thank you for sharing! I read the code in kaggle_distracted_drivers_vgg16.py. When loading the training images, `train_data`, `train_target`, and `driver_id` are in the same order, but if the `read_and_normalize_train_data` method is called as follows:
`train_data` and `train_target` have been shuffled, but `driver_id` has not. That means they are no longer mapped to each other as before.
Will this be a problem? Thanks again