I figured out why we were losing records in the CSV file.

tl;dr: all 'music'-category videos were being dropped by a bad pandas join.

Why: when I read the category table from the text file, the code treated the first line as a header row, so the first actual data row ('music') was consumed as column names and then overwritten with the real header, silently dropping it. The subsequent join therefore had no 'music' entry to match against, and every music video fell out of the result.
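A minimal sketch of the failure mode (hypothetical file contents and column names, not our actual code):

```python
import io
import pandas as pd

# The category file has no header row, so pandas' default header
# inference consumes the first data row ('music') as column names.
# Renaming the columns afterwards then silently discards that row.
category_txt = "1\tmusic\n2\tsports\n3\tnews\n"

# Buggy read: the first row becomes the header and is lost.
buggy = pd.read_csv(io.StringIO(category_txt), sep="\t")
buggy.columns = ["category_id", "category_name"]  # overwrite with the "real" header
assert len(buggy) == 2  # 'music' is gone

# Fixed read: tell pandas there is no header and supply names explicitly.
fixed = pd.read_csv(
    io.StringIO(category_txt),
    sep="\t",
    header=None,
    names=["category_id", "category_name"],
)
assert len(fixed) == 3  # all three categories survive
```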
Effect on our experiments:

This is why we lost:
- approx. 500 records from the training/validation sets combined
- approx. 250 records from the test set

All of these would have been `category == music` videos.
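For future reference, this kind of silent loss is easy to catch at join time. A sketch (hypothetical column names) of how an inner join hides the dropped rows, and how `indicator=True` surfaces them:

```python
import pandas as pd

# Toy data: two videos reference category_id 1 ('music'), which is
# missing from the (truncated) category table.
videos = pd.DataFrame({"video_id": ["a", "b", "c"], "category_id": [1, 1, 2]})
categories = pd.DataFrame({"category_id": [2], "category_name": ["sports"]})

# Inner join: unmatched videos vanish without any warning.
joined = videos.merge(categories, on="category_id", how="inner")
assert len(joined) == 1  # two music videos silently dropped

# Left join with indicator=True makes the loss visible instead.
checked = videos.merge(categories, on="category_id", how="left", indicator=True)
dropped = checked[checked["_merge"] == "left_only"]
assert len(dropped) == 2  # the dropped music videos, now detectable
```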
Normally this would mean poor test performance on "music" videos, since we never fine-tuned on them. But because "music" videos were excluded from the test set as well, hopefully the impact is limited to what you'd expect from simply having less data overall.
Moving forward
We should probably talk about how we want to handle this. We could consider re-finetuning and repeating our experiments... Or we could just say in our paper that we ran on a reduced dataset.
As for this PR, it fixes the bug... but we might not want to merge it if we decide not to repeat our experiments.