Evan/fix missing records bug

uoefeb commented 1 year ago

Hey @SanniM3 & @ip342 ,

I figured out why we were losing records in the CSV file. TL;DR: all 'music' category videos were being dropped by a bad pandas join. Why: when I read the category table from the text file, I accidentally dropped its first row, which was music. The code treated the first line as a header and then overwrote it with the actual header, so music never made it into the category table and the music videos had nothing to join against.
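For anyone who wants to see the failure mode in isolation, here is a minimal sketch of it. This is not the project code: the inline file contents, the column names (`category_id`, `category`) and the tiny `videos` frame are all made up for illustration, and it assumes the category file is tab-separated with no header row.

```python
import io
import pandas as pd

# Hypothetical stand-in for the category text file: no header row, "music" first.
raw = "0\tmusic\n1\tsports\n2\tnews\n"

# Buggy pattern: the default header=0 consumes the "music" line as column names,
# and assigning the real names afterwards does not bring that row back.
buggy = pd.read_csv(io.StringIO(raw), sep="\t")
buggy.columns = ["category_id", "category"]

# Fixed pattern: declare that the file has no header and supply the names up front.
fixed = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["category_id", "category"]
)

# Hypothetical video table; two of the three videos are in category 0 (music).
videos = pd.DataFrame({"video_id": ["v1", "v2", "v3"], "category_id": [0, 1, 0]})

# merge defaults to an inner join, so videos whose category is missing
# from the category table silently disappear from the result.
lost = videos.merge(buggy, on="category_id")
kept = videos.merge(fixed, on="category_id")
print(len(lost), len(kept))
```

With the buggy table only the non-music video survives the join; with the fixed table all three do.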

Effect on our experiments:

This is why we lost:

approx 500 records from the training/validation sets combined
approx 250 records from the test set.

All of these would have been category==music videos.
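One way to double-check that claim would be a left join with pandas' `indicator=True`, which labels the rows that found no match in the category table. A rough sketch, again with made-up frames and column names rather than our actual data:

```python
import pandas as pd

# Hypothetical data: category 0 (music) is missing from the category table.
videos = pd.DataFrame(
    {"video_id": ["v1", "v2", "v3", "v4"], "category_id": [0, 1, 0, 2]}
)
categories = pd.DataFrame(
    {"category_id": [1, 2], "category": ["sports", "news"]}
)

# A left join keeps every video; the _merge column records whether a match was found.
merged = videos.merge(categories, on="category_id", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

# Number of videos an inner join would have dropped, and which categories they belong to.
print(len(unmatched))
print(sorted(unmatched["category_id"].unique()))
```

If the only category id that shows up as unmatched is the music one, the lost records are all music videos.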

On its own, this would suggest poor test performance on "music" videos, because we never fine-tuned on them. But since "music" videos were excluded from the test set as well, hopefully the impact is limited to what you would expect from simply having less data overall.

Moving forward

We should probably talk about how we want to handle this. We could consider re-finetuning and repeating our experiments... Or we could just say in our paper that we ran on a reduced dataset.

As for this PR, it fixes the bug... but we might not want to merge it if we decide not to repeat our experiments.

uoefeb commented 1 year ago

Update: please see the message on Teams. I think we might as well merge this, all things considered.

uoefeb commented 1 year ago

I'm going to go ahead and merge this.

SanniM3 / video_summarisation_git

Evan/fix missing records bug #22
