So after recent discussion, we decided to start by implementing the first "hybrid" approach. Concretely, this will involve updating this method
https://github.com/clamsproject/app-swt-detection/blob/dede3efa0504ffb5cd8f350d4e55ee45e036a14d/modeling/data_loader.py#L95
to use a constant value for n_pos set to a high enough number (we can empirically obtain the best size for this by hyperparameterizing it),
then, updating
https://github.com/clamsproject/app-swt-detection/blob/dede3efa0504ffb5cd8f350d4e55ee45e036a14d/modeling/data_loader.py#L83-L85
to add self.pos_abs_th_front and self.pos_abs_th_end attributes that configure the first N minutes (and last M minutes) of absolute lookup,
and finally, updating
https://github.com/clamsproject/app-swt-detection/blob/dede3efa0504ffb5cd8f350d4e55ee45e036a14d/modeling/data_loader.py#L119
to look up the positional vector using the self.pos_abs_... attributes. Specifically, something like this:

```python
if cur_time < self.pos_abs_th_front or tot_time - cur_time < self.pos_abs_th_end:
    pos_lookup_col = cur_time
else:
    pos_lookup_col = cur_time / tot_time * self.pos_vec_lookup.shape[1]
pos_vec = self.pos_vec_lookup[pos_lookup_col]
```
In addition to that, it'd be a good idea to add one more argument to the encode_position method to regularize the impact of positional encoding. Something like this:

```python
def encode_position(self, cur_time, tot_time, img_vec, pos_vec_coeff):
    ...
    pos_vec = self.pos_vec_lookup[pos_lookup_col] * pos_vec_coeff
```
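For context, here is a minimal self-contained sketch of how these pieces could fit together. The table construction, class name, and default values are illustrative assumptions; only pos_abs_th_front, pos_abs_th_end, pos_vec_coeff, pos_vec_lookup, and encode_position mirror names from the proposal above. It assumes cur_time and tot_time are in minutes (i.e., pos_unit = 60000 ms):

```python
import numpy as np

def build_sinusoidal_table(n_pos, dim):
    """Standard sinusoidal positional-encoding table of shape (n_pos, dim)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    table = np.zeros((n_pos, dim))
    table[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions: sine
    table[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions: cosine
    return table

class HybridPositionalEncoder:
    def __init__(self, n_pos=512, dim=256, pos_abs_th_front=3, pos_abs_th_end=10):
        # n_pos should be high enough to cover the longest expected video (in minutes)
        self.pos_vec_lookup = build_sinusoidal_table(n_pos, dim)
        self.pos_abs_th_front = pos_abs_th_front  # absolute lookup for the first N minutes
        self.pos_abs_th_end = pos_abs_th_end      # absolute lookup for the last M minutes

    def encode_position(self, cur_time, tot_time, img_vec, pos_vec_coeff):
        if cur_time < self.pos_abs_th_front or tot_time - cur_time < self.pos_abs_th_end:
            # near the start or end of the video: use the absolute position
            pos_lookup_col = int(cur_time)
        else:
            # in the middle: scale the relative position into the table's rows
            # (indices must be integers)
            pos_lookup_col = int(cur_time * len(self.pos_vec_lookup) // tot_time)
        pos_vec = self.pos_vec_lookup[pos_lookup_col] * pos_vec_coeff
        return img_vec + pos_vec  # "sinusoidal-add" style combination
```

With pos_abs_th_front = 3, for example, a frame at minute 2 always maps to row 2 of the table regardless of video length, while a frame at the midpoint of any video maps to the middle of the table.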
With the three new hyperparameters pos_abs_th_front, pos_abs_th_end, and pos_enc_coeff, I conducted a gridsearch with the following hyperparameter values:
```
num_splits = {20}
num_epochs = {10}
num_layers = {4}
pos_enc_name = {"sinusoidal-add"}
input_length = {6000000}
pos_unit = {60000}
pos_enc_dim = {256}
dropouts = {0.1}
img_enc_name = {"convnext_lg"}
pos_abs_th_front = {0, 3, 5, 10}
pos_abs_th_end = {0, 3, 5, 10}
pos_enc_coeff = {1, 0.75, 0.5, 0.25}
```
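For concreteness, the three new hyperparameters alone span 4 × 4 × 4 = 64 combinations. A minimal sketch of how such a grid could be expanded (the train_and_evaluate call is a hypothetical placeholder, not an actual script entry point):

```python
from itertools import product

# the three new hyperparameters from the grid above
grid = {
    "pos_abs_th_front": [0, 3, 5, 10],
    "pos_abs_th_end": [0, 3, 5, 10],
    "pos_enc_coeff": [1, 0.75, 0.5, 0.25],
}

# cross product of all values: 4 * 4 * 4 = 64 configurations
configs = [dict(zip(grid.keys(), values)) for values in product(*grid.values())]
assert len(configs) == 64

for cfg in configs:
    # train_and_evaluate(cfg)  # hypothetical: combine with the fixed parameters above
    pass
```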
Using see_results.py to retrieve visualizations of every possible hyperparameter configuration, I looked through each label's results to determine which configuration gives the best F1-score. Some labels had particularly low F1-scores, so I decided to focus on B and I. Below are the compiled F1-score results for labels B and I, put into spreadsheets.
From the image above, it seems that some of the highest F1-scores for label B result when pos_abs_th_front is 0 or 10 and pos_abs_th_end is 3 or 10. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.5, 0.75, or 1.
Here are the plots for the above-mentioned configurations retrieved from running see_results.py on label B:

| | pos_abs_th_end = 3 | pos_abs_th_end = 10 |
|---|---|---|
| pos_enc_coeff = 0.5 | The F1-score is 0.9296 when pos_abs_th_front = 0 and 0.9357 when pos_abs_th_front = 10. | The F1-score is 0.9394 when pos_abs_th_front = 0 and 0.9359 when pos_abs_th_front = 10. |
| pos_enc_coeff = 0.75 | The F1-score is 0.9444 when pos_abs_th_front = 0 and 0.9458 when pos_abs_th_front = 10. | The F1-score is 0.9450 when pos_abs_th_front = 0 and 0.9417 when pos_abs_th_front = 10. |
| pos_enc_coeff = 1 | The F1-score is 0.9356 when pos_abs_th_front = 0 and 0.9428 when pos_abs_th_front = 10. | The F1-score is 0.9384 when pos_abs_th_front = 0 and 0.9400 when pos_abs_th_front = 10. |
From the image above, it seems that some of the highest F1-scores for label I result when pos_abs_th_front is 0 and pos_abs_th_end is 5 or 10. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.75.
Here are the plots for the above-mentioned configurations retrieved from running see_results.py on label I:

| | pos_abs_th_end = 5 | pos_abs_th_end = 10 |
|---|---|---|
| pos_enc_coeff = 0.75 | The F1-score is 0.7654 when pos_abs_th_front = 0. | The F1-score is 0.7851 when pos_abs_th_front = 0. |
Wow, this is a helpful way of analyzing the results. From our domain knowledge, I suspect the labels most impacted by the positional information would be slates (S), which almost always occur in the first few minutes, and credits (C), which always occur toward the end. Considering that, here are some requests:

- can we see the same analysis for the C and S labels?
- can we compare pos_enc_name = {"none", "sinusoidal-add"} with all pos_enc-related parameters fixed?

Here are the F1-score results for labels S and C, compiled in spreadsheets, from the gridsearch performed previously.
From the image above, it seems that some of the highest F1-scores for label S result when pos_abs_th_front is 0 or 10 and pos_abs_th_end is 5. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.5 or 1.
Here are the plots for the above-mentioned configurations retrieved from running see_results.py on label S:

| | pos_abs_th_end = 5 |
|---|---|
| pos_enc_coeff = 0.5 | The F1-score is 0.5830 when pos_abs_th_front = 0 and 0.6666 when pos_abs_th_front = 10. |
| pos_enc_coeff = 1 | The F1-score is 0.5939 when pos_abs_th_front = 0 and 0.6009 when pos_abs_th_front = 10. |
From the image above, it seems that some of the highest F1-scores for label C result when pos_abs_th_front is 0 or 5 and pos_abs_th_end is 5. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.25 or 0.75.
| | pos_abs_th_end = 5 |
|---|---|
| pos_enc_coeff = 0.25 | The F1-score is 0.5153 when pos_abs_th_front = 0 and 0.5044 when pos_abs_th_front = 5. |
| pos_enc_coeff = 0.75 | The F1-score is 0.4998 when pos_abs_th_front = 0 and 0.5346 when pos_abs_th_front = 5. |
Using fixed values found from the previous observations for the three hyperparameters pos_abs_th_front, pos_abs_th_end, and pos_enc_coeff, I conducted a gridsearch with the following hyperparameter values:
```
num_splits = {20}
num_epochs = {10}
num_layers = {4}
pos_enc_name = {"none", "sinusoidal-add"}
input_length = {6000000}
pos_unit = {60000}
pos_enc_dim = {256}
dropouts = {0.1}
img_enc_name = {"convnext_lg"}
pos_abs_th_front = {0}
pos_abs_th_end = {5}
pos_enc_coeff = {0.75}
```
I once again used see_results.py to retrieve visualizations of the two possible hyperparameter configurations, each using a different pos_encoder set by pos_enc_name. Below are the plots retrieved from see_results.py for labels B, C, I, and S, to observe the performance difference between using sinusoidal-add and using no pos_encoder.
The results seem about equal when using sinusoidal-add or no pos_encoder.
While these numbers are quite low on their own, it is interesting to note that the recall for sinusoidal-add is 0.1 points higher than with no pos_encoder. This may suggest that sinusoidal-add is slightly better at detecting credit scenes than using no pos_encoder.
For label I, it seems that using sinusoidal-add is no better than using no pos_encoder. After observing the heat map for label I from the previous comment, it may be worth performing the gridsearch again but with pos_abs_th_end = 10 instead, since the resulting F1-score for the relevant configuration was about as high as using no pos_encoder this time around.
Again, these numbers are quite low to begin with and there is no significant difference, but it is interesting to note that sinusoidal-add performs better than using no pos_encoder across precision, recall, and F1-score, by about a 0.07 point difference. This may suggest that sinusoidal-add is slightly better at detecting scenes containing slates than using no pos_encoder.
So it looks like our hypothesis proves mostly true here, except that positional encoding "hurts" prediction performance for the I (chyron) class - I hypothesized that positional encoding wouldn't make a significant difference for classes that usually occur in the middle of the input stream. Maybe for the upcoming rounds of experiments, we can also try to see the impact of pos_enc in terms of input video lengths (durations), i.e., does pos_enc work better for 30-min videos than for 60-min ones?
The F1-scores worry me a bit, since some of them seem to be very close to the lowest of precision and recall. For example, for label S we have P=0.7083 and R=0.5910, but F=0.5964, where my back-of-the-napkin calculation has it at 0.64, which intuitively makes more sense to me.
In one case, label C, it is even below the lowest of P and R.
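For reference, the F1 implied by those averaged P and R values is their harmonic mean:

$$F_1 = \frac{2PR}{P+R} = \frac{2 \times 0.7083 \times 0.5910}{0.7083 + 0.5910} \approx 0.644$$

Since a harmonic mean always lies between min(P, R) and max(P, R), an aggregate F1 below min(P, R), as with label C, cannot be the harmonic mean of the averaged P and R.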
The line of code to retrieve the relative position was incorrect, so I have altered it and re-run the gridsearch using the same hyperparameters as in https://github.com/clamsproject/app-swt-detection/issues/100#issuecomment-2195556629. The following line
https://github.com/clamsproject/app-swt-detection/blob/43cc4d59a4a825c5fe3fc4e9a7fcc468cbffaf80/modeling/data_loader.py#L141
was changed to pos_lookup_col = cur_time * self.pos_vec_lookup.shape[0] // tot_time.
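To illustrate the difference, here is a hedged sketch (the lookup table layout of positions × dimensions is an assumption based on the corrected line):

```python
import numpy as np

pos_vec_lookup = np.zeros((100, 256))  # assumed layout: 100 positions x 256 dims
cur_time, tot_time = 45, 90            # minutes

# before: float result, scaled by the wrong axis (the encoding dimension)
# pos_lookup_col = cur_time / tot_time * pos_vec_lookup.shape[1]  # 128.0, a float
# pos_vec_lookup[pos_lookup_col]  # IndexError: only integers are valid indices

# after: integer division, scaled by the number of positions (rows)
pos_lookup_col = cur_time * pos_vec_lookup.shape[0] // tot_time   # 50
pos_vec = pos_vec_lookup[pos_lookup_col]
```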
I have also opted to recreate the visualizations with the correct F1-scores using Google Sheets following the F1 calculation issue found by @marcverhagen (I wasn't able to determine the source of the issue).
The results still suggest that sinusoidal-add can detect bars about as well as using no pos_enc.
The results now show a greater difference in favor of using sinusoidal-add, with an increase of 0.1 points compared to no pos_enc, suggesting that closing credits can be detected better with positional encoding.
The results still show that sinusoidal-add does not detect chyrons any better than no pos_enc, which is what was found previously.
The results now show a significant difference in favor of using sinusoidal-add, pushing the F1-score to 0.8, nearly a 0.25 point increase over using no pos_enc. This suggests that positional encoding performs fairly well for detecting slates.
My next plan is to perform gridsearch again with the configuration from https://github.com/clamsproject/app-swt-detection/issues/100#issuecomment-2192778483 to see if there is any improvement following the change in the script. The F1-scores will be more accurate in the next gridsearch report.
Regarding the unexpected range of F1 scores: this is because the result aggregation/plotting script calculates arithmetic means of the P, R, and F numbers from all k-fold rounds independently of each other.
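A toy illustration (with made-up numbers) of how averaging P, R, and F independently across folds can push the mean F1 below both the mean P and the mean R:

```python
# two hypothetical k-fold rounds with opposite precision/recall trade-offs
folds = [(1.0, 0.2), (0.5, 1.0)]  # (precision, recall) per fold

def f1(p, r):
    return 2 * p * r / (p + r)

mean_p = sum(p for p, _ in folds) / len(folds)         # 0.75
mean_r = sum(r for _, r in folds) / len(folds)         # 0.60
mean_f = sum(f1(p, r) for p, r in folds) / len(folds)  # (0.333 + 0.667) / 2 = 0.50

# the mean of per-fold F1s (0.50) ends up below both mean P (0.75) and mean R (0.60),
# while the harmonic mean of the averages would be ~0.667
print(mean_p, mean_r, mean_f, f1(mean_p, mean_r))
```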
The following are results from running the gridsearch using the same hyperparameters as in https://github.com/clamsproject/app-swt-detection/issues/100#issuecomment-2192778483, following the change in the script. The format is the heatmap created in spreadsheets as before, and the values shown are the average F1-scores, all retrieved from the visualization outputs of see_results.py.
While the differences are very minimal, it seems that some of the highest F1-scores result when pos_abs_th_front is 3 or 5 and pos_abs_th_end is 5 or 10. Among those scores, pos_enc_coeff seems to result in higher scores when its value is set to 0.5.
Compared to the results found in https://github.com/clamsproject/app-swt-detection/issues/100#issuecomment-2195500572, these scores look a lot better; however, they are still fairly low generally. Some of the highest F1-scores result when pos_abs_th_front is 3 or 5, pos_abs_th_end is 5 or 10, and pos_enc_coeff is 0.5.
These results look fairly similar to the ones found in https://github.com/clamsproject/app-swt-detection/issues/100#issuecomment-2192778483. Some of the highest F1-scores result when pos_abs_th_front is 0 or 3, pos_abs_th_end is 3 or 10, and pos_enc_coeff is 0.5 or 1.
This label seems to have the most drastic (positive) change compared to the previous gridsearch results in https://github.com/clamsproject/app-swt-detection/issues/100#issuecomment-2195500572. Some of the highest F1-scores result when pos_abs_th_front is 3 or 10, pos_abs_th_end is 5 or 10, and pos_enc_coeff is 0.75 or 1.
With these findings, I believe that an ideal configuration for the three hyperparameters is as follows:
```
pos_abs_th_front: 3
pos_abs_th_end: 10
pos_enc_coeff: 0.5
```
pos_enc performances

Using the configuration mentioned above, we can compare the performance of using sinusoidal-add as opposed to no pos_enc for the model.
As found previously, there doesn't seem to be much of a difference in performance between using positional encoding and not using a pos_enc for detecting bars.
These results show that using positional encoding may allow the model to detect closing credits better than not using a pos_enc.
Once again, these results show that using positional encoding does not perform any better than using no pos_enc for detecting chyrons, and in fact might be slightly worse.
These results are as drastic as in https://github.com/clamsproject/app-swt-detection/issues/100#issuecomment-2201394732, where it appears that positional encoding performs nearly 0.3 points better than using no pos_enc, which suggests that the model can detect slates fairly well using sinusoidal-add.
New Feature Summary
In the first rounds of training, we used a 94-min hard cap (the length of the longest video in the training data in those rounds) on the sinusoidal positional vectors. However, we have now realized that such a hard cap does not generalize to videos longer than the longest training video.
So before moving on to the next rounds of training (with "hard" examples Owen is currently annotating), we'd like to tweak the positional encoding, and make sure the experiment results we saw in the first rounds (absolute encoding performed the best) are reproducible.
A few ideas for other hybrid positional encodings: