clamsproject / app-swt-detection

CLAMS app for detecting scenes with text from video input
Apache License 2.0

positional encoding is not working #113

Closed keighrim closed 4 months ago

keighrim commented 4 months ago

Bug Description

When I run training rounds with different pos_enc_coeff values, the results don't seem to vary, as shown in https://github.com/clamsproject/app-swt-detection/issues/100#issuecomment-2207028702

We need further investigation to verify the effectiveness of positional encoding.
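
One quick way to start that investigation is to check whether the coefficient reaches the model input at all. The snippet below is only a sketch under stated assumptions: it uses a generic sinusoidal encoding and an additive combination (`features + coeff * position_vector`), and the helper names `sinusoidal_encoding` and `add_position` are hypothetical. The app's actual positional encoding may be built and combined differently.

```python
import math


def sinusoidal_encoding(position, dim):
    """Generic sinusoidal encoding of a position into `dim` values (dim must be even)."""
    vec = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        vec.append(math.sin(position * freq))
        vec.append(math.cos(position * freq))
    return vec


def add_position(features, position, coeff):
    """Assumed combination (not confirmed by the repo): features + coeff * positional vector."""
    pos = sinusoidal_encoding(position, len(features))
    return [f + coeff * p for f, p in zip(features, pos)]


# Dummy feature vector; only the effect of the coefficient matters here.
features = [0.0] * 8
outputs = {c: add_position(features, position=3, coeff=c) for c in (0, 0.5, 1)}

# A coefficient of 0 should leave the features untouched ...
print(outputs[0] == features)
# ... and two different non-zero coefficients should produce different inputs.
print(outputs[0.5] != outputs[1])
```

If a comparable check on the real feature vectors shows identical inputs for different coefficients, the problem is in the wiring rather than in the hyperparameter values.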

Reproduction steps

some results from the latest round under 7be4b818a0c72713e501b27be9ebaeee5a3e1320

image


kla7 commented 4 months ago

Gridsearch results

With the adjustments made in #114, I created heatmaps per label to analyze the results.

The gridsearch configuration is as follows:

num_splits = {2}
num_epochs = {10}
num_layers = {4}
pos_unit = {60000}
dropouts = {0.1}
img_enc_name = {'convnext_lg', 'convnext_tiny'}
pos_length = {6000000}
pos_abs_th_front = {0, 3, 5, 10}
pos_abs_th_end = {0, 3, 5, 10}
pos_vec_coeff = {0, 1, 0.75, 0.5, 0.25}
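
To get a sense of the size of this search space, the sets above can be expanded into individual training configurations. This is a minimal sketch, not the project's gridsearch code; `expand_grid` is a hypothetical helper, and the values are copied from the configuration above.

```python
from itertools import product

# Search space copied from the configuration above; each value is a set of candidates.
grid = {
    "num_splits": {2},
    "num_epochs": {10},
    "num_layers": {4},
    "pos_unit": {60000},
    "dropouts": {0.1},
    "img_enc_name": {"convnext_lg", "convnext_tiny"},
    "pos_length": {6000000},
    "pos_abs_th_front": {0, 3, 5, 10},
    "pos_abs_th_end": {0, 3, 5, 10},
    "pos_vec_coeff": {0, 1, 0.75, 0.5, 0.25},
}


def expand_grid(grid):
    """Return one dict per combination of candidate values (hypothetical helper)."""
    names = list(grid)
    # Sort each set so the enumeration order is reproducible across runs.
    candidates = [sorted(grid[name]) for name in names]
    return [dict(zip(names, combo)) for combo in product(*candidates)]


configs = expand_grid(grid)
print(len(configs))  # 2 encoders * 4 front * 4 end * 5 coefficients = 160
```

Only four of the parameters have more than one candidate, so the grid amounts to 160 training runs, 32 of which have pos_vec_coeff = 0.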

Note: For the results, I focused on those using convnext_lg for img_enc since the scores seemed to be higher than with convnext_tiny.

As @owencking suggested, this time I analyzed recall scores and put more focus on chyrons. The heatmaps are included below for reference. The average scores in the bottom row of each table do not include the scores when pos_vec_coeff = 0, so that the analysis is done only when pos_enc is enabled; however, pos_vec_coeff = 0 scores are included for comparison when selecting ideal hyperparameter configurations.
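
The averaging rule described above can be sketched as follows. The table layout (rows keyed by pos_vec_coeff, columns keyed by another hyperparameter) is an assumption, `column_means` is a hypothetical helper, and the numbers are placeholders for illustration only, not results from the gridsearch.

```python
from statistics import mean


def column_means(scores, exclude_coeffs=(0,)):
    """Average each column over rows, skipping rows whose pos_vec_coeff is excluded.

    `scores` maps pos_vec_coeff -> {column_key: recall}; this layout is an assumption.
    """
    kept = [row for coeff, row in scores.items() if coeff not in exclude_coeffs]
    columns = kept[0].keys()
    return {col: mean(row[col] for row in kept) for col in columns}


# Placeholder numbers, NOT taken from the heatmaps.
toy = {
    0:   {"end=0": 0.90, "end=10": 0.90},
    0.5: {"end=0": 0.25, "end=10": 0.75},
    1:   {"end=0": 0.75, "end=10": 0.25},
}

# Bottom-row averages with positional encoding enabled only.
print(column_means(toy))
# Averages including the pos_vec_coeff = 0 baseline, for comparison.
print(column_means(toy, exclude_coeffs=()))
```

Keeping the baseline out of the averages prevents a strong pos_vec_coeff = 0 row from masking differences among the non-zero coefficients.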

Label I

image

Since the focus is on chyrons this time, I started with this heatmap. Right off the bat, it's clear that the scores tended to be higher when pos_vec_coeff = 0. I noticed that pos_vec_coeff = 0.5 had fairly high scores when pos_abs_th_end is 3 or 10, with 10 scoring higher than 3. With that, I wanted to look at the other labels when pos_vec_coeff = 0.5 and pos_abs_th_end = 10, focusing on when pos_abs_th_front is 3, 5, or 10.

Label B

image

When pos_vec_coeff = 0.5 and pos_abs_th_end = 10, the score is highest when pos_abs_th_front is 5.

Label C

image

When pos_vec_coeff = 0.5 and pos_abs_th_end = 10, the score is highest when pos_abs_th_front is 3.

Label S

image

When pos_vec_coeff = 0.5 and pos_abs_th_end = 10, the score is highest when pos_abs_th_front is 5.

Conclusion

With these findings, I believe an ideal configuration for the three hyperparameters is as follows:

pos_abs_th_front: 5
pos_abs_th_end: 10
pos_vec_coeff: 0.5

Interestingly, this is nearly identical to the ideal configuration found in https://github.com/clamsproject/app-swt-detection/issues/100#issuecomment-2207028702, except that the earlier analysis arrived at pos_abs_th_front = 3.

Comparing pos_enc performances

With the hyperparameter values mentioned above, I performed gridsearch again with the following configuration:

num_splits = {2}
num_epochs = {10}
num_layers = {4}
pos_unit = {60000}
pos_enc_dim = {256}
dropouts = {0.1}
img_enc_name = {'convnext_lg'}
pos_length = {6000000}
pos_abs_th_front = {5}
pos_abs_th_end = {10}
pos_vec_coeff = {0, 0.5}

I separately ran the gridsearch with the same configuration but with pos_abs_th_front changed to 3, so that I could observe any differences between the two values, since 3 was the ideal value when analyzing F1 scores. Results from both configurations are included below for comparison:

| pos_abs_th_front = 3 | pos_abs_th_front = 5 |
| --- | --- |
| image | image |
| image | image |
| image | image |
| image | image |

For label I, recall is now higher when pos_enc is enabled. The F1 score is still lower when pos_enc is enabled than when it is not, and the difference in F1 scores between the pos_vec_coeff values is actually now larger when pos_abs_th_front = 5. Since recall is the main focus this time, the results seem more promising than before.
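
Recall going up while F1 goes down is consistent with a drop in precision: the model finds more of the true frames but also produces more false positives. The sketch below illustrates this with hypothetical counts (not taken from these experiments); `prf` is an illustrative helper using the standard definitions of precision, recall, and F1.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for one label with 100 true instances; NOT experimental results.
without_pos = prf(tp=60, fp=20, fn=40)  # fewer hits, but few false alarms
with_pos = prf(tp=70, fp=60, fn=30)     # more hits, many more false alarms

for name, (p, r, f) in (("pos_vec_coeff=0", without_pos), ("pos_vec_coeff=0.5", with_pos)):
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy case recall rises from 0.6 to 0.7 while F1 falls, because precision drops from 0.75 to roughly 0.54.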

For label B, there is not much difference between the original configuration and the new one, but it is noteworthy that both recall and F1 scores are slightly higher with the new configuration.

For label C and label S, it seems that both the recall and F1 scores improved slightly compared to the original configuration.

Concluding thoughts