arshandalili opened 4 months ago
Adding the below results as integration tests:
Also, based on the paper:
Problems with dataset loading: this happens when we first load a dataset in test_on mode.
Possible problems with benchmark:
@arshandalili Arshan, please describe exactly what has been done regarding this issue.
The XL-Lexeme en results have been reproduced using this config:
Result:
@arshandalili Two questions: 1) Is the config in your screenshot also committed to the repo? Could you please give a link? 2) When we talk about results being "reproduced", we usually mean that we have a script/command that gets the same results on the same dataset as some published or previously known results. I see you got a Spearman's correlation of 0.623 on dwug_en_200. Which results exactly does this reproduce?
I see above the following results we wanted to reproduce: "XL-Lexeme-Cosine dwug_en_median NA 0.598 0.0". What is the difference between dwug_en_median and dwug_en_200?
For XL-Lexeme, the difference in results may be due to the data, but the goal was to ensure that the model works reasonably, i.e., that it doesn't contain bugs; otherwise there would be a huge difference.
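For context, the scores being compared (0.623 vs. 0.598) are Spearman rank correlations between model-predicted and gold graded scores. A minimal sketch of the metric, using made-up data (the benchmark itself presumably calls `scipy.stats.spearmanr`):

```python
def spearman(xs, ys):
    """Spearman's rho via the simplified formula (assumes no tied values)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

gold = [0.10, 0.40, 0.35, 0.80]  # hypothetical gold graded scores
pred = [0.20, 0.30, 0.50, 0.70]  # hypothetical model scores
print(spearman(gold, pred))  # → 0.8
```

Because the metric only depends on rank order, small numeric differences between runs (e.g. from data versions) can still yield similar correlations, which is why a gap like 0.623 vs. 0.598 is plausible without a bug.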
Currently, the only way to use a config is to pass it to a command; refer to the README.
Also, refer to test_wic.py.
Complete the Unit and Integration Tests for the WiC task.
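As a starting point, a WiC test could look something like the sketch below. This is a hypothetical outline, not the repo's actual API: `wic_predict` is a stand-in for the real model call in test_wic.py, and the example pair is invented.

```python
# Hypothetical sketch of a WiC (Word-in-Context) test. The task: given two
# sentences containing the same target word, decide whether the word is used
# in the same sense. `wic_predict` is a placeholder, NOT the repo's API.
import pytest


def wic_predict(sentence1: str, sentence2: str, target: str) -> bool:
    """Stand-in for the real model call; returns True for "same sense".

    Trivial placeholder logic so the sketch runs: a real test would
    invoke the XL-Lexeme (or other) model here.
    """
    return target in sentence1.split() and target in sentence2.split()


@pytest.mark.parametrize(
    "s1, s2, target, expected",
    [
        ("He deposited cash at the bank", "The bank approved the loan", "bank", True),
    ],
)
def test_wic_prediction(s1, s2, target, expected):
    assert wic_predict(s1, s2, target) == expected
```

An integration test in the same shape would load a small fixture dataset, run the full pipeline, and compare the resulting Spearman correlation against the known reference value within a tolerance.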