arshandalili opened 1 month ago
The possibility of running the benchmark in "inference mode" (without evaluation) is described in the README: https://github.com/ChangeIsKey/LSCDBenchmark/blob/f14714fc678891199d7b61a1492ffaec3d170611/README.md?plain=1#L69
@arshandalili Arshan, before closing issues could you please describe how exactly they were solved? For this issue, please cite a command we can use to reproduce the results in the benchmark.
Reproduced the LSCDiscovery COMPARE result:
COMPARE:
DWUG_ES LSCD_Graded:
@arshandalili Could you please give a link to these tests in the repo? Please also link where the reproduced results were published; I don't see any models giving COMPARE scores of -0.858 or 0.858 in our paper or repository. We should clearly specify which results we are trying to reproduce here, and if we see any difference, how large this difference is.
The reference is LSCDiscovery, which reports a COMP Spearman of 0.854. In the benchmark, we got 0.858, which is roughly equal. The difference could be due to the data, but it is very unlikely that the model is not working well.
Also, refer to test_lscd.py.
I've tried running test_lscd.py, but the test failed (see the screenshot). The worst thing is that it didn't produce any output while running for about 12 hours (I was running on CPU), and it is unclear why it failed.
@arshandalili Could you try running the test from a freshly cloned repo and see if it works for you? Please send a screenshot of your outputs to show how it should work. Also, do you know of a way to run the test that prints more information for debugging?
@nvanva Please try running without pytest. The test_lscd.py file is for running the task directly and doesn't have assertions for testing; for that, refer to the integration tests. Moreover, line 99 of DM.py sets the model to run on GPU, which you didn't use. So to solve this issue, you'll have to run test_lscd.py without pytest and change line 99 to device = 'cuda' if torch.cuda.is_available() else 'cpu'. I can change it.
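For reference, here is a minimal, self-contained sketch of the suggested device selection, assuming DM.py builds a PyTorch model (the real model and surrounding code in DM.py differ; nn.Linear below is only a hypothetical stand-in):

```python
import torch
import torch.nn as nn

# Use the GPU only when one is actually available; otherwise fall back to CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Hypothetical stand-in for the model constructed in DM.py.
model = nn.Linear(4, 2).to(device)
print(f"running on {device}")
```

With a change like this, running test_lscd.py on a CPU-only machine should no longer fail at model initialization.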
@arshandalili Arshan, have you checked that it works from a freshly cloned repo? Could you please check it and send the stdout? It probably will not fit in a single screenshot, so redirect the output (python test_lscd.py &> outputs.txt) and send outputs.txt!
I've spent 2 hours and fixed lots of small technical issues but still don't get the result; I cannot spend more time on this. Please finish your one task: making sure that the COMPARE scores are easily reproducible after cloning the repository from scratch.
Just a few issues I remember from this struggle:
1) failure because the directory "results" does not exist; please commit it to the repo (without its content and subdirectories; an alternative fix is sketched after this list)
2) mlrose==1.3.0 cannot be installed without additional flags to pip because it depends on sklearn==0.0
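A note on item 1: git cannot track an empty directory, so committing "results" without its contents requires a placeholder file such as .gitkeep. Alternatively, the benchmark code could create the directory itself at startup; the sketch below assumes such a call can be added somewhere before results are written (the exact location is hypothetical):

```python
from pathlib import Path

# Create the results directory (and any missing parents) if it does not exist,
# so a fresh clone does not fail on the missing path.
Path("results").mkdir(parents=True, exist_ok=True)
```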
@arshandalili Please also make sure that the installation instructions work and that they install all necessary dependencies in a fresh virtual env.
OK, after manually installing the dependencies, the code runs. But the final score is 0.08.
Finally, after fixing requirements.txt and the other technical issues mentioned above, I was able to reproduce the score of 0.85 shown above. The problem is that it is reproduced only when installing DM from @arshandalili's fork of the DM repo. If, after reproducing, I install DM from the official repo, I get an awful score of 0.08 instead. This probably means that the official repo is broken. Arshan, please fix this; as we discussed several times before, we need a working DM in the official repo, and we want to use that one in the benchmark, not some fork.
@nvanva I already opened a PR for the merge a while ago, but it is still open as of now: PR. Please review it.
@arshandalili The PR is merged, and the results are now reproduced:
Thank you for fixing that!
@nvanva You're welcome!
@arshandalili Before closing the issue, please make sure that the test also runs for lscd_graded.
Reproduced the LSCDiscovery COMPARE result:
COMPARE:
DWUG_ES LSCD_Graded:
For the second one, I've replaced the config in test_lscd.py with the one shown in the picture below. I don't have the exact error message at hand, but basically it said that lscd_graded is not among the supported tasks.
Reproduce results from this page: