arshandalili opened 1 month ago
The possibility of running the benchmark in "inference mode" (without evaluation) is described in the README: https://github.com/ChangeIsKey/LSCDBenchmark/blob/f14714fc678891199d7b61a1492ffaec3d170611/README.md?plain=1#L69
@arshandalili Arshan, before closing issues could you please describe how exactly they were solved? For this issue, please cite a command we can use to reproduce the results in the benchmark.
Reproduced the LSCDiscovery COMPARE result:
COMPARE:
DWUG_ES LSCD_Graded:
@arshandalili Could you please give a link to these tests in the repo? Please also link where the reproduced results were published; I don't see any models giving COMPARE scores of -0.858 or 0.858 in our paper or repository. We should clearly specify which results we are trying to reproduce here, and if we see any difference, how large this difference is.
The reference is LSCDiscovery, which reports a COMP Spearman of 0.854. In the benchmark, we got 0.858, which is roughly equal. The difference could be due to the data, but it is very unlikely that the model is not working well.
Also, refer to test_lscd.py.
I've tried running test_lscd.py, but the test failed (see the screenshot). The worst thing is that it didn't produce any output while running for about 12 hours (I was running on CPU), and it is unclear why it failed.
@arshandalili Could you try running the test from a freshly cloned repo and see if it works for you? Please send a screenshot of your outputs to show how it should work. Also, do you know of a way to run the test that prints more information for debugging?
@nvanva Please try running without pytest. The test_lscd.py file is for running the task directly and doesn't have assertions for testing; for that, refer to the integration tests. Moreover, line 99 of DM.py sets the model to run on GPU, which you didn't use. So to solve this issue, you'll have to run test_lscd.py without pytest and change line 99 to device = 'cuda' if torch.cuda.is_available() else 'cpu'. I can change it.
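For reference, here is a minimal, self-contained sketch of the suggested device selection, assuming DM.py builds a PyTorch model (the real model and surrounding code in DM.py differ; nn.Linear below is only a hypothetical stand-in):

```python
import torch
import torch.nn as nn

# Use the GPU only when one is actually available; otherwise fall back to CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Hypothetical stand-in for the model constructed in DM.py.
model = nn.Linear(4, 2).to(device)
print(f"running on {device}")
```

With a change like this, running test_lscd.py on a CPU-only machine should no longer fail at model initialization.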
@arshandalili Arshan, have you checked that it works from a freshly cloned repo? Could you please check it and send the stdout? It probably will not fit in a single screenshot, so redirect the output (python test_lscd.py &> outputs.txt) and send outputs.txt!
I've spent 2 hours and fixed lots of small technical issues but still don't get the result; I cannot spend more time on this. Please finish your one task: making sure that the COMPARE scores are easily reproducible after cloning the repository from scratch.
Just a few issues I remember from this struggle:
1) failure because the directory "results" does not exist; please commit it to the repo (without its content and subdirectories; an alternative fix is sketched after this list)
2) mlrose==1.3.0 cannot be installed without additional flags to pip because it depends on sklearn==0.0
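A note on item 1: git cannot track an empty directory, so committing "results" without its contents requires a placeholder file such as .gitkeep. Alternatively, the benchmark code could create the directory itself at startup; the sketch below assumes such a call can be added somewhere before results are written (the exact location is hypothetical):

```python
from pathlib import Path

# Create the results directory (and any missing parents) if it does not exist,
# so a fresh clone does not fail on the missing path.
Path("results").mkdir(parents=True, exist_ok=True)
```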
@arshandalili Please also make sure that the installation instructions work and that they install all necessary dependencies in a fresh virtual env.
OK, after manually installing the dependencies, the code runs. But the final score is 0.08.
Finally, after fixing requirements.txt and the other technical issues mentioned above, I was able to reproduce the score of 0.85 shown above. The problem is that it is reproduced only when installing DM from @arshandalili's fork of the DM repo. If, after reproducing, I install DM from the official repo, I get an awful score of 0.08 instead. This probably means that the official repo is broken. Arshan, please fix this; as we discussed several times before, we need a working DM in the official repo, and we want to use that one in the benchmark, not some fork.
@nvanva I already opened a PR for the merge a while ago, but it is still open as of now: PR. Please review it.
@arshandalili The PR is merged, and the results are now reproduced:
Thank you for fixing that!
@nvanva You're welcome!
@arshandalili Before closing the issue, please make sure that the test also runs for lscd_graded.
Reproduced the LSCDiscovery COMPARE result:
COMPARE:
DWUG_ES LSCD_Graded:
For the second one, I've replaced the config in test_lscd.py with the one shown in the picture below. I don't have the exact error message at hand, but basically it said that lscd_graded is not among the supported tasks.
Reproduce results from this page: