Closed: drndr closed this pull request 10 months ago
The test-task check failed due to a ModuleNotFoundError: the custom task.py requires the more_itertools library to be installed.
Thanks for bringing up the issue. I see that you're using the chunk functionality of more_itertools. Would it be possible for you to bring the implementation of that function into your task.py? Fewer dependencies help us maintain a more lightweight framework.
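For illustration, a minimal sketch of what inlining such a helper could look like, assuming the task only needs fixed-size chunking; the function name and signature mirror more_itertools.chunked and are not taken from the actual task.py:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(iterable: Iterable[T], n: int) -> Iterator[List[T]]:
    """Yield successive lists of up to n items, like more_itertools.chunked."""
    chunk: List[T] = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == n:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk
```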
Hello!
We are getting quite close to the deadline (September 1, 11:59PM anywhere on earth), which is why I wanted to remind you that your PR still needs some attention: please double-check the automated checks that failed, and take the message that Amir posted last week into account.
Please don't forget to submit your accompanying paper to OpenReview via https://openreview.net/group?id=GenBench.org/2023/Workshop by September 1.
Good luck finalising your PR and paper; feel free to tag us if you have questions.
Cheers,
Verna
On behalf of the GenBench team
Hi,
Sorry, some style issues were introduced in the last commit at submission and I forgot to rerun the make commands. These issues are fixed in the latest commit.
@drndr We're in the process of merging the tasks into the repo.
Could you please include a single file usage_example.py for each task, where you use the task for fine-tuning and evaluation of a model (preferably a pretrained Hugging Face model)? Please also include a requirements-usage-example.txt listing the Python dependencies that need to be installed to run the example.
Hey @drndr. Thanks a lot for creating such a clean usage_example.py.
Just one small thing needs to be finalized before the merge. I realize that you're loading the data manually in your example, but the purpose of this example is to show how the task can be used from the genbench framework to load the data and perform the evaluation. Could you modify it so that it uses the genbench framework to load the task and create the data pipeline? Also, it'd be great to show how one could use the task's evaluate_predictions to compute the metrics.
These changes can really help the adoption of genbench and its tasks in the long run.
Thanks in advance.
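For reference, a rough sketch of what such a genbench-based usage_example.py skeleton could look like, assuming the framework exposes load_task and that the task object provides the get_dataset_raw and evaluate_predictions methods discussed in this thread; the task ID, split access, argument order, and placeholder scores are illustrative only:

```python
from genbench import load_task  # assumed entry point of the genbench framework

# Hypothetical subtask ID; the real IDs follow the task's directory layout.
task = load_task("nl_codesearch_mrr:codesearchnet_adv")

# Build the data pipeline through the framework instead of loading files manually.
dataset = task.get_dataset_raw()

# In a real example the scores would come from a fine-tuned Hugging Face model;
# here they are placeholders in the expected format (one float per pair).
predictions = [{"score": 0.0} for _ in dataset["test"]]

# Compute the ranking metric (MRR) through the task's own evaluation logic.
results = task.evaluate_predictions(predictions, dataset["test"])
print(results)
```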
Natural Language Codesearch Ranking
The task consists of 8 subtasks and measures cross-lingual and domain generalization, as well as robustness to covariate shift. It includes a ranking evaluation: given n code-comment pairs (1 true pair and n-1 distractor pairs, where a comment has been matched with a random code snippet), calculate the mean reciprocal rank (MRR) score; a small worked example is sketched below.
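To make the ranking metric concrete, here is a small illustrative computation (not taken from the task code) of the MRR over two queries, each with one true pair and two distractors:

```python
def mean_reciprocal_rank(groups):
    """Each group is a list of (is_true_pair, score) tuples for one query."""
    reciprocal_ranks = []
    for group in groups:
        ranked = sorted(group, key=lambda pair: pair[1], reverse=True)
        rank = next(i for i, (is_true, _) in enumerate(ranked, start=1) if is_true)
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# True pair ranked 1st in the first query and 2nd in the second:
# MRR = (1/1 + 1/2) / 2 = 0.75
print(mean_reciprocal_rank([
    [(True, 0.9), (False, 0.3), (False, 0.1)],
    [(False, 0.8), (True, 0.6), (False, 0.2)],
]))
```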
Authors
andor.diera@uni-ulm.de
Abdelhalim.Dahou@gesis.org
Lukas@LPAG.de
fabian.karl@uni-ulm.de
florian.sihler@uni-ulm.de
ansgar.scherp@uni-ulm.de
Implementation
For the ranking setup, a custom task.py is implemented for each subtask, where the get_dataset_raw and evaluate_predictions methods are overridden. For testing the MRR custom task, a test case script can be found at nl_codesearch_mrr/codesearchnet_adv/test_mrr_task.py. Note: due to the cbt test suite, the config files of the MRR tasks include task_type="multiple_choice" and evaluation metrics fields; these were needed for the test suite to execute, but the task relies on the custom evaluation implementation.
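As an outline only (the actual implementations live in each subtask's task.py), a custom subtask class along these lines overrides the two methods; the base-class import and method signatures follow common genbench_cbt conventions and may differ from the real code:

```python
from genbench import Task  # assumed base class from the genbench framework


class NlCodesearchMrrSubtaskSketch(Task):  # hypothetical class name
    def get_dataset_raw(self):
        # Load the code/comment pairs and attach n-1 random distractor
        # snippets to each true pair, as described in the task summary.
        ...

    def evaluate_predictions(self, predictions, gold):
        # Rank each true pair against its distractors by the predicted
        # "score" and return the mean reciprocal rank (MRR).
        ...
```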
Usage
For ranking, the evaluate_predictions method expects a list of dictionary items, where a given key (e.g. "score") has a float value. The values should be either logits from binary classification or a similarity metric over embeddings (cosine similarity or dot product).
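A hedged sketch of producing that prediction list with embedding similarities; the random tensors stand in for the outputs of whatever encoder was fine-tuned, and nothing here is taken from the task code:

```python
import torch
import torch.nn.functional as F

# Dummy embeddings standing in for encoder outputs of comments and code snippets.
comment_embs = [torch.randn(768) for _ in range(3)]
code_embs = [torch.randn(768) for _ in range(3)]

# One dictionary per code/comment pair, with a float under the "score" key,
# here using cosine similarity (a dot product or classification logit also works).
predictions = [
    {"score": F.cosine_similarity(c, s, dim=0).item()}
    for c, s in zip(comment_embs, code_embs)
]
print(predictions)
```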
Checklist:
The task was tested with the genbench-cli test-task tool.