GenBench / genbench_cbt_2023

The official Genbench Collaborative Benchmarking Task repository 2023 (Archived)

[Task Submission] Natural Language Codesearch Ranking (`nl_codesearch_mrr`) #27

Closed · drndr closed this 10 months ago

drndr commented 1 year ago

[Natural Language Codesearch Ranking]

The task consists of 8 subtasks and measures cross-lingual and domain generalization, as well as robustness to covariate shift. It includes a ranking evaluation: given n code-comment pairs (1 true pair and n-1 distractor pairs, where a comment has been matched with a random code snippet), calculate the mean reciprocal rank (MRR) score.
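For illustration, a minimal sketch of how MRR could be computed over such ranking instances (a standalone sketch, not the task's actual implementation; it assumes the true pair's score is listed first in each instance):

```python
# Sketch only: each inner list holds the scores for one ranking instance,
# with the true pair's score first, followed by the n-1 distractor scores.
def mean_reciprocal_rank(instances):
    reciprocal_ranks = []
    for scores in instances:
        # Rank of the true pair among all n candidates (1 = ranked first).
        rank = 1 + sum(1 for s in scores[1:] if s > scores[0])
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# True pair outranked by one distractor -> rank 2 -> reciprocal rank 0.5
print(mean_reciprocal_rank([[0.9, 0.95, 0.2]]))  # 0.5
```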

Authors

Implementation

For the ranking setup, a custom task.py is implemented for each subtask, in which the get_dataset_raw and evaluate_predictions methods are overridden. For testing the MRR custom task, a test case script can be found at nl_codesearch_mrr/codesearchnet_adv/test_mrr_task.py. Note: due to the cbt test suite, the config files of the MRR tasks include task_type="multiple_choice" and evaluation metrics fields; these were needed for the test suite to execute, but the task relies on the custom evaluation implementation.
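As a rough sketch, the structure of such a task.py might look like this (the class name and method signatures are illustrative assumptions, not the submitted code):

```python
from genbench import Task  # base class provided by the genbench framework


class NlCodesearchMrrExampleTask(Task):  # illustrative name, not the submitted class
    def get_dataset_raw(self, n_distractors):
        # The real method builds the ranking data by pairing each true
        # code-comment example with n_distractors randomly matched pairs.
        raise NotImplementedError

    def evaluate_predictions(self, predictions, n_distractors):
        # The real method computes MRR from the per-pair scores,
        # as sketched above.
        raise NotImplementedError
```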

Usage

For ranking, the evaluate_predictions method expects a list of dictionary items, where a given key (e.g. "score") has a float value. The values should be either logits from binary classification or similarity scores between embeddings (cosine or dot product).
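For example, the input passed to evaluate_predictions might look like this (the scores are made up, and the ordering convention is an assumption):

```python
# Example of the expected input format: each item carries a float "score",
# which could be a binary-classification logit or an embedding similarity
# (cosine or dot product). The true pair is followed by its distractors.
predictions = [
    {"score": 0.91},  # true code-comment pair
    {"score": 0.34},  # distractor pair
    {"score": 0.12},  # distractor pair
]
# This list is what gets passed to the task's evaluate_predictions method.
```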

Checklist:

drndr commented 1 year ago

The test-task check failed due to a ModuleNotFoundError: the custom task.py requires the more_itertools library to be installed.

kazemnejad commented 1 year ago

Thanks for bringing up the issue. I see that you're using the chunked functionality of more_itertools. Would it be possible for you to bring the implementation of that function into your task.py? Fewer dependencies help us maintain a more lightweight framework.
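A minimal, dependency-free version of chunked might look like this (a sketch of the suggestion, not necessarily the code that ended up in the PR):

```python
def chunked(iterable, n):
    # Drop-in stand-in for more_itertools.chunked: yield successive lists
    # of up to n items from the iterable.
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == n:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```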

vernadankers commented 1 year ago

Hello!

We are getting quite close to the deadline (September 1, 11:59PM anywhere on earth), which is why I wanted to remind you that your PR still needs some attention: please double-check the automated checks that failed, and take into account the message that Amir posted last week.

Please don't forget to submit your accompanying paper to Openreview via https://openreview.net/group?id=GenBench.org/2023/Workshop by September 1.

Good luck finalising your PR and paper; feel free to tag us if you have questions.
Cheers, Verna
On behalf of the GenBench team

drndr commented 1 year ago

Hi,

Sorry, there were some style issues introduced in the last commit at submission and I forgot to rerun the make commands. These issues are fixed in the latest commit.

kazemnejad commented 1 year ago

@drndr We're in the process of merging the tasks into the repo.

Could you please include a single usage_example.py file for each task, in which you use the task for fine-tuning and evaluating a model (preferably a pretrained Hugging Face model). Please also include a requirements-usage-example.txt listing the Python dependencies that need to be installed to run the example.

kazemnejad commented 12 months ago

Hey @drndr. Thanks a lot for creating such a clean usage_example.py.

Just one small thing needs to be finalized before the merge. I realize that you're loading the data manually in your example, but the purpose of this example is to show how the task can be used from the genbench framework to load the data and perform the evaluation. Could you modify it so that it uses the genbench framework to load the task and create the data pipeline? Also, it'd be great to show how one could use the task's evaluate_predictions to compute the metrics.
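Something along these lines is what we have in mind (a sketch only: the subtask id, keyword arguments, and dummy scores below are illustrative assumptions, with load_task being the framework's entry point for loading a task):

```python
from genbench import load_task

# Illustrative subtask id and keyword arguments; adjust to how the subtasks
# are actually registered in this PR.
task = load_task("nl_codesearch_mrr:codesearchnet_adv")
ranking_data = task.get_dataset_raw(n_distractors=9)  # data pipeline via the framework

# In the real usage_example.py the scores would come from the fine-tuned
# Hugging Face model (classification logits or embedding similarities).
# Dummy scores for a single ranking instance (1 true pair + 9 distractors):
scores = [0.9, 0.1, 0.2, 0.3, 0.1, 0.0, 0.4, 0.2, 0.1, 0.3]
predictions = [{"score": s} for s in scores]
print(task.evaluate_predictions(predictions, n_distractors=9))
```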

These changes can really help the adoption of genbench and its tasks in the long run.

Thanks in advance.