DFKI-NLP / InterroLang

InterroLang: Exploring NLP Models and Datasets through Dialogue-based Explanations [EMNLP 2023 Findings]
https://arxiv.org/abs/2310.05592

[Data] Dev set for parsing accuracy evaluation #81

Closed nfelnlp closed 1 year ago

nfelnlp commented 1 year ago

We need a custom dataset of prompts (around 100 instances) that we can test our four systems on before the user study.

Creating a test set will be a separate task, done in parallel to the user study (#73).

tanikina commented 1 year ago

Here is the development file with the annotations. Please let me know if I need to change something. dev_set_annotated.txt

nfelnlp commented 1 year ago

Thanks a lot! Overall, this looks very complete to me. This is a sensible selection of questions and operations! Minor things:

Afterwards, please

  1. Check if it can be read by the system, e.g. as a replacement for the original data (logic/prompts.py) or with the code for computing the parsing results (experiments/compute_parsing_accuracy.py)
  2. Commit the data to the repo
  3. Close the issue

tanikina commented 1 year ago

Thank you for the feedback! I removed the last empty line and filled in all {span} placeholders with valid phrases. You are right: since {span} is not treated as a feature in the dataset, it will not be filled automatically, so I just replaced {span} with text.

I checked that the file can be loaded in compute_parsing_accuracy.py, but I'm having an issue loading the boolq config file: although the file is inside the configs directory, it is not found when gin.parse_config_file is called. It complains with the following message:

[Screenshot from 2023-05-24 13-26-13]

Maybe there are still some things that need to be changed in compute_parsing_accuracy.py? I was using the following parameters:

    InterroLang/experiments$ python compute_parsing_accuracy.py --model "nearest-neighbor" --dataset boolq --id 0

The model comes from ExplainBot.parsing_model_name in boolq_nn.gin. If I comment out this line, the following code works fine:

    testing_data = load_test_data(test_suite)

So, I assume that the development set has the right format :) As you suggested, I added the file here: https://github.com/nfelnlp/InterroLang/pull/85
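The two cleanup steps mentioned above (no trailing empty line, no unfilled placeholders such as {span}) could be checked automatically before committing. The following is only a minimal sketch: the function name `find_issues` and the `auto_filled` parameter are hypothetical and not part of the repository, and it assumes the dev set is a plain-text file in which placeholders are written as `{name}`.

```python
import re
from typing import List, Optional, Set

# Assumption: placeholders look like {span} or {class_names}.
PLACEHOLDER = re.compile(r"\{([A-Za-z_]+)\}")


def find_issues(text: str, auto_filled: Optional[Set[str]] = None) -> List[str]:
    """Return human-readable problems found in the contents of a dev-set file.

    auto_filled lists placeholder names that the system is expected to fill
    itself (hypothetical parameter); these are not reported.
    """
    auto_filled = auto_filled or set()
    issues = []
    # A file that ends with two newlines has an empty last line.
    if text.endswith("\n\n"):
        issues.append("file ends with an empty line")
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            if match.group(1) not in auto_filled:
                issues.append(f"line {lineno}: unfilled placeholder {match.group(0)}")
    return issues


if __name__ == "__main__":
    sample = "what does the model predict for {span}?\nshow me the labels\n\n"
    for issue in find_issues(sample):
        print(issue)
```

Running a check like this on the annotated file before opening the pull request would flag any {span} that still needs to be replaced by hand.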

tanikina commented 1 year ago

OK, it works now as expected, but I had to run the script from the InterroLang directory. The script does not work from inside InterroLang/experiments.
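This behaviour is typical when a script opens files via paths relative to the current working directory. One possible way to make the script independent of where it is launched is to resolve paths against the script's own location. The sketch below is an assumption about how this could be done, not the fix used in the repository: the helper `resolve_from_repo_root` is hypothetical, and it assumes that the configs directory sits directly under the InterroLang root.

```python
from pathlib import Path


def resolve_from_repo_root(relative_path: str, script_file: str, levels_up: int = 1) -> Path:
    """Resolve relative_path against the repository root, not the working directory.

    script_file is normally __file__. With levels_up=1, a script located in
    <root>/experiments/ resolves paths against <root>.
    """
    repo_root = Path(script_file).resolve().parents[levels_up]
    return repo_root / relative_path


# Possible usage inside experiments/compute_parsing_accuracy.py
# (assumes the config lives at configs/boolq_nn.gin under the repo root):
#
#   import gin
#   gin.parse_config_file(str(resolve_from_repo_root("configs/boolq_nn.gin", __file__)))
```

With this approach, the same command should find the config file whether it is started from InterroLang or from InterroLang/experiments.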

tanikina commented 1 year ago

Sorry, one more question: after inspecting the output, I noticed that the {class_names} placeholders were not automatically replaced. The class names differ between datasets, though. Shall I then add a separate version of the validation set for each dataset? There are just a few examples with class names, so I can replace them quickly. Here they are: https://github.com/nfelnlp/InterroLang/pull/85/commits/a016b5baafeee383372f7437424e5e386bd79035
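An alternative to maintaining one file per dataset would be to keep a single file and expand {class_names} when it is loaded. The sketch below only illustrates that idea: the function `fill_class_names` and the `CLASS_NAMES` table are hypothetical, the class names are illustrative placeholders rather than values taken from the repository, and it assumes that each {class_names} stands for a single class name.

```python
from typing import Dict, List

# Illustrative values only; the real class names would come from each dataset.
CLASS_NAMES: Dict[str, List[str]] = {
    "boolq": ["false", "true"],                          # assumed labels
    "other_dataset": ["class_a", "class_b", "class_c"],  # hypothetical dataset
}


def fill_class_names(prompt: str, dataset: str) -> List[str]:
    """Expand a prompt containing {class_names} into one prompt per class name.

    Prompts without the placeholder are returned unchanged.
    """
    if "{class_names}" not in prompt:
        return [prompt]
    return [prompt.replace("{class_names}", name) for name in CLASS_NAMES[dataset]]


if __name__ == "__main__":
    for filled in fill_class_names("how many {class_names} instances are there?", "boolq"):
        print(filled)
```

Since only a few examples contain class names, replacing them by hand in per-dataset files, as proposed above, may well be the simpler option.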