Closed nfelnlp closed 1 year ago
Here is the development file with the annotations. Please let me know if I need to change something. dev_set_annotated.txt
Thanks a lot! Overall, this looks very complete to me. This is a sensible selection of questions and operations! Minor things:
"{span}"
(e.g. l. 109) can be problematic when read in by the system. This needs to be tested (see below). Afterwards, please
Thank you for the feedback! I removed the last empty line and filled in all {span} placeholders with valid phrases. You are right that {span} is not treated as a feature in the dataset, so it will not be filled in automatically; I therefore replaced {span} with concrete text.
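To catch any remaining placeholders before loading the file, a small sanity check could scan for leftover {…} tokens. This is a hypothetical helper, not part of the repository:

```python
import re

def find_unfilled_placeholders(path):
    """Return (line_number, placeholder) pairs for any {name}-style
    placeholders still left in the annotated file."""
    pattern = re.compile(r"\{[a-z_]+\}")
    hits = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            for match in pattern.findall(line):
                hits.append((lineno, match))
    return hits
```

Running it over dev_set_annotated.txt should return an empty list once every placeholder has been replaced.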
I checked that the file can be loaded in compute_parsing_accuracy.py, but I'm having an issue loading the boolq config file: although the file is inside the configs directory, it is not found when gin.parse_config_file is called.
It complains with the following message:
Maybe there are still some things that need to be changed in compute_parsing_accuracy.py? I used the following invocation: InterroLang/experiments$ python compute_parsing_accuracy.py --model "nearest-neighbor" --dataset boolq --id 0
The model comes from ExplainBot.parsing_model_name in boolq_nn.gin.
If I comment out this line, the following code works fine: testing_data = load_test_data(test_suite)
So, I assume that the development set has the right format :)
As you suggested I added the file here: https://github.com/nfelnlp/InterroLang/pull/85
Ok, it works now as expected. However, I had to run the script from the InterroLang root directory; it does not work from inside InterroLang/experiments.
Sorry, one more question: after inspecting the output, I noticed that {class_names} placeholders were not automatically replaced. But these differ between datasets. Shall I then add a separate version of the validation set for each dataset? There are just a few examples containing class names, so I can replace them quickly. Here they are: https://github.com/nfelnlp/InterroLang/pull/85/commits/a016b5baafeee383372f7437424e5e386bd79035
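Rather than maintaining a hand-edited copy of the validation set per dataset, the {class_names} placeholder could be expanded programmatically. The dataset-to-label mapping below is illustrative only, not taken from the repository:

```python
# Illustrative mapping; the actual label names depend on each dataset's config.
CLASS_NAMES = {
    "boolq": ["false", "true"],
    # add other datasets here, e.g. "olid": [...]
}

def expand_class_names(template_line, dataset):
    """Replace the {class_names} placeholder with the dataset's labels,
    joined as a comma-separated list."""
    labels = ", ".join(CLASS_NAMES[dataset])
    return template_line.replace("{class_names}", labels)
```

Mapping this function over the shared template file once per dataset would produce the dataset-specific validation sets automatically.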
We need a custom dataset of prompts (around 100 instances) that we can test our four systems on before the user study.
Test set creation will be a separate task, done in parallel to the user study (#73).