Closed nfelnlp closed 1 year ago
Here is the development file with the annotations. Please let me know if I need to change something. dev_set_annotated.txt
Thanks a lot! Overall, this looks very complete to me. This is a sensible selection of questions and operations! Minor things:
"{span}"
(e.g. l. 109) can be problematic when read in by the system. This needs to be tested (see below). Afterwards, please
Thank you for the feedback! I removed the last empty line and filled in all {span} placeholders with valid phrases. You are right that {span} is not treated as a feature in the dataset, so it will not be filled in automatically; I therefore replaced {span} with concrete text.
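To catch any remaining placeholders before loading the file, a small sanity check could scan for leftover {…} tokens. This is a hypothetical helper, not part of the repository:

```python
import re

def find_unfilled_placeholders(path):
    """Return (line_number, placeholder) pairs for any {name}-style
    placeholders still left in the annotated file."""
    pattern = re.compile(r"\{[a-z_]+\}")
    hits = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            for match in pattern.findall(line):
                hits.append((lineno, match))
    return hits
```

Running it over dev_set_annotated.txt should return an empty list once every placeholder has been replaced.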
I checked that the file can be loaded in compute_parsing_accuracy.py, but I'm having an issue loading the boolq config file: although the file is inside the configs directory, it is not found when gin.parse_config_file is called.
It complains with the following message:
Maybe there are still some things that need to be changed in compute_parsing_accuracy.py? I used the following invocation: InterroLang/experiments$ python compute_parsing_accuracy.py --model "nearest-neighbor" --dataset boolq --id 0
The model comes from ExplainBot.parsing_model_name in boolq_nn.gin.
If I comment out this line, the following code works fine: testing_data = load_test_data(test_suite)
So, I assume that the development set has the right format :)
As you suggested I added the file here: https://github.com/nfelnlp/InterroLang/pull/85
Ok, it works now as expected. However, I had to run the script from the InterroLang root directory; it does not work from inside InterroLang/experiments.
Sorry, one more question: after inspecting the output, I noticed that {class_names} placeholders were not automatically replaced. But these differ between datasets. Shall I then add a separate version of the validation set for each dataset? There are just a few examples containing class names, so I can replace them quickly. Here they are: https://github.com/nfelnlp/InterroLang/pull/85/commits/a016b5baafeee383372f7437424e5e386bd79035
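Rather than maintaining a hand-edited copy of the validation set per dataset, the {class_names} placeholder could be expanded programmatically. The dataset-to-label mapping below is illustrative only, not taken from the repository:

```python
# Illustrative mapping; the actual label names depend on each dataset's config.
CLASS_NAMES = {
    "boolq": ["false", "true"],
    # add other datasets here, e.g. "olid": [...]
}

def expand_class_names(template_line, dataset):
    """Replace the {class_names} placeholder with the dataset's labels,
    joined as a comma-separated list."""
    labels = ", ".join(CLASS_NAMES[dataset])
    return template_line.replace("{class_names}", labels)
```

Mapping this function over the shared template file once per dataset would produce the dataset-specific validation sets automatically.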
We need a custom dataset of prompts (around 100 instances) that we can test our four systems on before the user study.
Test set creation will be a separate task, done in parallel to the user study (#73).