Closed — AinaIanemahy closed this issue 8 months ago
Okay, so I think Con 2 actually wouldn't be a problem if we still keep the code to process uses and instances.
I totally agree and think we should refactor this. It is mostly an artifact of the first implementation of this for PhiTag. Could you please tell us which changes we need to make for testing?
I will start implementing the changes and then list the changes that you have to make. I think that there will be no changes with respect to the testing proper, but data.py will have to change a bit.
I have created a separate branch for this.
That's okay from my side as long as you take care of merging this branch later. :-)
I have pushed changes to accommodate this in the branch https://github.com/Garrafao/durel_system_annotators/tree/No-uses-and-instances.
I will merge this once you are done with what you are currently working on @shafqatvirk @Garrafao . Please give me a thumbs-up.
Annotate will take a file called instances_with_token_index.csv (can be changed) which should have these columns:
["lemma","identifier1", "sentence_left", "token_index_of_sentence_left", "identifier2", "sentence_right", "token_index_of_sentence_right"]
data.py has to be changed accordingly. You can propose different column names if you want to.
I guess changing data.py is on me?
We have decided to change the column names to ["lemma","identifier1", "context1", "indexes_target_token1", "identifier2", "context2", "indexes_target_token2"]
This is solved with commit 966778c. data.py now outputs a file "instances.csv" in each data set folder with the structure described above. Note that we currently do not have data_split/, so all tests have to be run on the full data sets.
I will open a new issue for the data split and close this issue. We have completed everything related to deleting annotation_provider.py.
I have been refactoring the annotation provider and have noticed that almost everything it does (except creating the annotations themselves) is not really necessary.
What the annotation provider does is load the uses and instances files to create instances_with_token_index, which is then reloaded in xl_lexeme. There are two points which could be optimized:
Cons:
What do you think @Garrafao ?