Large parts of annotation provider unnecessary?

AinaIanemahy commented 8 months ago

I have been refactoring the annotation provider and I have noticed that almost all of what the annotation provider does (except the creation of the annotations themselves) is not really necessary.

What the annotation provider does is loading the uses and instances files to create the instances_with_token_index, which is then reloaded in xl_lexeme. There are two points which can be optimized:

Why do we save the instances_with_token_index when we could simply pass them on to xl_lexeme.
More importantly: Why do we use uses and instances files at all when we can simply download instances_with_token_index from DURel directly (That would mean doing a minor change in DURel)? I assume that this would save us a lot of time when it comes to larger datasets.

Cons:

Going with option two would require us to do some more refactoring and changing the format of the test_data. But I assume that our main goal is not to have little work here but to make run larger datasets more smoothly.
Adding a new annotator would possibly become a bit more difficult (we might have to implement a new endpoint in DURel if we do so).

What do you think @Garrafao ?

AinaIanemahy commented 8 months ago

Okay, so I think Con 2 actually wouldn't be a problem if we still keep the code to process uses and instances.

Garrafao commented 8 months ago

I totally agree and think we should re-factor this. It is mostly an artifact of the first implementation of this for PhiTag. Could you please instruct us on the changes we need to make for testing?

AinaIanemahy commented 8 months ago

I will start implementing the changes and then list the changes that you have to make. I think that there will be no changes with respect to the testing proper, but data.py will have to change a bit.

AinaIanemahy commented 8 months ago

I have created a separate branch for this.

Garrafao commented 8 months ago

I have created a separate branch for this.

That's okay from my side as long as you take care of merging this branch later. :-)

AinaIanemahy commented 8 months ago

I have pushed changes to accommodate this in the branch https://github.com/Garrafao/durel_system_annotators/tree/No-uses-and-instances.

I will merge this once you are done working on the stuff that you are currently doing @shafqatvirk @Garrafao . Please give me a thumbs-up.

AinaIanemahy commented 8 months ago

Annotate will take a file called instances_with_token_index.csv (can be changed) which should have these columns:

["lemma","identifier1", "sentence_left", "token_index_of_sentence_left", "identifier2", "sentence_right", "token_index_of_sentence_right"]

data.py has to be changed accordingly. You can propose different column names if you want to.

Garrafao commented 8 months ago

I guess changing data.py is on me?

AinaIanemahy commented 8 months ago

Annotate will take a file called instances_with_token_index.csv (can be changed) which should have these columns:

["lemma","identifier1", "sentence_left", "token_index_of_sentence_left", "identifier2", "sentence_right", "token_index_of_sentence_right"]

data.py has to be changed accordingly. You can propose different column names if you want to.

We have decided to change the column names to ["lemma","identifier1", "context1", "indexes_target_token1", "identifier2", "context2", "indexes_target_token2"]

Garrafao commented 8 months ago

Annotate will take a file called instances_with_token_index.csv (can be changed) which should have these columns: ["lemma","identifier1", "sentence_left", "token_index_of_sentence_left", "identifier2", "sentence_right", "token_index_of_sentence_right"] data.py has to be changed accordingly. You can propose different column names if you want to.

We have decided to change the column names to ["lemma","identifier1", "context1", "indexes_target_token1", "identifier2", "context2", "indexes_target_token2"]

This is solved with commit 966778c. data.py now outputs a file "instances.csv" in each data set folder with the structure described above. Note that currently we do not have data_split/ so all tests have to be run on the full data sets.

AinaIanemahy commented 8 months ago

I will open a new issue for the data split and close this issue. We have completed everything that is related to deleting annotation_provider.py

Garrafao / durel_system_annotators

Large parts of annotation provider unnecessary? #43