dice-group / gerbil

GERBIL - General Entity annotatoR Benchmark
GNU Affero General Public License v3.0

Combine sameAs retrieval and entityChecking #134

Open MichaelRoeder opened 8 years ago

MichaelRoeder commented 8 years ago

At the moment, the sameAs retrieval as well as the entityChecking are done independently. However, if the sameAs retrieval was able to retrieve data for a given URI, the entity checking can be skipped. Thus, these two steps should be combined into one single step.

TortugaAttack commented 8 years ago

Do I understand this correctly? The sameAs retrieval fetches the entities that are the same as the resource, and these entities are then checked for existence. So the sameAs retrieval should check immediately whether an entity exists, instead of this being done after the sameAs retrieval.

So instead of the AbstractDataset, the SameAsRetriever should "start" the EntityChecker, and the problem should be solved... or did I miss something?

MichaelRoeder commented 8 years ago

It is a little bit more complicated (otherwise it would be too easy :wink: )

When a dataset is loaded, it is preprocessed in two steps:

  1. Go through all entities and try to extend their URI sets by retrieving owl:sameAs links.
  2. Go through all URIs of all entities and check whether each URI exists.

However, the first step already checks whether an entity exists, since it cannot retrieve information about an entity that does not exist. Thus, I would like to combine both functionalities into a single preprocessing step.
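To make the idea concrete, here is a rough sketch of what a combined pass could look like. It is not GERBIL's actual code: `SameAsLookup`, `ExistenceCheck`, `Result` and `preprocess` are hypothetical names made up for illustration, and the sketch assumes that a failed lookup is signalled by `null` and that retrieved sameAs URIs are not checked again.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class CombinedPreprocessingSketch {

    /** Hypothetical lookup: returns the owl:sameAs URIs, or null if nothing could be retrieved. */
    interface SameAsLookup {
        Set<String> retrieveSameAs(String uri);
    }

    /** Hypothetical fallback check, used only when the lookup retrieved nothing. */
    interface ExistenceCheck {
        boolean exists(String uri);
    }

    /** Hypothetical result holder: the extended URI set and the URIs found to be missing. */
    static final class Result {
        final Set<String> uris = new LinkedHashSet<>();
        final Set<String> missing = new LinkedHashSet<>();
    }

    static Result preprocess(Set<String> entityUris, SameAsLookup lookup, ExistenceCheck checker) {
        Result result = new Result();
        for (String uri : entityUris) {
            Set<String> sameAs = lookup.retrieveSameAs(uri);
            if (sameAs != null) {
                // Data was retrieved, so the URI must exist: skip the explicit check.
                result.uris.add(uri);
                // Assumption: retrieved sameAs URIs are trusted and not checked again.
                result.uris.addAll(sameAs);
            } else if (checker.exists(uri)) {
                // Nothing retrieved, but the fallback check confirms the URI.
                result.uris.add(uri);
            } else {
                result.missing.add(uri);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example data only; the URIs are placeholders.
        Map<String, Set<String>> links =
                Map.of("http://example.org/a", Set.of("http://example.org/a2"));
        Set<String> input = new LinkedHashSet<>();
        input.add("http://example.org/a");
        input.add("http://example.org/b");

        Result r = preprocess(input, links::get, uri -> false);
        System.out.println("URIs kept: " + r.uris);
        System.out.println("URIs missing: " + r.missing);
    }
}
```

The point of the sketch is only the control flow: the existence check becomes a fallback that runs when the sameAs lookup yields nothing, so every URI is touched once instead of twice. How non-existing URIs are handled afterwards is left open here.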

There are some additional requirements that have to be taken into account.

You might want to take a look at the sameas and check packages as well as their JUnit tests to get a better understanding. After that, you should think about the structure the preprocessing should have and how it can fulfill the requirements. We can discuss that if you want.

Cheers, Micha

MichaelRoeder commented 8 years ago

According to a comment in #137, this preprocessing should only be done if it is needed for the experiment. But this can be added later, after the refactoring described above is done.