freme-project / freme-ner

Apache License 2.0

Different output for abbreviated terms #63

Closed Katsivelisp closed 7 years ago

Katsivelisp commented 8 years ago

Hello again,

in our case we have many examples of abbreviated terms in author records or abstract elements. If we run the NER service on two written versions of the same entity, we get different results. For example:

Is this something that can be resolved within the scope of FREME NER? Just off the top of my head: is there something like an "abbreviation" list where we could add cases like this ("Univ." = "University", "Inst." = "Institute", etc.)?
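To make the idea more concrete, here is a rough sketch of what I mean by an abbreviation list, written as a preprocessing step on our side before the text is sent to the service. This is not part of FREME NER; the `ABBREVIATIONS` table and the `expand_abbreviations` helper are hypothetical names, and the two entries are just the examples mentioned above.

```python
import re

# Hypothetical abbreviation list (not a FREME NER feature); the entries
# are only the two examples given in this issue.
ABBREVIATIONS = {
    "Univ.": "University",
    "Inst.": "Institute",
}

# Match any listed abbreviation, but only when it is not glued to a
# preceding word character (so "MyUniv." is left untouched).
_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")"
)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation in text with its full form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)


print(expand_abbreviations("Univ. of Athens"))
```

The matching is case-sensitive and purely dictionary-based, so it would only help with abbreviations we already know about, and an ambiguous abbreviation would need more context than a simple lookup provides.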

m1ci commented 8 years ago

Do you want to perform only linking? That is, you submit the name of an entity and ask for a link, in VIAF in this case?

Or do you submit text and want to perform spotting first, then linking?

Katsivelisp commented 8 years ago

Hi,

I'm sorry, I wasn't very clear on this. I'm mostly talking about annotating entities using the VIAF dataset.

We often stumble upon these cases as part of a bigger, abstract textual element. So, for example, "National University of Singapore" can be followed or preceded by author names, countries, postal codes etc. So yes, I suppose we would like to perform spotting first, then linking to the VIAF dataset.

I hope this helps.

m1ci commented 8 years ago

Thanks for the clarification. Can you provide some more complete examples where spotting/linking does not work and that are important for you?

> So, for example, "National University of Singapore" can be followed or preceded by author names, countries, postal codes etc.

Our entity spotting models have been trained on clean text, which means regular sentences, correct grammar, correct spelling (including capitalization), correct punctuation, no HTML or XML markup, etc. So spotting might not work well on your texts. But first, we need to confirm that this is the cause before looking for a solution. One solution is to develop a training dataset. For training a decent NER model we would need 10-15K sentences, where on average a sentence contains about 13 tokens.

m1ci commented 8 years ago

What about parsing your input? Isn't there some usual pattern in the content of your documents? That would be the least painful solution.

Katsivelisp commented 8 years ago

I'm not sure what you mean by "parse". There is no universal parsing rule for such records in the XML documents that we process. When the organization records are too short, we often concatenate them with other elements, such as the city, country or country code.

A typical, wider scenario where we would like to apply FREME NER is in cases like the following:

Stoitsis, G.; University of Athens; Department of Informatics and Telecommunications; Athens; Attica; Greece; http://www.di.uoa.gr

This is handled just fine by FREME NER with the VIAF dataset option. The result correctly links "University of Athens" with the respective entity. But there are cases like this:

Stoitsis, G.; Univ. of Athens; Department of Informatics and Telecommunications; Athens; Attica; Greece; http://www.di.uoa.gr

This doesn't bring up any results for the "University of Athens" entity. So, how do you think we should handle this?

jnehring commented 8 years ago

@Katsivelisp is this issue still up to date?

m1ci commented 7 years ago

inactive issue, closing it