LanguageMachines / frog

Frog is an integration of memory-based natural language processing (NLP) modules developed for Dutch. All NLP modules are based on Timbl, the Tilburg memory-based learning software package.
https://languagemachines.github.io/frog
GNU General Public License v3.0

[NER] More fine-grained set definition regarding locations #59

Open proycon opened 6 years ago

proycon commented 6 years ago

Currently the NER module in Frog distinguishes persons, locations, events, products(?) and miscellaneous.

Since the module has been enhanced with gazetteers, I think we can do better than this coarse division. Various kinds of named entities are perfectly enumerable (countries, cities, street names, postal codes, rivers, forests, mountains...) and gazetteers serve well here; it would be a waste to lose this information by subsuming it all under "location". We already have a FoLiA set definition (https://github.com/proycon/folia/blob/master/setdefinitions/namedentities.foliaset.ttl) from a prior project that allows for a more fine-grained taxonomy of locations and is compatible with (i.e. a superset of) our current set.

Databases such as Geonames also contain this information, and we currently don't make use of it. I propose we migrate to a more fine-grained set (and include a few more gazetteers where possible). What do you think @kosloot @antalvdb @Irishx ?

Context: this is relevant for our 112 project (@HenkvdHeuvel), where we need to know whether a location is a street, a city, etc. I think we can include a lot of these gazetteer-based improvements in the Frog data itself, i.e. the generic Dutch model (as it's not sensitive data).

(Technicality: this is more of a frogdata issue than a Frog issue as such, but I guess it's more visible here.)

kosloot commented 6 years ago

As far as I can see, the software itself doesn't impose restrictions, so this is indeed a data question. I did use a small part of Geonames to test, and it is usable. But there are a lot of ugly details to consider: the data can be polluted and (very) ambiguous. So using this data might need some investigation, and probably preprocessing.
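To make the preprocessing concern concrete, here is a rough sketch of how ambiguous names could be separated from unambiguous ones before they go into a gazetteer. It assumes the tab-separated Geonames dump layout (name in the second column, feature code in the eighth). The `CODE_TO_LABEL` mapping and the function names are hypothetical, and apart from `loc.water.sea` the labels are illustrative rather than taken from the FoLiA set definition.

```python
import csv
from collections import defaultdict

# Hypothetical mapping from Geonames feature codes to fine-grained labels.
# Only a few codes are shown; the label names are assumptions for illustration.
CODE_TO_LABEL = {
    "PPL": "loc.city",         # populated place
    "STM": "loc.water.river",  # stream
    "SEA": "loc.water.sea",    # sea
}


def read_geonames(lines):
    """Collect every label seen for each name in a Geonames-style TSV.

    `lines` is any iterable of tab-separated strings (e.g. an open file).
    Returns a dict mapping name -> set of labels.
    """
    labels = defaultdict(set)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 8:
            continue  # skip malformed (too short) rows
        name, code = row[1].strip(), row[7]
        label = CODE_TO_LABEL.get(code)
        if name and label:
            labels[name].add(label)
    return labels


def split_ambiguous(labels):
    """Separate names with exactly one label from names with several."""
    clean = {n: next(iter(ls)) for n, ls in labels.items() if len(ls) == 1}
    ambiguous = {n: sorted(ls) for n, ls in labels.items() if len(ls) > 1}
    return clean, ambiguous
```

The ambiguous names could then be reviewed by hand or resolved by a fixed preference order, rather than silently ending up under whichever label happens to be read last.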

proycon commented 6 years ago

Another good (secondary) source for location data is OpenStreetMap; I experimented with it yesterday. It's fairly easy to extract all streets and cities/towns.

proycon commented 6 years ago

This is also relevant for @Irishx (frog evaluation) and @HenkvdHeuvel (112 project), and perhaps @antalvdb:

Okay, things are a bit more complex. We have some ordering problems. The current situation:

[Kobus] If it's ambiguous, the last gazetteer wins, I think. I think it goes like this: IF no Timbl tag was assigned AND gazetteer info IS known, then take that.

[Kobus] Everything gets stuffed into one big hash; the last one counts.
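The behaviour described above can be sketched roughly as follows. This is not Frog's actual code, only a minimal Python illustration of "one big hash, last one counts" plus the fallback rule. It assumes single-token names and uses `"O"` to mean "no tag assigned". All function names are hypothetical, and apart from `loc.water.sea` the fine-grained labels are made up for the example.

```python
def load_gazetteers(lists):
    """Merge (label, names) pairs into one dict, in reading order."""
    known = {}
    for label, names in lists:
        for name in names:
            known[name] = label  # a later list overwrites an earlier one
    return known


def merge_tags(tokens, context_tags, known):
    """Use the gazetteer label only where the context-based tagger gave no tag."""
    merged = []
    for token, tag in zip(tokens, context_tags):
        if tag == "O" and token in known:
            merged.append(known[token])
        else:
            merged.append(tag)
    return merged


# Reading order matters: the village list is read after the sea list,
# so "Noordzee" ends up as a city rather than a sea.
lists = [
    ("loc.water.sea", ["Noordzee"]),
    ("loc.street", ["Waal"]),
    ("loc.city", ["Noordzee"]),
]
known = load_gazetteers(lists)
```

With this rule, a token that the context-based module already tagged as `loc` keeps that coarse tag, and the gazetteer label is never consulted for it.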

I have a test sentence:

De Maas en de Waal stromen niet door Amsterdam, maar monden wel uit in de Noordzee

This results in four loc detections (from the context-based module), which is correct but doesn't make use of the gazetteers so we don't get any of the fine-grained categories, which was kind of the whole point of this exercise.

If I use a Frog trained on the much more limited model from the 112 project, the gazetteers do kick in and I now get:

There's a street named Waal, which is not surprising as there are streets named after pretty much everything, so street names should get a lower priority. Apparently there's also a village called "Noordzee", which happens to take precedence over loc.water.sea.

I'm trying to find the 'optimal' ordering for ners.known, which is tricky enough as there is always ambiguity and you can never get it really right, but I can't override the context-based NER module here, which poses a bigger problem. It would help (feature request) if we had a parameter to give the context-based module the highest priority, give it the lowest priority, or disable it completely (the last option might be interesting if you want to rely on gazetteers only, for instance for speed, which may be an important factor in the 112 project).
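As a sketch of what such a parameter could mean per token (the names `ContextPriority` and `resolve` are hypothetical, not an existing Frog option), with `None` standing for "this source has no tag":

```python
from enum import Enum


class ContextPriority(Enum):
    HIGHEST = "highest"    # context-based tag wins when both are present
    LOWEST = "lowest"      # gazetteer tag wins when both are present
    DISABLED = "disabled"  # context-based module is ignored entirely


def resolve(context_tag, gazetteer_tag, priority):
    """Pick one tag for a token under the given priority setting."""
    if priority is ContextPriority.DISABLED:
        return gazetteer_tag
    if priority is ContextPriority.HIGHEST:
        return context_tag if context_tag is not None else gazetteer_tag
    # LOWEST: prefer the gazetteer, fall back to the context-based tag
    return gazetteer_tag if gazetteer_tag is not None else context_tag
```

As far as described in this thread, the current behaviour corresponds to `HIGHEST`. In a real implementation, `DISABLED` would presumably also skip running the context-based classifier, which is where the speed gain would come from.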

Opinions?

Do we want to merge the new gazetteers into frogdata master despite the problems (the new lists are technically superior, i.e. more complete, with more fine-grained categories)? Or do we keep the status quo for now?

Irishx commented 6 years ago

I would like to test the NER with and without these gazetteers to see what the effect is.

proycon commented 6 years ago

That sounds like a good idea, yes. In any case, you can disable the gazetteers yourself by editing the Frog configuration file and ners.known.

kosloot commented 1 year ago

@proycon and @Irishx Can we close this as "solved" for now? Or?

Irishx commented 1 year ago

Hi,

Yes, we can close this.

Regards,
Iris
