aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Punctuation and bad lowercase words in NER result #50

Open EmilStenstrom opened 8 years ago

EmilStenstrom commented 8 years ago

I just found polyglot, which seems to be FANTASTIC for dealing with all sorts of NLP problems in multiple languages. I want to use it for Swedish texts, so I got to work and tested it on some real-world texts.

Here's a random Swedish article: http://www.dn.se/nyheter/sverige/oligarken-som-ager-en-o-i-stockholm/ I manually copy-pasted the text into a txt file, downloaded the required Swedish models, and tried to get the NER tags from it:

from polyglot.text import Text

with open("test.txt") as f:
    text = Text(f.read())

for entity in text.entities:
    print(entity.tag, entity)

Here's a part of the output: ... I-LOC ['Stockholm'] I-LOC ['slovakisk'] I-PER ['Frantisek'] I-PER ['Jules', 'Verne'] I-PER ['Zvrskovec'] I-LOC ['Lidingö'] I-PER ['Bilspedition'] I-LOC ['Tjeckien'] I-PER ['oligarken'] I-LOC ['Indiana'] I-PER ['Indiana', 'Jones'] I-ORG ['Arlanda'] I-LOC ['Tjeckien'] I-LOC ['Stockholm'] I-PER ['Frantisek', 'Zvrskovec'] I-PER ['.'] I-PER ['Frantisek', 'Zvrskovec'] I-PER ['bottenskrevan'] I-LOC ['Stockholms'] I-PER ['landstigningsförbud'] I-PER ['helstängt'] I-PER ['Magnus', 'Hallgren'] I-PER ['Dividend'] I-ORG ['Central', 'Europe'] I-PER ['Zvrskovec'] I-PER ['.'] I-LOC ['Tjeckoslovakien'] I-LOC ['Dolny'] I-PER ['Dolny', 'Kubin'] I-PER ['.'] ...

...which is OK, but two things stand out:

  1. Many of the I-PER tags are just punctuation. Is this a known bug in polyglot?
  2. All the lowercase words in the example above are actually just common nouns, not people (let me know if you need a translation of the words to make sense of them). Is that also a bug in polyglot?

I guess I could write my own filter to remove punctuation and lowercase words, but it seems like this would be easier to solve at an earlier step, when training the models, don't you think?
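For reference, here's roughly what I mean by such a filter. This is just a standalone sketch (`keep_entity`, `PUNCT`, and the example pairs are my own illustrations, not part of polyglot): it drops candidates that are pure punctuation or entirely lowercase.

```python
import string

# Characters treated as punctuation, including the typographic quotes
# that show up in the output above.
PUNCT = set(string.punctuation) | set("“”‘’…")

def keep_entity(words):
    """Heuristic: keep a candidate entity only if it is not pure
    punctuation and at least one token starts with an uppercase letter."""
    if all(set(w) <= PUNCT for w in words):
        return False  # e.g. ['.'] or ['“']
    if not any(w[:1].isupper() for w in words):
        return False  # e.g. ['oligarken'] -- a lowercase common noun
    return True

# Illustrative (tag, tokens) pairs taken from the output above:
examples = [
    ("I-PER", ["Frantisek", "Zvrskovec"]),
    ("I-PER", ["."]),
    ("I-PER", ["oligarken"]),
    ("I-LOC", ["Stockholm"]),
]
kept = [(tag, words) for tag, words in examples if keep_entity(words)]
# kept retains only ['Frantisek', 'Zvrskovec'] and ['Stockholm']
```

But again, doing this as post-processing feels like treating the symptom rather than the cause.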

EmilStenstrom commented 8 years ago

I tried the same with an English article (http://www.theguardian.com/world/2016/apr/02/greece-violence-refugees-riot-forced-return-to-turkey) and got the same result: punctuation and lowercase words that don't seem to fit in at all:

... I-LOC ['Greece'] I-PER ['Kyritsis'] I-PER ['“'] I-PER ['haven’t'] I-LOC ['Greece'] I-LOC ['Chios'] I-LOC ['Turkey'] I-PER ['”'] I-PER ['Mustafa'] I-LOC ['Chios'] I-ORG ['Agence', 'France'] I-ORG ['Presse'] I-PER ['“'] I-PER ['”'] I-PER ['Benjamin', 'Julian'] I-LOC ['“'] I-LOC ['Turkey'] I-LOC ['Piraeus'] I-LOC ['Athens', '’'] I-LOC ['Lesbos'] I-LOC ['Idomeni'] I-LOC ['Macedonia'] I-LOC ['EU'] I-LOC ['Turkey'] I-LOC ['Europe'] I-LOC ['Greece'] I-LOC ['Athens'] ...

EmilStenstrom commented 8 years ago

Any ideas of how to improve these results further?

aboSamoor commented 8 years ago

I can confirm this issue; see this notebook that reproduces the problem: https://gist.github.com/anonymous/093e06f98a9c8b7de963a6fef6f15ceb

Unfortunately, no obvious fix comes to mind. Our models are not trained directly on human-annotated data; instead, we rely on weak annotations derived from Wikipedia. Our paper is available here:

http://arxiv.org/abs/1410.3791

EmilStenstrom commented 8 years ago

Thanks for the link to your paper. If I understand it correctly, you use Wikipedia links to extract entities. Could it be that this "link extraction" has a bug that makes it include quotes and dots in the actual entity? Is that code available somewhere?

Since a fairly large share of the entities are incorrect as a result of this, I think NER accuracy would improve considerably if we found a way to fix it. Let me know if I can help somehow!
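In the meantime, if boundary punctuation really is leaking into the extracted entities, a post-hoc cleanup could look like this (again just a sketch; `trim_entity` and `STRIP_CHARS` are hypothetical helpers, not part of polyglot):

```python
import string

# Punctuation to strip from token edges, including curly quotes
# like the ones in the English output above.
STRIP_CHARS = string.punctuation + "“”‘’…"

def trim_entity(words):
    """Strip punctuation from the edges of each token and drop tokens
    that were punctuation-only; may return an empty list."""
    trimmed = (w.strip(STRIP_CHARS) for w in words)
    return [w for w in trimmed if w]

print(trim_entity(["“", "Turkey"]))  # ['Turkey']
print(trim_entity(["Athens", "’"]))  # ['Athens']
print(trim_entity(["."]))            # []
```

Entities that trim down to an empty list could then simply be discarded. But fixing the training data would of course be the better solution.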

tgalery commented 8 years ago

It would be really useful to have a link to the tooling that generates the embeddings so the community can contribute. Wikipedia boilerplate removal and data enrichment can be a pain, and small amounts of noise in the embeddings can cause a lot of issues.

mattiasostmar commented 8 years ago

I stumbled upon polyglot today while looking for a tool to extract entities from multilingual metadata on Wikimedia Commons. Polyglot looks amazingly useful for a lot of the tasks Wikipedians do now, and will increasingly do in the future when migrating metadata from the many wikiprojects to Wikidata. I currently work for Wikimedia Sweden and would like to help improve the generation of the embeddings. Lots of synergies for the whole community, I'd say!

EmilStenstrom commented 8 years ago

@aboSamoor: Any idea how we can help out here?