EmilStenstrom opened this issue 8 years ago
I tried the same with an English article (http://www.theguardian.com/world/2016/apr/02/greece-violence-refugees-riot-forced-return-to-turkey) and got the same result: punctuation and lowercase words that don't seem to fit in at all:
... I-LOC ['Greece'] I-PER ['Kyritsis'] I-PER ['“'] I-PER ['haven’t'] I-LOC ['Greece'] I-LOC ['Chios'] I-LOC ['Turkey'] I-PER ['”'] I-PER ['Mustafa'] I-LOC ['Chios'] I-ORG ['Agence', 'France'] I-ORG ['Presse'] I-PER ['“'] I-PER ['”'] I-PER ['Benjamin', 'Julian'] I-LOC ['“'] I-LOC ['Turkey'] I-LOC ['Piraeus'] I-LOC ['Athens', '’'] I-LOC ['Lesbos'] I-LOC ['Idomeni'] I-LOC ['Macedonia'] I-LOC ['EU'] I-LOC ['Turkey'] I-LOC ['Europe'] I-LOC ['Greece'] I-LOC ['Athens'] ...
Any ideas of how to improve these results further?
I can confirm this issue; see this notebook that reproduces the problem: https://gist.github.com/anonymous/093e06f98a9c8b7de963a6fef6f15ceb
Unfortunately, nothing obvious comes to mind for how to fix this problem. Our models are not trained directly on human-annotated data; instead we rely on weak annotations derived from Wikipedia. Our paper is available here.
Thanks for the link to your paper. If I understand it correctly, you use Wikipedia links to extract entities. Could it be that this "link extraction" has a bug that makes it include quotes and periods in the extracted entities? Is that code available somewhere?
Since a fairly large number of the entities are incorrect as a result, I think the accuracy of NER would increase greatly if we found a way to fix this. Let me know if I can help somehow!
It would be really useful to have a link to the tooling that generates the embeddings so the community can contribute. Wikipedia boilerplate removal and data enrichment can be a pain, and small amounts of noise in the embeddings can cause a lot of issues.
I stumbled upon polyglot today while looking for a tool to extract entities from multilingual metadata on Wikimedia Commons. Polyglot looks amazingly useful for a lot of the tasks Wikipedians do now, and will increasingly do in the future when migrating metadata from the many wikiprojects to Wikidata. I currently work for Wikimedia Sweden and would like to help improve the generation of the embeddings. Lots of synergies for the whole community, I'd say!
@aboSamoor: Any idea how we can help out here?
I just found polyglot, which seems to be FANTASTIC for dealing with all sorts of NLP problems in multiple languages. I want to use it for Swedish texts so I got to work and tested it on some real world texts.
Here's a random Swedish article: http://www.dn.se/nyheter/sverige/oligarken-som-ager-en-o-i-stockholm/ I manually copy-pasted the text into a txt file, downloaded the required Swedish models, and tried to get the NER tags from it.
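For reference, this is roughly the code I used to produce the output below, using polyglot's standard NER API (it assumes the Swedish embedding and NER models are already downloaded; the filename `article.txt` is just my local copy of the pasted text):

```python
# Read the copy-pasted article and run polyglot NER on it with a
# Swedish language hint. Each entity is a chunk of words with a tag
# like I-PER / I-LOC / I-ORG, matching the output shown below.
from polyglot.text import Text

with open("article.txt", encoding="utf-8") as f:
    text = Text(f.read(), hint_language_code="sv")

for entity in text.entities:
    print(entity.tag, list(entity))
```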
Here's a part of the output: ... I-LOC ['Stockholm'] I-LOC ['slovakisk'] I-PER ['Frantisek'] I-PER ['Jules', 'Verne'] I-PER ['Zvrskovec'] I-LOC ['Lidingö'] I-PER ['Bilspedition'] I-LOC ['Tjeckien'] I-PER ['oligarken'] I-LOC ['Indiana'] I-PER ['Indiana', 'Jones'] I-ORG ['Arlanda'] I-LOC ['Tjeckien'] I-LOC ['Stockholm'] I-PER ['Frantisek', 'Zvrskovec'] I-PER ['.'] I-PER ['Frantisek', 'Zvrskovec'] I-PER ['bottenskrevan'] I-LOC ['Stockholms'] I-PER ['landstigningsförbud'] I-PER ['helstängt'] I-PER ['Magnus', 'Hallgren'] I-PER ['Dividend'] I-ORG ['Central', 'Europe'] I-PER ['Zvrskovec'] I-PER ['.'] I-LOC ['Tjeckoslovakien'] I-LOC ['Dolny'] I-PER ['Dolny', 'Kubin'] I-PER ['.'] ...
...which is OK, but two things stand out: punctuation marks (like '.') are tagged as entities, and lowercase common words (like 'oligarken' and 'helstängt') are tagged as names.
I guess I could write my own filter to remove punctuation and lowercase words, but it seems like this should be easier to solve at an earlier step, when training the models, don't you think?
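For anyone who needs a stopgap in the meantime, a post-processing filter like the one suggested above could look roughly like this (a minimal sketch: the function names are mine, and entities are assumed to be `(tag, [words])` pairs like those printed in this thread — `unicodedata` categories catch curly quotes such as '“', '”', and '’' that a simple `string.punctuation` check would miss):

```python
import unicodedata

def is_punct(token):
    """True if every character in the token is Unicode punctuation
    (category P*), which covers '.', '“', '”', '’', etc."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def filter_entities(entities):
    """Drop entities that are purely punctuation or all-lowercase words."""
    kept = []
    for tag, words in entities:
        words = [w for w in words if not is_punct(w)]
        if not words:
            continue  # entity was only punctuation, e.g. I-PER ['“']
        if all(w.islower() for w in words):
            continue  # lowercase common word, e.g. I-PER ['oligarken']
        kept.append((tag, words))
    return kept

entities = [("I-LOC", ["Stockholm"]), ("I-PER", ["“"]),
            ("I-PER", ["oligarken"]), ("I-PER", ["Frantisek", "Zvrskovec"])]
print(filter_entities(entities))
# [('I-LOC', ['Stockholm']), ('I-PER', ['Frantisek', 'Zvrskovec'])]
```

This only hides the symptom, of course — the mislabeled tokens still come from the weakly annotated training data, so fixing the link extraction would be the real solution.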