UUDigitalHumanitieslab / placenamedisambiguation

A pipeline of scripts that enables disambiguation of place names in a given corpus
MIT License

evaluate multiNER performance #12

Open · alexhebing opened 5 years ago

alexhebing commented 5 years ago

After #17, #19 and #25, test the performance of multiNER with different configurations (e.g. type_preference, leading packages), and maybe adjust how these work together (see also #15).

In any case, evaluate using the Dutch Historical corpus from the KB. If time permits (i.e. if things go smoothly), also run some tests against the Italian I-CAB corpus.

alexhebing commented 5 years ago

Ok, I tried to use the parser from #22 to parse the KB corpus to plain text, but I stumbled upon two problems:

jgonggrijp commented 5 years ago

> There appears to be a problem with the encoding of the file in some cases. Thus far, Python cannot read the files that have `000010472` in their title. It crashes with `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 113948: invalid start byte`. Strangely enough, it does work for the other files I tested with (urn=ddd_000010470_mpeg21_p002_alto.alto, urn=ddd_000010474_mpeg21_p002_alto.alto, urn=ddd_000011329_mpeg21_p002_alto.alto, urn=ddd_000014128_mpeg21_p001_alto.alto). What is happening here? Do some files have a different encoding...? That would be totally weird...

Check the offending character in the file in question. It's certainly not out of the question that the file is corrupted. I had to hand-correct a couple of files in the Times corpus, too.
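Something like this should show what is actually at that position (the file name here is a guess based on the pattern you mention; the offset comes from the traceback):

```python
path = "urn=ddd_000010472_mpeg21_p002_alto.alto"  # guessed name, adjust as needed
pos = 113948                                      # offset from the traceback

with open(path, "rb") as f:
    data = f.read()

print(hex(data[pos]))            # the offending byte (0x84)
print(data[pos - 30:pos + 30])   # raw bytes around it, for context
```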

alexhebing commented 5 years ago

Ok, I finally made it through all the files and have parsed the Golden Standard corpus from XML to TXT files. Pfff... About a quarter of the 99 files contained the problem with non-decodable bytes, and I handled them all manually. Does this have to do with OCR quality (e.g. extremely exotic characters?), or was a different encoding (than utf-8) used? If we ever establish contact again, ask Willem Jan from the KB about this. Anyhow, the cleaned .alto files are now in SurfDrive.

@jgonggrijp: is there a trick you use to handle this type of error in scripts dealing with large amounts of data? I am thinking along the lines of 1) ignoring the file and storing it somewhere, or 2) making a copy of the file without the byte at position X and trying to decode it again, repeating until it is decodable. But that is probably going to make things even messier. Is there anything you do in your scripts that I can take a look at?
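To make this concrete, something along these lines is what I have in mind (the directory name and the cp1252 guess are mine, not anything from the parser):

```python
# Rough sketch of the two options. 0x84 happens to be a valid byte in cp1252,
# which is why I would try that encoding before giving up on the file.
from pathlib import Path

set_aside = []  # option 1: collect files that are not valid UTF-8 and deal with them later

for alto_file in Path("kb_corpus").glob("*.alto"):
    raw = alto_file.read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        set_aside.append(alto_file)
        # option 2, without the copy-and-retry loop: either try another encoding,
        # or let Python drop the undecodable bytes directly.
        try:
            text = raw.decode("cp1252")
        except UnicodeDecodeError:
            text = raw.decode("utf-8", errors="ignore")  # drops the offending bytes
    # ... continue with `text` as before ...
```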

jgonggrijp commented 5 years ago

@alexhebing I haven't tried fixing such problems automatically. I think it requires strong intelligence. If it can be done with weak intelligence, I don't know how.

alexhebing commented 5 years ago

MultiNER, to my great frustration, performed awfully against the Italian Golden Standard provided by Lorella. The best score was with a configuration that had stanford as the leading package and 2 as the minimum for the other packages:

[screenshot: stanford_2]

Something is wrong with either multiNER, the bio_converter, or the evaluation script, or all of them. Some issues that I already found are in #29 (which also includes a link to a multiNER issue). Fix these, test some more, and look for other issues.
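As an independent check on the evaluation side, BIO-level scores can be computed with something like the sketch below (seqeval and the example tags are my own pick here, not what the evaluation script actually uses):

```python
from seqeval.metrics import classification_report, f1_score

# One list of BIO tags per sentence: gold annotations vs. system output.
y_true = [["B-LOC", "I-LOC", "O", "O", "B-PER"]]
y_pred = [["B-LOC", "O", "O", "O", "B-PER"]]

print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```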

For reference, here is the score that spaCy (run in isolation, i.e. separate from multiNER) got on the same Golden Standard (due to issues with the way spaCy processes the text you feed it, it only used 163 of the 190 files):

[screenshot: quick_spacy]
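Running spaCy in isolation boils down to something like the sketch below (the model name and paths are assumptions, not necessarily what the isolated run used):

```python
import spacy
from pathlib import Path

nlp = spacy.load("it_core_news_sm")  # assumed Italian model

for txt_file in Path("golden_standard_txt").glob("*.txt"):
    doc = nlp(txt_file.read_text(encoding="utf-8"))
    for ent in doc.ents:
        print(txt_file.name, ent.text, ent.label_, ent.start_char, ent.end_char)
```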

alexhebing commented 5 years ago

In addition, I ran a test with Stanford on the same Golden Standard. @BeritJanssen, look at the amazing score it gets (on 171 of the 190 files):

[screenshot: quick_stanford]

If only multiNER didn't contain bugs, surely it would score similarly, and probably better :cry:
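For reference, the isolated Stanford run amounts to something like this sketch via NLTK's wrapper (the classifier and jar paths are assumptions; the actual test may well have used a different interface, e.g. the CoreNLP server):

```python
from pathlib import Path
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

tagger = StanfordNERTagger(
    "italian-ner-model.ser.gz",  # hypothetical classifier (the real one was trained on this data)
    "stanford-ner.jar",          # path to the Stanford NER jar, also assumed
)

for txt_file in Path("golden_standard_txt").glob("*.txt"):
    tokens = word_tokenize(txt_file.read_text(encoding="utf-8"), language="italian")
    for token, label in tagger.tag(tokens):
        if label != "O":
            print(txt_file.name, token, label)
```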

BeritJanssen commented 5 years ago

Wow, that's a different story. Well, then use Stanford for this experiment, I would say... I see there is some occasional activity on the KB repo. Maybe you can open an issue with your screenshots there? It's good for them (and others) to know that there are still some issues with the output.

alexhebing commented 5 years ago

I forgot to mention (and consider) the fact that the model used for Italian was actually trained on this dataset. In that regard, the results are not that surprising.

I am in contact with the people at the KB about a PR, but Willem Jan (if I remember his name correctly) is currently absent due to illness. The issues that we are experiencing, however, are probably due to my extensive rewriting of the code.