alexhebing opened this issue 5 years ago
Ok, I tried to use the parser from #22 to parse the KB corpus to plain txt, but I stumbled upon two problems:
- [x] Even after `html.unescape()`, some HTML entities remain in the XMLs, and ElementTree breaks because of them. So far, I have identified `>`, `<`, `&` and `"`. These need to be removed somehow. @BeritJanssen: is there a trick for this that is utilized in I-Analyzer (or elsewhere that you know of)? (See also the sketch below this list.)
- [x] There appears to be a problem with the encoding of the file in some cases. Thus far, Python cannot read the files that have `000010472` in their title. It crashes with `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x84 in position 113948: invalid start byte`. Strangely enough, it does work for the other files I tested with (`urn=ddd_000010470_mpeg21_p002_alto.alto`, `urn=ddd_000010474_mpeg21_p002_alto.alto`, `urn=ddd_000011329_mpeg21_p002_alto.alto`, `urn=ddd_000014128_mpeg21_p001_alto.alto`). What is happening here? Do some files have a different encoding...? That would be totally weird...
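Not an I-Analyzer trick, just a generic workaround for the stray-entity problem: fall back to a recovering parser when strict parsing fails. A minimal sketch, assuming lxml is installed; the function name is made up:

```python
# Fall back to lxml's recovering parser when ElementTree chokes on the
# bare "&", "<", ">" and '"' characters left behind after html.unescape().
import xml.etree.ElementTree as ET
from lxml import etree  # third-party: pip install lxml

def parse_leniently(path):
    """Try strict parsing first; recover from malformed markup if needed."""
    try:
        return ET.parse(path).getroot()
    except ET.ParseError:
        parser = etree.XMLParser(recover=True)  # silently repairs/skips bad bits
        return etree.parse(path, parser).getroot()
```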
Check the offending character in the file in question. It's certainly not out of the question that the file is corrupted. I had to hand-correct a couple of files in the Times corpus, too.
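For what it's worth, the offending byte can be inspected without an editor; a quick sketch (the filename is a guess based on the naming pattern above, and 113948 is the offset from the traceback). Note that 0x84 is a valid character („) in Windows-1252, which would point at a mixed encoding rather than corruption:

```python
# Inspect the bytes around the position reported in the UnicodeDecodeError.
POS = 113948  # offset from the traceback above

with open("ddd_000010472_mpeg21_p002_alto.alto", "rb") as f:  # hypothetical filename
    f.seek(max(0, POS - 30))
    window = f.read(60)

print(window)                                      # raw bytes around the 0x84
print(window.decode("cp1252", errors="replace"))   # how it reads as Windows-1252
```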
Ok, I finally made it through all the files and have parsed the Golden Standard corpus from XML to TXT files. Pfff... About a quarter of the 99 files contained the problem with non-decodable bytes, and I handled them all manually. Does this have to do with OCR quality (e.g. extremely exotic characters?), or was a different encoding (than utf-8) used? If we ever establish contact again, ask Willem Jan from the KB about this. Anyhow, the cleaned `.alto` files are now in SurfDrive.
@jgonggrijp: is there a trick you use to handle this type of error in scripts dealing with large amounts of data? I am thinking along the lines of 1) ignoring the file and storing it somewhere, or 2) making a copy of the file without the byte at position X and trying to decode it again, repeating until it is decodable. But that is probably going to make things even messier. Anything that you do in your scripts that I can take a look at?
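A minimal sketch of option 1 (set the file aside for manual fixing), combined with a softer variant of option 2 that lets the codec substitute the bad bytes instead of deleting them one by one; the function and directory names are made up:

```python
# Option 1: quarantine files that fail strict UTF-8 decoding.
# Softer option 2: re-read with errors="replace" so bad bytes become U+FFFD
# instead of being removed one at a time.
from pathlib import Path

def read_or_quarantine(path, quarantine_dir="undecodable"):
    path = Path(path)
    try:
        return path.read_text(encoding="utf-8")  # strict: raises on bad bytes
    except UnicodeDecodeError:
        qdir = Path(quarantine_dir)
        qdir.mkdir(exist_ok=True)
        (qdir / path.name).write_bytes(path.read_bytes())  # keep a copy for manual fixing
        return path.read_text(encoding="utf-8", errors="replace")
```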
@alexhebing I haven't tried fixing such problems automatically. I think it requires strong intelligence. If it can be done with weak intelligence, I don't know how.
MultiNER, to my great frustration, performed awfully against the Italian Golden Standard provided by Lorella. The best score was with a configuration with `stanford` as leading package and 2 as other `packages_min`:
Something is wrong with either multiNER, the bio_converter, or the evaluation script, or all of them. Some issues that I found already are in #29 (this also includes a link to a multiNER issue). Fix these, test some more, and look for other issues.
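One way to sanity-check the evaluation script itself is to cross-check its numbers against an off-the-shelf scorer. A minimal sketch using seqeval, assuming the converted BIO files have one token and tag per line with blank lines between sentences (the file names are hypothetical):

```python
# Cross-check entity-level precision/recall/F1 with seqeval.
from seqeval.metrics import classification_report

def read_bio_tags(path):
    """Read a BIO file with 'token TAG' lines; blank line = sentence break."""
    sentences, current = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
            else:
                current.append(line.split()[-1])  # keep only the BIO tag
    if current:
        sentences.append(current)
    return sentences

gold = read_bio_tags("golden_standard.bio")   # hypothetical file names
pred = read_bio_tags("multiner_output.bio")
print(classification_report(gold, pred))
```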
For reference, here is the score that spaCy (run in isolation, i.e. separate from multiNER) got on the same Golden Standard. Due to issues with the way spaCy processes the text you feed it, it only used 163 of the 190 files:
In addition, I ran a test with Stanford on the same Golden Standard. @BeritJanssen, look at the amazing score it gets (on 171 of the 190 files):
If only multiNER didn't contain bugs, it would surely score similarly, and probably better :cry:
Wow, that's a different story. Well, then I would say: use Stanford for this experiment... I see there is some occasional activity on the KB repo. Maybe you can create an issue with your screenshots there? It's good for them (and others) to know that there are still some issues with the output.
I forgot to mention (and think of) the fact that the model used for Italian was actually trained on this dataset. In that regard, the results are not that surprising.
I am in contact with the people at KB about a PR, but Willem Jan (if I remember his name correctly) is currently absent due to illness. The issues that we experience, however, are probably due to my extensive rewriting of the code.
After #17, #19 and #25, test the performance of multiNER with different configurations (e.g. type_preference, leading packages); maybe also adjust how these work together (see also #15).
In any case, evaluate using the Dutch Historical corpus from the KB. If time permits (i.e. things go easy), also run some tests against the Italian I-CAB corpus.
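A small sketch of what such a sweep could look like: enumerate the configuration grid, then run multiNER plus the evaluation once per combination. The parameter names echo the ones mentioned above (leading package, other packages_min, type_preference), but the value lists are illustrative placeholders, not multiNER's actual option names or values:

```python
# Enumerate a configuration grid for the multiNER experiments.
# Keys and values below are illustrative placeholders.
from itertools import product

grid = {
    "leading_package": ["stanford", "spacy"],
    "other_packages_min": [1, 2, 3],
    "type_preference": ["leading", "majority"],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
for i, config in enumerate(configs, start=1):
    print(i, config)  # run multiNER + evaluation once for each of these
```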