Closed by SJBertram 7 years ago
I think this could be as much an issue with HTML as a format as with Baleen. In HTML, tags can't overlap, so the following wouldn't be valid:
[p][span]National[/p][p]Convention Assembly[/span][/p]
So it's not immediately clear what we should do when trying to annotate entities that span multiple paragraphs. I don't think we should remove the paragraph break, as we should try to preserve the original format (there are some improvements coming in 2.4 that will do a better job of extracting the correct original formatting). Perhaps we should put in two sets of tags, such as:
[p][span]National[/span][/p] [p][span]Convention Assembly[/span][/p]
But this potentially causes other problems with determining if it is a single entity or two separate entities.
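One way to keep the two-sets-of-tags idea while still being able to tell that the pieces belong to one entity is to give every piece the same identifier. The following is only a sketch of that idea in Python, not Baleen's actual code: the `data-entity` attribute, the function names, and the offset-based paragraph list are all assumptions made for illustration.

```python
from html import escape


def split_span(start, end, paragraphs):
    """Split the half-open span [start, end) at paragraph boundaries.

    `paragraphs` is a list of (p_start, p_end) half-open character offsets.
    Returns one (piece_start, piece_end) pair per paragraph the span overlaps.
    """
    pieces = []
    for p_start, p_end in paragraphs:
        lo, hi = max(start, p_start), min(end, p_end)
        if lo < hi:
            pieces.append((lo, hi))
    return pieces


def render(text, paragraphs, entity_id, start, end):
    """Render each paragraph as <p>, wrapping each piece of the entity in
    its own <span>. All pieces carry the same (hypothetical) data-entity
    value so a reader can recognise them as a single entity."""
    html = []
    for p_start, p_end in paragraphs:
        pieces = split_span(start, end, [(p_start, p_end)])
        if pieces:
            lo, hi = pieces[0]
            body = (
                escape(text[p_start:lo])
                + f'<span data-entity="{escape(entity_id)}">'
                + escape(text[lo:hi])
                + "</span>"
                + escape(text[hi:p_end])
            )
        else:
            body = escape(text[p_start:p_end])
        html.append(f"<p>{body}</p>")
    return "".join(html)
```

Anything that consumes the output would then need to group spans by the shared identifier rather than treating each span as a separate entity.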
I'm open to ideas as to what the best way forwards might be. The HTML5 consumer was only ever really meant as a means of debugging output. If people are using it for other purposes, we should perhaps consider rewriting it to ensure it is more robust against these sorts of edge cases.
We're using HTML5 because it appears to be the only way to get a tagged-up document on disk, which we're using for comparing Baleen's accuracy to a known-accurate ground truth while having context.
I know HTML tags can't overlap, but I still feel that it shouldn't result in mangled HTML-appearing-as-text. That implies that something is creating invalid HTML and then trying to tidy it up/escape it (badly).
The added problem is that the paragraph tags in the HTML5 output are not always there in the original input. I've not got an example to hand, but I'm fairly sure I've seen the Tika parsing limit column widths or add line wrapping that isn't in the original text. That can result in the text being wrapped (and incorrect HTML being generated) even when the original text was good.
Also, the problem with that approach is that line breaks in text don't always mean paragraphs. Just look at the plain-text version of Project Gutenberg books: they're wrapped at a fixed width to make them readable, but the wrapping doesn't imply paragraphs. The annotators seem to parse sentences irrespective of white-space, so you can't really assume that a new line is a new paragraph.
The simplest solution would be to wrap everything in a single <div> tag and use white-space: pre (or even white-space: pre-line, since it is HTML5 and you're presumably targeting later standards, so IE8+ support shouldn't be a problem).
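As a rough illustration of the single-<div> suggestion, the sketch below escapes the document text, inserts span tags by character offset, and leaves line breaks untouched so that the CSS decides how they are displayed. It is a minimal Python sketch under stated assumptions, not the change that was made in Baleen: the function name, the use of the entity type as a CSS class, and the assumption that spans do not overlap are all illustrative.

```python
from html import escape


def render_pre_line(text, spans):
    """Render `text` inside one <div>, wrapping each (start, end, label) span.

    Offsets are half-open character positions into `text`. Spans are
    assumed not to overlap each other (an assumption of this sketch).
    """
    out = []
    pos = 0
    for start, end, label in sorted(spans):
        out.append(escape(text[pos:start]))      # text before the entity
        out.append(f'<span class="{escape(label)}">')
        out.append(escape(text[start:end]))      # entity text, newlines kept
        out.append("</span>")
        pos = end
    out.append(escape(text[pos:]))               # trailing text
    return '<div style="white-space: pre-line">' + "".join(out) + "</div>"
```

Because no paragraph tags are generated, a span that contains a newline stays well-formed. Note that pre-line preserves line breaks but collapses runs of spaces, whereas pre preserves both.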
That sounds like a very sensible solution to me, and I've made the changes. I was unable to reproduce the issue you were having so I can't say for definite that it's been fixed, but certainly spans can now go over line breaks, so it's an improvement at any rate.
The change will be included in the next release.
It seems to happen reliably for me when I run the previously attached IE-ER data through the following pipeline:
collectionreader:
  class: FolderReader
  folders:
  - documents
annotators:
- language.OpenNLP
- language.OpenNLPParser
- class: stats.OpenNLP
  model: en-ner-person.bin
  type: Person
- class: stats.OpenNLP
  model: en-ner-organization.bin
  type: Organisation
- class: stats.OpenNLP
  model: en-ner-location.bin
  type: Location
- cleaners.RemoveLowConfidenceEntities
- cleaners.AddGenderToPerson
- cleaners.EntityInitials
- coreference.SieveCoreference
- cleaners.CorefBrackets
- cleaners.CorefCapitalisationAndApostrophe
consumers:
- class: Html5
  outputFolder: output/opennlp/
  css: ../../annotate.css
- class: csv.Coreference
  filename: output/opennlp/coreference.csv
Addressed in Baleen 2.4
Sometimes, the HTML5 output will contain a visible HTML string in the form:
This is in the HTML source as:
" data-referent="" >
From a few simple tests, this appears to happen when the tagged element contains a line-break (and hence the HTML5 output breaks it across paragraphs).
Using part of the NIST IE-ER data set (ieer-short.txt) and running it through a pipeline that uses OpenNLP results in ieer-short.html.txt.
Expected behaviour in this case is that National Convention Assembly is correctly tagged in the output without broken HTML.