dstl / baleen

Entity Extraction Text Processor
Apache License 2.0

HTML5 output chokes on elements with newlines #41

Closed SJBertram closed 7 years ago

SJBertram commented 7 years ago

Sometimes, the HTML5 output will contain a visible HTML string in the form:

…most-of-entity-name" data-referent="" >start of entity
most-of-entity-name …

This is in the HTML source as " data-referent="" >.

From a few simple tests, this appears to happen when the tagged element contains a line-break (and hence the HTML5 output breaks it across paragraphs).

Using part of the NIST IE-ER data set (ieer-short.txt) and running it through a pipeline that uses OpenNLP results in ieer-short.html.txt.

Expected behaviour in this case is that National Convention Assembly is correctly tagged in the output without broken HTML.

jbaker-dstl commented 7 years ago

I think this could be as much an issue with HTML as a format as with Baleen. In HTML, tags can't overlap, so the following wouldn't be valid:

[p][span]National[/p][p]Convention Assembly[/span][/p]

So it's not immediately clear what we should do when trying to annotate entities that span multiple paragraphs. I don't think we should remove the paragraph break, as we should try to preserve the original format (there are some improvements coming in 2.4 that will do a better job of extracting the correct original formatting). Perhaps we should put in two sets of tags, such as:

[p][span]National[/span][/p] [p][span]Convention Assembly[/span][/p]

But this potentially causes other problems with determining if it is a single entity or two separate entities.
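As a rough illustration of the two-sets-of-tags idea, the sketch below splits one entity at each line break and emits one span per piece. It is not Baleen code: the function name and the `data-entity-id` attribute are hypothetical, and the shared attribute value is only one possible way of marking the pieces as a single entity.

```python
from html import escape


def split_span_by_paragraph(text, begin, end, entity_id):
    """Return one <span> fragment per line-break-separated piece of text[begin:end].

    `data-entity-id` is a hypothetical attribute used here to tie the
    pieces together; it is not part of Baleen's actual HTML5 output.
    """
    fragments = []
    for piece in text[begin:end].split("\n"):
        if piece:  # skip empty pieces produced by consecutive line breaks
            fragments.append(
                f'<span data-entity-id="{escape(entity_id)}">{escape(piece)}</span>'
            )
    return fragments


# Example: an entity covering characters 4..32 crosses one line break,
# so it is emitted as two spans carrying the same identifier.
text = "the National\nConvention Assembly met"
for fragment in split_span_by_paragraph(text, 4, 32, "e1"):
    print(fragment)
```

Each fragment can then be placed inside its own paragraph without any tags overlapping; a reader of the output would have to group spans by the shared identifier to recover the original entity.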

I'm open to ideas as to what the best way forwards might be. The HTML5 consumer was only ever really meant as a means of debugging output. If people are using it for other purposes, we should perhaps consider rewriting it to ensure it is more robust against these sorts of edge cases.

SJBertram commented 7 years ago

We're using HTML5 because it appears to be the only way to get a tagged-up document on disk, which we're using for comparing Baleen's accuracy to a known-accurate ground truth while having context.

I know HTML tags can't overlap, but I still feel that it shouldn't result in mangled HTML-appearing-as-text. That implies that something is creating invalid HTML and then trying to tidy it up/escape it (badly).

The added problem is that the paragraph tags in the HTML5 output are not always there in the original input. I've not got an example to hand, but I'm fairly sure I've seen the Tika parsing limit column widths or add line wrapping that isn't in the original text. That can result in the text being wrapped (and incorrect HTML being generated) even when the original text was good.

Also, the problem with that approach is that line breaks in text don't always mean paragraphs. Just look at the plain-text version of Project Gutenberg books - they're wrapped at a fixed width to make them readable, but it doesn't imply paragraphs. The annotators seem to parse sentences irrespective of white-space, so you can't really assume that a new line is a new paragraph.

The simplest solution would be to wrap everything in a single <div> tag and use white-space: pre (or even white-space: pre-line, since it is HTML5 and you're presumably targeting later standards, so IE8+ support shouldn't be a problem).
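A minimal sketch of that suggestion, written in Python rather than taken from Baleen's Html5 consumer: the function name, the inline style, and the use of the entity type as a class name are all assumptions for illustration. The text is escaped piece by piece and the line breaks are left in place, so a span can cross a line break without any tags overlapping.

```python
from html import escape


def render_pre_line(text, spans):
    """Wrap text in one <div> and insert a <span> for each (begin, end, type).

    Spans are assumed to be non-overlapping character offsets into `text`.
    Only the text is escaped; the markup is added separately, so escaped
    attribute fragments cannot leak into the visible output.
    """
    parts = ['<div style="white-space: pre-line">']
    cursor = 0
    for begin, end, entity_type in sorted(spans):
        parts.append(escape(text[cursor:begin]))   # plain text before the entity
        parts.append(f'<span class="{escape(entity_type)}">')
        parts.append(escape(text[begin:end]))      # entity text, line breaks kept
        parts.append("</span>")
        cursor = end
    parts.append(escape(text[cursor:]))            # remaining plain text
    parts.append("</div>")
    return "".join(parts)


# Example: one entity that crosses a line break stays inside a single span.
print(render_pre_line("the National\nConvention Assembly met", [(4, 32, "Organisation")]))
```

With white-space: pre-line the browser renders the line break inside the span as a visible break, so no paragraph tags are needed and the question of whether a new line means a new paragraph never has to be answered.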

jbaker-dstl commented 7 years ago

That sounds like a very sensible solution to me, and I've made the changes. I was unable to reproduce the issue you were having so I can't say for definite that it's been fixed, but certainly spans can now go over line breaks, so it's an improvement at any rate.

The change will be included in the next release.

SJBertram commented 7 years ago

It seems to happen reliably for me when I run the previously attached IE-ER data through the following pipeline:

collectionreader:
  class: FolderReader
  folders:
  - documents

annotators:
- language.OpenNLP
- language.OpenNLPParser
- class: stats.OpenNLP
  model: en-ner-person.bin
  type: Person
- class: stats.OpenNLP
  model: en-ner-organization.bin
  type: Organisation
- class: stats.OpenNLP
  model: en-ner-location.bin
  type: Location
- cleaners.RemoveLowConfidenceEntities
- cleaners.AddGenderToPerson
- cleaners.EntityInitials
- coreference.SieveCoreference
- cleaners.CorefBrackets
- cleaners.CorefCapitalisationAndApostrophe

consumers:
  - class: Html5
    outputFolder: output/opennlp/
    css: ../../annotate.css
  - class: csv.Coreference
    filename: output/opennlp/coreference.csv

jbaker-dstl commented 7 years ago

Addressed in Baleen 2.4