freme-project / e-Internationalization

Apache License 2.0
0 stars 0 forks source link

Handling of empty elements during roundtripping #30

Open fsasaki opened 8 years ago

fsasaki commented 8 years ago

Empty HTML elements are not handled properly. See attached input and output. The span elements are empty in the input and in the output they contain text content. I browsed the code, the issue may be here https://github.com/freme-project/e-Internationalization/blob/master/src/main/java/eu/freme/i18n/okapi/nif/converter/HTMLBackConverter.java This may not have handling of empty elements.

input.txt output.txt

borriellom commented 8 years ago

Thank you for reporting this. I'll have a look as soon as possible.

borriellom commented 8 years ago

I've found out that the problem lies in the nu.validator.htmlparser.dom.HtmlDocumentBuilder. When parsing this document, it returns a hierarchy of nodes where

This means that the parsed file is interpreted as the following:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>dummy title</title>
  </head>
  <body>
    <p id="n7"><span class="test1">Welcome to the city of <span class="test2"
      >Berlin!</span></span></p>
  </body>
</html>

So this cannot be fixed in e-Internationalization. Suggestions are very welcome.

fsasaki commented 8 years ago

Thanks for checking, @borriellom . I'll see if @sideshowbarker has any suggestions what to do.

sideshowbarker commented 8 years ago

So to be clear, the input source being discussed here is this:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>dummy title</title>
  </head>
  <body>
    <p id="n7"><span class="test1"></span>Welcome to the city of <span class="test2"
      ></span>Berlin!</p>
  </body>
</html>

I've found out that the problem lies in the nu.validator.htmlparser.dom.HtmlDocumentBuilder. When parsing this document, it returns a hierarchy of nodes where

  • the text "Welcome to the city of " is child of the first <span> element;
  • the second <span> element is child of the first <span> element;
  • the text "Berlin!" is child of the second <span> element.

If so, that’s an unintentional bug in nu.validator.htmlparser.dom.HtmlDocumentBuilder. But I am somewhat doubtful it’s actually doing what’s described above.

This means that the parsed file is interpreted as the following:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
   <title>dummy title</title>
 </head>
 <body>
   <p id="n7"><span class="test1">Welcome to the city of <span class="test2"
     >Berlin!</span></span></p>
 </body>
</html>

That doesn’t match the behavior of the htmlparser code that I see in the W3C validator where it’s used, nor in the Firefox gecko engine where it’s used.

cc’ing @hsivonen (the author of the code) to see if he might have time to provide any insights

borriellom commented 8 years ago

Maybe the following could be helpful in understanding what's going on. I simply invoked the "parse" method from the HtmlDocumentBuilder class by passing as parameter the file provided by Felix. By inspecting the document returned by the method, I found that the node hierarchy was wrong and it is exactly what I described in the comment above. In more detail, in the right hierarchy the span nodes shouldn't have any children node. In the actual returned hierarchy they have children instead. The htmlparser version used by Okapi is 1.4.

fsasaki commented 8 years ago

To check this further I tried the standalone version of the HTML parser, see https://about.validator.nu/htmlparser/ https://about.validator.nu/htmlparser/htmlparser-1.4.zip calling the parser with the input file e.g. via java -cp htmlparser-1.4.jar nu.validator.htmlparser.tools.HTML2HTML in.html gives me the output that @sideshowbarker described. @borriellom , maybe the reason for the differences is the way to call the parser, though I don't know what to change in the e-internationalisation code.

borriellom commented 8 years ago

I apologize, I've realized I was trying with the following input:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>dummy title</title>
  </head>
  <body>
    <p id="n7"><span class="test1" />Welcome to the city of <span class="test2"
      />Berlin!</p>
  </body>
</html>

In this case the behaviour is that I described in my previous comments. With the input from Felix it works. I'll continue investigating to spot the error.

ysavourel commented 8 years ago

The fix is good, but it had side effects: things like <meta> are now output <meta></meta>, and that is invalid HTML5. Void elements such as <meta> must not have an end-tag (See https://www.w3.org/TR/html-markup/syntax.html#void-element)

One solution is to close the empty element with /> instead of </elemname>, I think that would be Ok since a start-tag is defined as a tag that can have "Optionally, a "/" character, which may be present only if the element is a void element."

But then the case of elements like <span></span> will be output as <span/> which would be OK from the HTML5 viewpoint, but then since <span/> seems to not be read properly as an empty element by the parser, we may introduce an issue for the next time the file is parsed.

The alternative is to use /> for void elements and </elemname> for anything else. I'll go with that, except if anyone has a better idea, or a fix for the <span/> problem.

ysavourel commented 8 years ago

On a second thought going for <meta/> may not be very nice as it'll change (without good reason) the syntax of many files. So we could just do <elemname> for void elements.

fsasaki commented 8 years ago

I would suggest to have a check for the element name. If it is in the list of void elements, have no no end tag; if not, have the end tag. So you can differentate <meta> from <span></span>. In that way the output would be "clean" HTML5 syntax. The code gets a bit longer but not that much - https://www.w3.org/TR/html-markup/syntax.html#void-element is a small list after all.

ysavourel commented 8 years ago

I did as suggested by Felix. The change is here: https://bitbucket.org/okapiframework/okapi/commits/b057427ef030668fc35f0f6ec68ad6d77f5859bc#chg-okapi/filters/its/src/main/java/net/sf/okapi/filters/its/ITSFilter.java and the maven snapshot repository should get re-build in about 12 hours.

fsasaki commented 8 years ago

Same question as in #11. I tried to check the state of #30 with the attached input and the attached request. The attached output now also has the issue that attributes are dublicated. @borriellom , could you have a look? in.txt out.txt run.txt

borriellom commented 8 years ago

Yes, there is an occurrence of the its-ta-class-ref attribute for each value returned by the FREME NER service. Is the syntax not correct? Please, let me know how attributes with multiple values should be written, and I'll change the code properly as soon as possible.

fsasaki commented 8 years ago

Thanks for the explanation in #11, Marta. The ITS spec does not specify this, unfortunately. I would suggest to write just one attribute and have in the attribute a space separated list of values.

borriellom commented 8 years ago

Thank you Felix, I'll do that.

borriellom commented 8 years ago

The issue is fixed (both empty elements and multiple valued attributes). I've just committed the changes on master branch. I also edited the POM file, so that now it points to the last Okapi SNAPSHOT version and we can use Yves' changes. Many thanks to @ysavourel, now empty elements in HTML5 are properly managed by Okapi.

fsasaki commented 8 years ago

Great, thanks a lot, @borriellom and @ysavourel .

fsasaki commented 8 years ago

@borriellom , I just tried this and still got a strange behaviour of empty elements and duplication of attributes. Is my request

curl -X POST --header "Content-Type:text/html" -d @temp/tei-as-html.html "http://api.freme-project.eu/current/e-entity/freme-ner/documents?informat=text%2Fhtml&outformat=text%2Fhtml&language=fr&dataset=dbpedia&mode=all" > temp/tei-as-html-enriched.html

wrong? See attached input file (tei-as-html.html) and the output files generated via above request = the dev. api (tei-as-html-enriched-with-api-dev.txt) and via the 0.4 version api (tei-as-html-enriched.txt). tei-as-html-enriched-with-api-dev.txt tei-as-html-enriched.txt tei-as-html.txt

borriellom commented 8 years ago

I tried to process your file with my code and it worked fine. I think that the new version has not been deployed yet. I thought that the master branch was built and deployed each time we commit changes, but maybe I was wrong. Please, try again when you are sure that the new version has been deployed.

fsasaki commented 8 years ago

Thanks, Marta, will do.

ghsnd commented 8 years ago

I thought that the master branch was built and deployed each time we commit changes, but maybe I was wrong. Please, try again when you are sure that the new version has been deployed.

Indeed, but the build failed; the dependency on Okapi was not resolved. For details, see http://rv1443.1blu.de:8082/job/e-Internationalization/28/ (you can also inspect the console output there), for a general overview of FREME builds, see http://rv1443.1blu.de:8082/.

borriellom commented 8 years ago

Yes, maybe a build of Okapi dev is needed for including the last okapi snapshot in the local maven repository as described by @ghsnd in #11

borriellom commented 8 years ago

As a temporary solution I can restore the pom.xml file. So the build won't fail and all multiple valued attributes will be well managed. The empty tags won't be handled properly until the newest version of Okapi is linked to the project. Let me know how to proceed.

fsasaki commented 8 years ago

Makes sense to me. About "a build of Okapi dev is needed": who can initiate that?

2016-01-28 10:16 GMT+01:00 borriellom notifications@github.com:

As a temporary solution I can restore the pom.xml file. So the build won't fail and all multiple valued attributes will be well managed. The empty tags won't be handled properly until the newest version of Okapi is linked to the project. Let me know how to proceed.

— Reply to this email directly or view it on GitHub https://github.com/freme-project/e-Internationalization/issues/30#issuecomment-176077413 .

ghsnd commented 8 years ago

I think the simplest solution is to build Okapi dev just once manually, and install it in the maven repo on the FREME server(s). Maybe @jnehring can give some advice here?

ysavourel commented 8 years ago

The fix is in 0.30-SNAPSHOT, so it'll be in http://repository-okapi.forge.cloudbees.com/snapshot until the next release.

jnehring commented 8 years ago

I think the simplest solution is to build Okapi dev just once manually, and install it in the maven repo on the FREME server(s). Maybe @jnehring can give some advice here?

I added the Okapi snapshot repository to e-Internationalizations pom.xml. With this update everyone can build e-Internationalization without building Okapi manually. It downloads the dependency from the maven repository. This update also fixes the build problems on our Jenkins server.

We want to do a release on friday. It is problematic to include a snapshot dependency in the release because it is not possible to repeat the release. When we want to rebuild this release in a few months then we get a different result because the snapshot artifact has changed or does not exist anymore. But I think this is ok for now. Usually the release would fail with a snapshot dependency but we can hack or way around this issue.

@borriellom can we close this issue?

fsasaki commented 8 years ago

@borriellom made a fix on multiple values, see this discussion https://github.com/freme-project/e-Internationalization/issues/30#issuecomment-175502075 on multiple values. The fix I proposed https://github.com/freme-project/e-Internationalization/issues/30#issuecomment-175528125 is not allowed by the ITS2 spec and is reported as an error by the W3C validator. @borriellom and @jnehring , if there is still time I would then propose a different fix: if there are multiple values then (and only then) add a data-its-ta-ident-refs attribute which contains the multiple values, and have one attribute its-ta-ident-ref that has one value. e.g. instead of

its-ta-class-ref="http://nerd.eurecom.fr/ontology#Person http://dbpedia.org/ontology/Person http://dbpedia.org/ontology/Agent http://dbpedia.org/ontology/Governor http://dbpedia.org/ontology/Politician"

have

its-ta-class-ref="http://dbpedia.org/ontology/Politician" data-its-ta-class-refs="http://nerd.eurecom.fr/ontology#Person http://dbpedia.org/ontology/Person http://dbpedia.org/ontology/Agent http://dbpedia.org/ontology/Governor http://dbpedia.org/ontology/Politician"

jnehring commented 8 years ago

Sounds like a good solution to me. @borriellom is there still time to implement this in the next days? We want to release on friday.

borriellom commented 8 years ago

Done. Please, @fsasaki check if it works as expected.

fsasaki commented 8 years ago

Works perfect, thanks! One question: does it also happen for the its-ta-ident-ref that you have several URIs? In that case I would suggest the same mechanism. If that, I'm fine with the current state.