Open fsasaki opened 8 years ago
Thank you for reporting this. I'll have a look as soon as possible.
I've found out that the problem lies in the nu.validator.htmlparser.dom.HtmlDocumentBuilder
. When parsing this document, it returns a hierarchy of nodes where
This means that the parsed file is interpreted as the following:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>dummy title</title>
</head>
<body>
<p id="n7"><span class="test1">Welcome to the city of <span class="test2"
>Berlin!</span></span></p>
</body>
</html>
So this cannot be fixed in e-Internationalization. Suggestions are very welcome.
Thanks for checking, @borriellom . I'll see if @sideshowbarker has any suggestions what to do.
So to be clear, the input source being discussed here is this:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>dummy title</title>
</head>
<body>
<p id="n7"><span class="test1"></span>Welcome to the city of <span class="test2"
></span>Berlin!</p>
</body>
</html>
I've found out that the problem lies in the
nu.validator.htmlparser.dom.HtmlDocumentBuilder
. When parsing this document, it returns a hierarchy of nodes where
- the text "Welcome to the city of " is child of the first
<span>
element;- the second
<span>
element is child of the first<span>
element;- the text "Berlin!" is child of the second
<span>
element.
If so, that’s an unintentional bug in nu.validator.htmlparser.dom.HtmlDocumentBuilder
. But I am somewhat doubtful it’s actually doing what’s described above.
This means that the parsed file is interpreted as the following:
<html xmlns="http://www.w3.org/1999/xhtml"> <head> <title>dummy title</title> </head> <body> <p id="n7"><span class="test1">Welcome to the city of <span class="test2" >Berlin!</span></span></p> </body> </html>
That doesn’t match the behavior of the htmlparser code that I see in the W3C validator where it’s used, nor in the Firefox gecko engine where it’s used.
cc’ing @hsivonen (the author of the code) to see if he might have time to provide any insights
Maybe the following could be helpful in understanding what's going on. I simply invoked the "parse" method from the HtmlDocumentBuilder class by passing as parameter the file provided by Felix. By inspecting the document returned by the method, I found that the node hierarchy was wrong and it is exactly what I described in the comment above. In more detail, in the right hierarchy the span nodes shouldn't have any children node. In the actual returned hierarchy they have children instead. The htmlparser version used by Okapi is 1.4.
To check this further I tried the standalone version of the HTML parser, see https://about.validator.nu/htmlparser/ https://about.validator.nu/htmlparser/htmlparser-1.4.zip calling the parser with the input file e.g. via java -cp htmlparser-1.4.jar nu.validator.htmlparser.tools.HTML2HTML in.html gives me the output that @sideshowbarker described. @borriellom , maybe the reason for the differences is the way to call the parser, though I don't know what to change in the e-internationalisation code.
I apologize, I've realized I was trying with the following input:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>dummy title</title>
</head>
<body>
<p id="n7"><span class="test1" />Welcome to the city of <span class="test2"
/>Berlin!</p>
</body>
</html>
In this case the behaviour is that I described in my previous comments. With the input from Felix it works. I'll continue investigating to spot the error.
The fix is good, but it had side effects: things like <meta>
are now output <meta></meta>
, and that is invalid HTML5. Void elements such as <meta>
must not have an end-tag (See https://www.w3.org/TR/html-markup/syntax.html#void-element)
One solution is to close the empty element with />
instead of </elemname>
, I think that would be Ok since a start-tag is defined as a tag that can have "Optionally, a "/" character, which may be present only if the element is a void element."
But then the case of elements like <span></span>
will be output as <span/>
which would be OK from the HTML5 viewpoint, but then since <span/>
seems to not be read properly as an empty element by the parser, we may introduce an issue for the next time the file is parsed.
The alternative is to use />
for void elements and </elemname>
for anything else. I'll go with that, except if anyone has a better idea, or a fix for the <span/>
problem.
On a second thought going for <meta/>
may not be very nice as it'll change (without good reason) the syntax of many files. So we could just do <elemname>
for void elements.
I would suggest to have a check for the element name. If it is in the list of void elements, have no no end tag; if not, have the end tag. So you can differentate <meta> from <span></span>. In that way the output would be "clean" HTML5 syntax. The code gets a bit longer but not that much - https://www.w3.org/TR/html-markup/syntax.html#void-element is a small list after all.
I did as suggested by Felix. The change is here: https://bitbucket.org/okapiframework/okapi/commits/b057427ef030668fc35f0f6ec68ad6d77f5859bc#chg-okapi/filters/its/src/main/java/net/sf/okapi/filters/its/ITSFilter.java and the maven snapshot repository should get re-build in about 12 hours.
Yes, there is an occurrence of the its-ta-class-ref
attribute for each value returned by the FREME NER service. Is the syntax not correct? Please, let me know how attributes with multiple values should be written, and I'll change the code properly as soon as possible.
Thanks for the explanation in #11, Marta. The ITS spec does not specify this, unfortunately. I would suggest to write just one attribute and have in the attribute a space separated list of values.
Thank you Felix, I'll do that.
The issue is fixed (both empty elements and multiple valued attributes). I've just committed the changes on master branch. I also edited the POM file, so that now it points to the last Okapi SNAPSHOT version and we can use Yves' changes. Many thanks to @ysavourel, now empty elements in HTML5 are properly managed by Okapi.
Great, thanks a lot, @borriellom and @ysavourel .
@borriellom , I just tried this and still got a strange behaviour of empty elements and duplication of attributes. Is my request
curl -X POST --header "Content-Type:text/html" -d @temp/tei-as-html.html "http://api.freme-project.eu/current/e-entity/freme-ner/documents?informat=text%2Fhtml&outformat=text%2Fhtml&language=fr&dataset=dbpedia&mode=all" > temp/tei-as-html-enriched.html
wrong? See attached input file (tei-as-html.html) and the output files generated via above request = the dev. api (tei-as-html-enriched-with-api-dev.txt) and via the 0.4 version api (tei-as-html-enriched.txt). tei-as-html-enriched-with-api-dev.txt tei-as-html-enriched.txt tei-as-html.txt
I tried to process your file with my code and it worked fine. I think that the new version has not been deployed yet. I thought that the master branch was built and deployed each time we commit changes, but maybe I was wrong. Please, try again when you are sure that the new version has been deployed.
Thanks, Marta, will do.
I thought that the master branch was built and deployed each time we commit changes, but maybe I was wrong. Please, try again when you are sure that the new version has been deployed.
Indeed, but the build failed; the dependency on Okapi was not resolved. For details, see http://rv1443.1blu.de:8082/job/e-Internationalization/28/ (you can also inspect the console output there), for a general overview of FREME builds, see http://rv1443.1blu.de:8082/.
Yes, maybe a build of Okapi dev is needed for including the last okapi snapshot in the local maven repository as described by @ghsnd in #11
As a temporary solution I can restore the pom.xml file. So the build won't fail and all multiple valued attributes will be well managed. The empty tags won't be handled properly until the newest version of Okapi is linked to the project. Let me know how to proceed.
Makes sense to me. About "a build of Okapi dev is needed": who can initiate that?
2016-01-28 10:16 GMT+01:00 borriellom notifications@github.com:
As a temporary solution I can restore the pom.xml file. So the build won't fail and all multiple valued attributes will be well managed. The empty tags won't be handled properly until the newest version of Okapi is linked to the project. Let me know how to proceed.
— Reply to this email directly or view it on GitHub https://github.com/freme-project/e-Internationalization/issues/30#issuecomment-176077413 .
I think the simplest solution is to build Okapi dev just once manually, and install it in the maven repo on the FREME server(s). Maybe @jnehring can give some advice here?
The fix is in 0.30-SNAPSHOT, so it'll be in http://repository-okapi.forge.cloudbees.com/snapshot until the next release.
I think the simplest solution is to build Okapi dev just once manually, and install it in the maven repo on the FREME server(s). Maybe @jnehring can give some advice here?
I added the Okapi snapshot repository to e-Internationalizations pom.xml. With this update everyone can build e-Internationalization without building Okapi manually. It downloads the dependency from the maven repository. This update also fixes the build problems on our Jenkins server.
We want to do a release on friday. It is problematic to include a snapshot dependency in the release because it is not possible to repeat the release. When we want to rebuild this release in a few months then we get a different result because the snapshot artifact has changed or does not exist anymore. But I think this is ok for now. Usually the release would fail with a snapshot dependency but we can hack or way around this issue.
@borriellom can we close this issue?
@borriellom made a fix on multiple values, see this discussion https://github.com/freme-project/e-Internationalization/issues/30#issuecomment-175502075 on multiple values. The fix I proposed https://github.com/freme-project/e-Internationalization/issues/30#issuecomment-175528125 is not allowed by the ITS2 spec and is reported as an error by the W3C validator. @borriellom and @jnehring , if there is still time I would then propose a different fix: if there are multiple values then (and only then) add a data-its-ta-ident-refs attribute which contains the multiple values, and have one attribute its-ta-ident-ref that has one value. e.g. instead of
its-ta-class-ref="http://nerd.eurecom.fr/ontology#Person http://dbpedia.org/ontology/Person http://dbpedia.org/ontology/Agent http://dbpedia.org/ontology/Governor http://dbpedia.org/ontology/Politician"
have
its-ta-class-ref="http://dbpedia.org/ontology/Politician" data-its-ta-class-refs="http://nerd.eurecom.fr/ontology#Person http://dbpedia.org/ontology/Person http://dbpedia.org/ontology/Agent http://dbpedia.org/ontology/Governor http://dbpedia.org/ontology/Politician"
Sounds like a good solution to me. @borriellom is there still time to implement this in the next days? We want to release on friday.
Done. Please, @fsasaki check if it works as expected.
Works perfect, thanks! One question: does it also happen for the its-ta-ident-ref that you have several URIs? In that case I would suggest the same mechanism. If that, I'm fine with the current state.
Empty HTML elements are not handled properly. See attached input and output. The span elements are empty in the input and in the output they contain text content. I browsed the code, the issue may be here https://github.com/freme-project/e-Internationalization/blob/master/src/main/java/eu/freme/i18n/okapi/nif/converter/HTMLBackConverter.java This may not have handling of empty elements.
input.txt output.txt