HTML enrichment of tag attributes

borriellom commented 9 years ago

I have just realized about a problem with HTML files converted to NIF by Okapi. Some tag attributes (i.e. content in meta tag, alt in img tag) are considered as text units by Okapi and they are included in the context of the NIF file. Once the NIF file is enriched, when converting back to the original HTML file, it could happen that some attributes have been enriched, thus creating an invalid HTML file.

Example from the test related to issue #9 Original HTML

<meta name="keywords" content="VistaTEC, Localization, Translation, MT, Multilingual, Langauge, Review, Multilingual SEO,website globalization, Transcreation, Brand optimisation, Global, Content, Agile">

Enriched HTML

<meta content="<span its-ta-class-ref="http://www.w3.org/2002/07/owl#Thing">VistaTEC</span>, <span its-ta-ident-ref="http://dbpedia.org/resource/Localization_of_a_ring" its-ta-class-ref="http://www.w3.org/2002/07/owl#Thing">Localization</span>, <span its-ta-ident-ref="http://dbpedia.org/resource/Translation" its-ta-class-ref="http://www.w3.org/2002/07/owl#Thing">Translation</span>, <span its-ta-ident-ref="http://dbpedia.org/resource/Mountain_Time_Zone" its-ta-class-ref="http://nerd.eurecom.fr/ontology#Location">MT</span>, <span its-ta-ident-ref="http://dbpedia.org/resource/Multilingualism" its-ta-class-ref="http://www.w3.org/2002/07/owl#Thing">Multilingual</span>, <span its-ta-ident-ref="http://dbpedia.org/resource/First_language" its-ta-class-ref="http://nerd.eurecom.fr/ontology#Location">Langauge</span>, <span its-ta-ident-ref="http://dbpedia.org/resource/Review" its-ta-class-ref="http://www.w3.org/2002/07/owl#Thing">Review</span>, <span its-ta-ident-ref="http://dbpedia.org/resource/Multilingualism" its-ta-class-ref="http://www.w3.org/2002/07/owl#Thing">Multilingual</span> SEO,website globalization, <span its-ta-ident-ref="http://dbpedia.org/resource/Transcreation" its-ta-class-ref="http://www.w3.org/2002/07/owl#Thing">Transcreation</span>, <span its-ta-ident-ref="http://dbpedia.org/resource/Brand" its-ta-class-ref="http://www.w3.org/2002/07/owl#Thing">Brand</span> optimisation, <span its-ta-ident-ref="http://dbpedia.org/resource/Global_Television_Network" its-ta-class-ref="http://nerd.eurecom.fr/ontology#Organization">Global</span>, <span its-ta-ident-ref="http://dbpedia.org/resource/Content_(media)" its-ta-class-ref="http://nerd.eurecom.fr/ontology#Organization">Content</span>, Agile" name="keywords">

At the moment I don't know a simple way to discern between actual text units and text from attributes. I'll try to find out.

ysavourel commented 9 years ago

I think you should be able to see where the text unit is coming from in the restype attribute. For example the restype value for the keywords content is probably x-content.

borriellom commented 9 years ago

Thank you Yves. I checked, but I always receive TextUnit events with type field set to null.

ysavourel commented 9 years ago

Ah, with the ITS/HTML5 Filter that is possibly right. (I was thinking about the default HTML Filter).

Well I guess we should modify the ITS/HTNL5 filter to try to fill in the type. It's curious that it is not done already. Maybe there is a reason.

ghsnd commented 8 years ago

For us, this method can be the place to fill in the type of the TextUnit tu, in case it's an attribute. This could just be a quick fix, not a solid solution.

ghsnd commented 8 years ago

So here's the method:

private String addAttributeTextUnit (Attr attr,
        boolean addToSkeleton)
    {
        String id = String.valueOf(++tuId);
        ITextUnit tu = new TextUnit(id, attr.getValue(), true, mimeType);

        // Deal with inline codes if needed
        if ( params.useCodeFinder ) {
            applyCodeFinder(tu.getSource().getFirstContent());
            //TODO: we could have target entries too in some case (ITS targePointer) so we should test for that
            // and process the target too if needed.
        }

        // Set the ITS context for this attribute and set the relevant properties
        // (we could use directly trav, but this allows to avoid many trav.getXYZ() calls)
        // Note however, that trav should be used in some cases where the context alone is not enough
        ContextItem ci = new ContextItem(
            (context.isEmpty() ? attr.getParentNode() : context.peek().node), trav, attr);
        processTextUnit(tu, ci, null, attr);

        queue.add(new Event(EventType.TEXT_UNIT, tu));
        if ( addToSkeleton ) skel.addReference(tu);
        return id;
    }

Especially the line:

ITextUnit tu = new TextUnit(id, attr.getValue(), true, mimeType);

creates a TextUnit without the type field filled in. Since we know it is from an attribute, we could do something like:

tu.setType("...");   // where "attribute" could be filled in. But I don't know the right type for attribute.

Then later, in eu.freme.i18n.okapi.nif.filter.NifWriterFilter this can be used to check if the text comes from an attribute and can be skipped.

But I have no "ready-to-use" code here.

ysavourel commented 8 years ago

The setType() call has been added (in the latest dev branch). See https://bitbucket.org/okapiframework/okapi/commits/92bd02d6b893c4c8bf825978a156dfb57aea4c83

ghsnd commented 8 years ago

The setType() call has been added (in the latest dev branch).

Thanks a lot!

ghsnd commented 8 years ago

OK, so here's how you can fix the issue.

install the dev version of Okapi

git clone https://bitbucket.org/okapiframework/okapi.git
git checkout dev
cd okapi
mvn -DskipTests clean install
cd okapi/filters
mvn -DskipTests install

This installs the latest dev branch on your local maven repo

Check out branch issue_11

Check out branch issue_11 of e-Internationalization. There you see modifications in the pom.xml and in eu.freme.i18n.okapi.nif.step.NifWriterStep. They fix the issue, but I leave it to the experts of Vistatec to review this!

failing test

The test eu.freme.i18n.okapi.api.EInternationalizationAPITest#testEInternationalizationAPIHTML will fail because the models are not the same: in the expected model, the text in the attribute is still present, while in the actual resulting model it is gone (which is good).

borriellom commented 8 years ago

I reviewed the changes made by @ghsnd. They are fine! In order to make round-tripping work, I made further changes in the NifSkeletonWriterFilter and NifSkeletonMarkerHelper. I also edited the expected file in the eu.freme.i18n.okapi.api.EInternationalizationAPITest#testEInternationalizationAPIHTML test.

Changes have been committed in the branch issue_11.

borriellom commented 8 years ago

I finally spotted the error! It was in okapi ITSFilter. In both methods addStartTagToSkeleton and addStartTagToFragment there was a line of code closing the tag only if it is not HTML5 and it doesn't have children.

tmp.append("<"
            + ((node.getPrefix()==null) ? "" : node.getPrefix()+":")
            + node.getLocalName());

/*  adding attributes */

if ( !isHTML5 && !node.hasChildNodes() ) tmp.append("/");
tmp.append(">");

In both methods I added an else branch for dealing with HTML5 tags having no children.

tmp.append("<"
            + ((node.getPrefix()==null) ? "" : node.getPrefix()+":")
            + node.getLocalName());

/*  adding attributes */

if ( !isHTML5 && !node.hasChildNodes() ) tmp.append("/");
else if (isHTML5 && !node.hasChildNodes()) {
            tmp.append("></"
                    + ((node.getPrefix()==null) ? "" : node.getPrefix()+":")
                    + node.getLocalName());
        }
tmp.append(">");

I built Okapi with my changes and tried to execute the round-tripping on the document provided by Felix and it worked.

I hope @ysavourel could have a look at this code. If it is good, can it be included in the next Okapi release? Please, let me know if you need further information.

ysavourel commented 8 years ago

Will do. Thanks for debugging and finding the error.

fsasaki commented 8 years ago

cool! So it looks like this will also close issue #30 .

ysavourel commented 8 years ago

Just to be clear: Marta's fix works when the input is:

<span class="test1"></span>Welcome to...

not with:

<span class="test1" />Welcome to...

In that case we do get the span node with the a child node: The text that should be coming after. And Marta's fix (correctly) won't be triggered since the element is not empty.

borriellom commented 8 years ago

Yes, my fix only works with

<span class="test1"></span>Welcome to...

fsasaki commented 8 years ago

Thanks & that is ok, the example document attached to #30 https://github.com/freme-project/e-Internationalization/files/78541/input.txt also uses the syntax that you mention, with an explicit closing tag for the empty element.

borriellom commented 8 years ago

I've just realized that today I wrote my comments in issue 11. They are related to #30. Sorry about that.

fsasaki commented 8 years ago

Is this fixed, @borriellom ? Just asking because I would like to demo the functionality at the XML Prague conference as part of our presentation. I just tried with the developers API and I still got enriched attributes.

borriellom commented 8 years ago

It is fixed on the branch issue_11. It will be fixed in the deployed version when the branch will be merged with the master.

pheyvaer commented 8 years ago

@borriellom Do you have any idea when this will happen?

borriellom commented 8 years ago

I don't know. I wanted to wait for Jan, but I think we can merge as soon as possible, if it's necessary.

fsasaki commented 8 years ago

Thanks a lot for fixing #30, Marta. One question: is this issue #11 still waiting for a fix? Just asking on "Some tag attributes (i.e. content in meta tag, alt in img tag) are considered as text units by Okapi" Processing HTML with e-Internationalisation I still get attribute value processing, it seems.

borriellom commented 8 years ago

It is fixed but still not available on the deployed version. It will be available after the merge between master and issue_11 branches. I've just issued a pull request. I just wanted a confirmation from Jan before proceeding with the merge. Anyway if you are keen to have all issues fixed on the master branch I can merge them right now. Just let me know.

fsasaki commented 8 years ago

Thanks, Marta. OK by me to wait for Jan next week.

borriellom commented 8 years ago

Merge done

jnehring commented 8 years ago

@borriellom can we close this issue?

katia-vistatec commented 8 years ago

Yes, we can close the issue

freme-project / e-Internationalization