Closed borriellom closed 8 years ago
I think you should be able to see where the text unit is coming from in the restype
attribute. For example the restype
value for the keywords content is probably x-content
.
Thank you Yves. I checked, but I always receive TextUnit events with type field set to null.
Ah, with the ITS/HTML5 Filter that is possibly right. (I was thinking about the default HTML Filter).
Well I guess we should modify the ITS/HTNL5 filter to try to fill in the type. It's curious that it is not done already. Maybe there is a reason.
For us, this method can be the place to fill in the type
of the TextUnit tu
, in case it's an attribute. This could just be a quick fix, not a solid solution.
So here's the method:
private String addAttributeTextUnit (Attr attr,
boolean addToSkeleton)
{
String id = String.valueOf(++tuId);
ITextUnit tu = new TextUnit(id, attr.getValue(), true, mimeType);
// Deal with inline codes if needed
if ( params.useCodeFinder ) {
applyCodeFinder(tu.getSource().getFirstContent());
//TODO: we could have target entries too in some case (ITS targePointer) so we should test for that
// and process the target too if needed.
}
// Set the ITS context for this attribute and set the relevant properties
// (we could use directly trav, but this allows to avoid many trav.getXYZ() calls)
// Note however, that trav should be used in some cases where the context alone is not enough
ContextItem ci = new ContextItem(
(context.isEmpty() ? attr.getParentNode() : context.peek().node), trav, attr);
processTextUnit(tu, ci, null, attr);
queue.add(new Event(EventType.TEXT_UNIT, tu));
if ( addToSkeleton ) skel.addReference(tu);
return id;
}
Especially the line:
ITextUnit tu = new TextUnit(id, attr.getValue(), true, mimeType);
creates a TextUnit without the type
field filled in. Since we know it is from an attribute, we could do something like:
tu.setType("..."); // where "attribute" could be filled in. But I don't know the right type for attribute.
Then later, in eu.freme.i18n.okapi.nif.filter.NifWriterFilter
this can be used to check if the text comes from an attribute and can be skipped.
But I have no "ready-to-use" code here.
The setType() call has been added (in the latest dev branch). See https://bitbucket.org/okapiframework/okapi/commits/92bd02d6b893c4c8bf825978a156dfb57aea4c83
The setType() call has been added (in the latest dev branch).
Thanks a lot!
OK, so here's how you can fix the issue.
git clone https://bitbucket.org/okapiframework/okapi.git
git checkout dev
cd okapi
mvn -DskipTests clean install
cd okapi/filters
mvn -DskipTests install
This installs the latest dev branch on your local maven repo
Check out branch issue_11 of e-Internationalization. There you see modifications in the pom.xml
and in eu.freme.i18n.okapi.nif.step.NifWriterStep
. They fix the issue, but I leave it to the experts of Vistatec to review this!
The test eu.freme.i18n.okapi.api.EInternationalizationAPITest#testEInternationalizationAPIHTML
will fail because the models are not the same: in the expected model, the text in the attribute is still present, while in the actual resulting model it is gone (which is good).
I reviewed the changes made by @ghsnd. They are fine! In order to make round-tripping work, I made further changes in the NifSkeletonWriterFilter
and NifSkeletonMarkerHelper
. I also edited the expected file in the eu.freme.i18n.okapi.api.EInternationalizationAPITest#testEInternationalizationAPIHTML
test.
Changes have been committed in the branch issue_11
.
I finally spotted the error! It was in okapi ITSFilter.
In both methods addStartTagToSkeleton
and addStartTagToFragment
there was a line of code closing the tag only if it is not HTML5 and it doesn't have children.
tmp.append("<"
+ ((node.getPrefix()==null) ? "" : node.getPrefix()+":")
+ node.getLocalName());
/* adding attributes */
if ( !isHTML5 && !node.hasChildNodes() ) tmp.append("/");
tmp.append(">");
In both methods I added an else branch for dealing with HTML5 tags having no children.
tmp.append("<"
+ ((node.getPrefix()==null) ? "" : node.getPrefix()+":")
+ node.getLocalName());
/* adding attributes */
if ( !isHTML5 && !node.hasChildNodes() ) tmp.append("/");
else if (isHTML5 && !node.hasChildNodes()) {
tmp.append("></"
+ ((node.getPrefix()==null) ? "" : node.getPrefix()+":")
+ node.getLocalName());
}
tmp.append(">");
I built Okapi with my changes and tried to execute the round-tripping on the document provided by Felix and it worked.
I hope @ysavourel could have a look at this code. If it is good, can it be included in the next Okapi release? Please, let me know if you need further information.
Will do. Thanks for debugging and finding the error.
cool! So it looks like this will also close issue #30 .
Just to be clear: Marta's fix works when the input is:
<span class="test1"></span>Welcome to...
not with:
<span class="test1" />Welcome to...
In that case we do get the span
node with the a child node: The text that should be coming after. And Marta's fix (correctly) won't be triggered since the element is not empty.
Yes, my fix only works with
<span class="test1"></span>Welcome to...
Thanks & that is ok, the example document attached to #30 https://github.com/freme-project/e-Internationalization/files/78541/input.txt also uses the syntax that you mention, with an explicit closing tag for the empty element.
I've just realized that today I wrote my comments in issue 11. They are related to #30. Sorry about that.
Is this fixed, @borriellom ? Just asking because I would like to demo the functionality at the XML Prague conference as part of our presentation. I just tried with the developers API and I still got enriched attributes.
It is fixed on the branch issue_11. It will be fixed in the deployed version when the branch will be merged with the master.
@borriellom Do you have any idea when this will happen?
I don't know. I wanted to wait for Jan, but I think we can merge as soon as possible, if it's necessary.
Thanks a lot for fixing #30, Marta. One question: is this issue #11 still waiting for a fix? Just asking on "Some tag attributes (i.e. content in meta tag, alt in img tag) are considered as text units by Okapi" Processing HTML with e-Internationalisation I still get attribute value processing, it seems.
It is fixed but still not available on the deployed version. It will be available after the merge between master and issue_11 branches. I've just issued a pull request. I just wanted a confirmation from Jan before proceeding with the merge. Anyway if you are keen to have all issues fixed on the master branch I can merge them right now. Just let me know.
Thanks, Marta. OK by me to wait for Jan next week.
Merge done
@borriellom can we close this issue?
Yes, we can close the issue
I have just realized about a problem with HTML files converted to NIF by Okapi. Some tag attributes (i.e. content in meta tag, alt in img tag) are considered as text units by Okapi and they are included in the context of the NIF file. Once the NIF file is enriched, when converting back to the original HTML file, it could happen that some attributes have been enriched, thus creating an invalid HTML file.
Example from the test related to issue #9 Original HTML
Enriched HTML
At the moment I don't know a simple way to discern between actual text units and text from attributes. I'll try to find out.