freme-project / e-Internationalization


handling of inline markup #5

Open fsasaki opened 9 years ago

fsasaki commented 9 years ago

Mail thread so far, see below. Please put links to discussions, examples etc. into this issue.

Mail from Milan, 7 September:

Below I drafted a simple example of how text in HTML/XML markup can be represented in NIF, together with provenance information (nif:wasConvertedFrom) pointing to the original HTML/XML document.

Your XML doc: <p>Welcome to <strong>Turin!</strong></p>

The corresponding NIF:

<http://freme-project.eu/doc1#char=0,17>
        a                     nif:Context , nif:String , nif:RFC5147String ;
        nif:beginIndex        "0" ;
        nif:endIndex          "17" ;
        nif:isString          "Welcome to Turin!"^^xsd:string .

<http://freme-project.eu/doc1#char=0,11>
        a                     nif:String , nif:RFC5147String ;
        nif:beginIndex        "0" ;
        nif:endIndex          "11" ;
        nif:anchorOf          "Welcome to "^^xsd:string ;
        nif:wasConvertedFrom  <http://freme-project.eu/doc.html&xpath=/p/text()[1]> .

<http://freme-project.eu/doc1#char=11,17>
        a                     nif:String , nif:RFC5147String ;
        nif:beginIndex        "11" ;
        nif:endIndex          "17" ;
        nif:anchorOf          "Turin!"^^xsd:string ;
        nif:wasConvertedFrom  <http://freme-project.eu/doc.html&xpath=/p/strong/text()[1]> .

This is actually the same (or very similar) approach of converting ITS->NIF and NIF->ITS.
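The conversion described above can be sketched in a few lines of Python. This is only an illustration, not the FREME implementation: the function names `text_nodes` and `to_segments` are made up for this example, and it assumes well-formed XML without namespaces. It walks the document in order, concatenates the text nodes into the plain string, and records for each node its begin/end offsets and the XPath of the text node it came from.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def text_nodes(elem, path):
    """Yield (xpath, text) for every text node under elem, in document order."""
    n = 0  # running count of text-node children of elem
    if elem.text:
        n += 1
        yield f"{path}/text()[{n}]", elem.text
    totals = Counter(child.tag for child in elem)
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        # Add a positional predicate only when siblings share the same tag.
        step = child.tag if totals[child.tag] == 1 else f"{child.tag}[{seen[child.tag]}]"
        yield from text_nodes(child, f"{path}/{step}")
        if child.tail:  # text that follows the child inside elem
            n += 1
            yield f"{path}/text()[{n}]", child.tail


def to_segments(xml):
    """Return the plain text plus (begin, end, text, xpath) for each text node."""
    root = ET.fromstring(xml)
    segments, offset = [], 0
    for xpath, text in text_nodes(root, "/" + root.tag):
        segments.append((offset, offset + len(text), text, xpath))
        offset += len(text)
    return "".join(s[2] for s in segments), segments


plain, segments = to_segments("<p>Welcome to <strong>Turin!</strong></p>")
print(plain)
for segment in segments:
    print(segment)
```

Each tuple carries what one nif:String resource in the example needs: nif:beginIndex, nif:endIndex, nif:anchorOf and the XPath part of nif:wasConvertedFrom.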

The NIF is then sent to a FREME e-Service, whose results should then be integrated back into the HTML/XML. Let's assume e-Entity recognizes "Turin!" as an entity. We then integrate this by attaching the annotation to the parent element of the text node. The text node is addressed with "/p/strong/text()[1]", which points to the text inside <strong>Turin!</strong>, and its parent element is <strong>.

We can then add the entity information (its link), and the final result will look like this:

<p>Welcome to <strong its-ta-ident-ref="http://dbpedia.org/resource/Turin">Turin!</strong></p>
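A minimal sketch of this integration step, assuming the XPath of the enriched text node is known from nif:wasConvertedFrom (here `/p/strong/text()[1]`, as in the NIF above). The function name `annotate_parent` is hypothetical. ElementTree does not support `text()` steps, so the sketch strips that step and resolves the remaining element path.

```python
import xml.etree.ElementTree as ET


def annotate_parent(xml, text_xpath, ident_ref):
    """Set its-ta-ident-ref on the parent element of the addressed text node."""
    root = ET.fromstring(xml)
    # "/p/strong/text()[1]" -> "/p/strong": the parent element of the text node.
    element_path = text_xpath.rsplit("/text()", 1)[0]
    steps = element_path.strip("/").split("/")
    if steps[0] != root.tag:
        raise ValueError(f"XPath does not start at root element <{root.tag}>")
    target = root if len(steps) == 1 else root.find("./" + "/".join(steps[1:]))
    if target is None:
        raise ValueError(f"no element found for {element_path}")
    target.set("its-ta-ident-ref", ident_ref)
    return ET.tostring(root, encoding="unicode")


print(annotate_parent(
    "<p>Welcome to <strong>Turin!</strong></p>",
    "/p/strong/text()[1]",
    "http://dbpedia.org/resource/Turin",
))
```

This only covers the case where the enrichment spans a whole text node. Overlaps and enrichments on substrings of a text node would need extra handling, such as wrapping the substring in a new element.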

There are other situations that we should discuss, such as overlaps and enrichments that are substrings of text nodes, but let's first see whether this small example is something we can work with.

CCing also Sebastian H. and Martin B.

Thanks, Milan

philinthecloud commented 9 years ago

Marta and I think that Milan's proposal should work for non-XLIFF formats.

I was confused because in XLIFF native tags are already abstracted away:

<p>Welcome to <strong>Turin</strong>!</p>

becomes

<source>Welcome to <bpt id="1"><strong></bpt>Turin<ept id="1"></strong></ept>!</source>

or maybe

<source>Welcome to <g id="1">Turin</g>!</source>

which means we have to be sure we can re-construct the native tag and add the necessary enrichment. Marta and I will do some further testing for our XLIFF case.
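To make the reconstruction concern concrete, here is a rough sketch for the `<g>` variant. It assumes that a side table mapping each `<g>` id to its native element name is available (for example from the XLIFF skeleton or the filter that produced the file). `NATIVE_TAGS`, `g_to_native` and the `<p>` wrapper are assumptions for illustration, and nested `<g>` elements are not handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical side table: <g> id -> native element name.
NATIVE_TAGS = {"1": "strong"}


def g_to_native(source_xml, native_tags, annotations=None, wrapper="p"):
    """Rebuild native markup from a flat <source> with <g> placeholders.

    annotations maps a <g> id to extra attributes (the enrichment) to add.
    """
    annotations = annotations or {}
    source = ET.fromstring(source_xml)
    out = ET.Element(wrapper)
    out.text = source.text
    for g in source:
        gid = g.get("id")
        native = ET.SubElement(out, native_tags[gid], annotations.get(gid, {}))
        native.text, native.tail = g.text, g.tail
    return ET.tostring(out, encoding="unicode")


print(g_to_native(
    '<source>Welcome to <g id="1">Turin</g>!</source>',
    NATIVE_TAGS,
    {"1": {"its-ta-ident-ref": "http://dbpedia.org/resource/Turin"}},
))
```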

When parsing content that already has ITS tags the NIF step will have to know how to identify them and create equivalent NIF properties and use the itsrdf ontology to identify them.

For non-ITS tags we think we should keep track of them but not put them into the NIF stream: the NIF stream would then just contain plain text.
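One way to "keep track" of the tags while sending only plain text is to record each tag together with the plain-text offset at which it occurred. The sketch below uses Python's standard `html.parser`; the names `MarkupStripper`, `strip_markup` and `restore_markup` are invented for this example. It is limited to simple markup: character references such as `&amp;` are decoded in the plain text and would not be re-escaped on the way back.

```python
from html.parser import HTMLParser


class MarkupStripper(HTMLParser):
    """Collect plain text and remember where each tag sat in it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.length = 0
        self.tags = []  # (plain-text offset, raw tag string)

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.length, self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are recorded once, exactly as written.
        self.tags.append((self.length, self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.tags.append((self.length, f"</{tag}>"))

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)


def strip_markup(html):
    parser = MarkupStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks), parser.tags


def restore_markup(plain, tags):
    """Re-insert the recorded tags at their plain-text offsets."""
    out, pos = [], 0
    for offset, tag in tags:
        out.append(plain[pos:offset])
        out.append(tag)
        pos = offset
    out.append(plain[pos:])
    return "".join(out)


html = "<p>Welcome to <strong>Turin</strong>!</p>"
plain, tags = strip_markup(html)
print(plain)
print(tags)
print(restore_markup(plain, tags) == html)
```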

philinthecloud commented 9 years ago

During the call on 2015-09-09 [@philinthecloud, @borriellom, @m1ci, @jnehring] it was proposed that, to facilitate round-tripping, two contexts should be included in the generated NIF: (a) the complete document including all markup, and (b) plain text only.

Text offsets would refer to the plain-text-only context. The encoding of the plain text context would need to be UTF-8.
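A point worth keeping in mind with UTF-8: character offsets and byte offsets into the same text differ as soon as a non-ASCII character appears. The sketch below only illustrates that difference. Which unit a given consumer of the offsets counts in is not stated in this thread, so treat that as an open assumption; the helper name `char_to_byte_offset` and the sample sentence are made up.

```python
def char_to_byte_offset(text, char_offset):
    """Convert a character offset into the matching UTF-8 byte offset."""
    return len(text[:char_offset].encode("utf-8"))


text = "Willkommen in Zürich!"
begin = text.index("Zürich")
end = begin + len("Zürich")

byte_begin = char_to_byte_offset(text, begin)
byte_end = char_to_byte_offset(text, end)

print((begin, end), (byte_begin, byte_end))
# Slicing the encoded bytes with the byte offsets recovers the same substring.
print(text.encode("utf-8")[byte_begin:byte_end].decode("utf-8"))
```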

@m1ci Could you update Felix's example above with the additional "native document" context?

@philinthecloud and @borriellom to confirm timelines for NIF step enhancement in order of priority:

  1. inline markup
  2. ITS markup
  3. support for converting additional native file formats.

fsasaki commented 9 years ago

Is it possible to have a general solution for this? E.g. the original format may allow for <span its-ta-ident-ref=".."> or <span its:taIdentRef="...">; how can the conversion process "know" which original markup to produce?


jnehring commented 9 years ago

Is it possible to have a general solution for this? E.g. the original format may allow for <span its-ta-ident-ref=".."> or <span its:taIdentRef="...">; how can the conversion process "know" which original markup to produce?

I am not sure about the difference between the two, but I think one is HTML and the other is XLIFF. The outformat parameter can be set to produce either HTML or XLIFF.

m1ci commented 9 years ago

@philinthecloud, @borriellom: I extended the example and documented it here https://docs.google.com/document/d/1LWZm306shR1tLz8rCkirc6R5dFjbsz5xnB7rltwYINQ/edit Feel free to comment directly in the document (or here) if anything is unclear.

Is it possible to have a general solution for this? E.g. the original format may allow for <span its-ta-ident-ref=".."> or <span its:taIdentRef="..."> how can the conversion process "know" which original markup to produce?

See the example I've developed, which keeps the source document with its original markup, and only the enrichments are processed and integrated in the source document.

borriellom commented 9 years ago

@m1ci The example makes sense. I think that your solution could work. But you say that only Context-2 must be sent to FREME and this still makes sense. But how could this happen? Should I generate two NIF files: one containing Context-2 (the one sent to FREME) and one file (containing both contexts) to be used for integrating enrichments back to the original document? Or will the broker extract the proper context before sending it to the e-Services?

m1ci commented 9 years ago

Should I generate two NIF files: one containing Context-2 (the one sent to FREME) and one file (containing both contexts) to be used for integrating enrichments back to the original document?

Yes.

Or will the broker extract the proper context before sending it to the e-Services?

No, it's better to send just one context to the e-Services. At least at the moment, they expect a single NIF context as input.

jnehring commented 9 years ago

Why not merge the two NIF files? If we split them into two files, how can we pass the data down through the various stages of a single API request, or even through a pipeline? These problems are solved straightforwardly if we put both NIF files in the same POST body.

I see difficulties with this bit:

<http://freme-project.eu/doc2#char=0,17>
        a                     nif:Context , nif:String , nif:RFC5147String ;
        nif:beginIndex        "0" ;
        nif:endIndex          "17" ;
        nif:isString          "Welcome to Turin!"^^xsd:string .

<http://freme-project.eu/doc1#char=0,41>
        a                     nif:Context , nif:String , nif:RFC5147String ;
        nif:beginIndex        "0" ;
        nif:endIndex          "41" ;
        nif:isString          "<p>Welcome to <strong>Turin!</strong></p>"^^xsd:string .

Our APIs enrich all literals of triples with the nif:isString property. But in this case only doc2 should be enriched. doc1 cannot be enriched because it contains HTML markup.

Can we somehow add a "do not process" property to doc1? It would be something like itsrdf:translate "no", but targeted not only at translation but at all enrichment services.
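As a rough illustration of how such a flag could behave on the service side, the sketch below models each NIF context as a plain dictionary and skips any context that carries the flag. The property IRI `http://example.org/ns#doNotProcess` is purely hypothetical (no such property exists in NIF or ITS RDF), as is the function name `contexts_to_enrich`.

```python
# Hypothetical "do not process" property; not part of NIF or ITS RDF.
DO_NOT_PROCESS = "http://example.org/ns#doNotProcess"
IS_STRING = "nif:isString"


def contexts_to_enrich(contexts):
    """Return only the contexts an enrichment service should touch."""
    return [c for c in contexts if not c.get(DO_NOT_PROCESS, False)]


doc1 = {
    "@id": "http://freme-project.eu/doc1",
    IS_STRING: "<p>Welcome to <strong>Turin!</strong></p>",
    DO_NOT_PROCESS: True,  # native document with markup: leave untouched
}
doc2 = {
    "@id": "http://freme-project.eu/doc2",
    IS_STRING: "Welcome to Turin!",
}

for context in contexts_to_enrich([doc1, doc2]):
    print(context["@id"], repr(context[IS_STRING]))
```

With this design both contexts can travel in the same POST body, at the cost of every e-Service having to honour the new property.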

m1ci commented 9 years ago

how can we pass down the data through the various stages of a single API request or even through a pipeline?

You just send the doc2 RDF document with a single context, which is processed by one service or by multiple services in a pipeline, and the results are returned to the client (in our case that is Ocelot). I don't see any problem with this.

Our APIs enrich all literals of triples with the nif:isString property. But in this case only doc2 should be enriched. doc1 cannot be enriched because it contains HTML markup.

Yes, that's why I suggested sending RDF with just one NIF context, which does not contain the markup.

Can we somehow add a "do not process" property to doc1? It would be something like itsrdf:translate "no", but targeted not only at translation but at all enrichment services.

You can do this only with the e-Translation service and itsrdf:translate "no". BTW, if the solution above (sending one context) does not work, then we can talk about a new property.

philinthecloud commented 9 years ago

I like the idea of a new property rather than two files.

philinthecloud commented 9 years ago

@fsasaki

Is it possible to have a general solution for this? E.g. the original format may allow for <span its-ta-ident-ref=".."> or <span its:taIdentRef="..."> how can the conversion process "know" which original markup to produce?

Sorry, I do not understand.

fsasaki commented 9 years ago

Sorry, I see that my question is off-topic, forget about it for the moment - I'll ask again once the general process is working.

fsasaki commented 9 years ago

The solution proposed at https://docs.google.com/document/d/1LWZm306shR1tLz8rCkirc6R5dFjbsz5xnB7rltwYINQ/edit# shows how to enrich HTML content with one piece of enrichment, produced by e-Entity. The challenge is that FREME produces several pieces of enrichment, and storing all of them inline would bloat the content. One solution could be to generate span or other suitable elements with IDs in the original (HTML) format and then refer to these from a separate (script or other, format-dependent) element. That element can then contain as much enrichment as needed. See an example here https://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Feb/att-0246/multiple-ann-with-id-plus-standoff.html The example contains enrichment in XML, but the same approach could be realized with JSON-LD, e.g. having <span id="enrichment-1"> ... and in a "script element": "@id" : "enrichment-1"
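A small sketch of the JSON-LD variant of this stand-off idea: the enriched substring is wrapped in a span with an ID, and the enrichment itself goes into a separate script element that refers to that ID. The function name `standoff_html`, the `<p>` wrapper and the `taIdentRef` key are assumptions for illustration; the `"@id"` value follows the example given above.

```python
import json
from html import escape


def standoff_html(text, begin, end, enrichment, span_id="enrichment-1"):
    """Wrap text[begin:end] in a span and put the enrichment in a script element."""
    span = f'<span id="{span_id}">{escape(text[begin:end])}</span>'
    body = escape(text[:begin]) + span + escape(text[end:])
    # The stand-off block can hold any number of properties for the same ID.
    data = {"@id": span_id, **enrichment}
    script = f'<script type="application/ld+json">{json.dumps(data)}</script>'
    return f"<p>{body}</p>\n{script}"


print(standoff_html(
    "Welcome to Turin!", 11, 17,
    {"taIdentRef": "http://dbpedia.org/resource/Turin"},
))
```

Because the inline markup only carries the ID, adding further enrichments later changes the script element but not the text content.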

fsasaki commented 9 years ago

I remembered a presentation by @ysavourel that may be helpful, see https://www.w3.org/community/ld4lt/wiki/images/0/00/Feisgiltt-20140614-ld4lt-savourel.pdf esp. slides 8-12 about how to store the outcome of enrichment tasks.

m1ci commented 9 years ago

@fsasaki, @borriellom

There is the challenge that FREME produces several pieces of enrichment, and it would blow up the inline content if everything is stored here.

To support several pieces of enrichment, for the ITS-related enrichment @borriellom could use the ITS RDF ontology and look up enrichments that use those properties. Just an idea.

One solution could be to generate in the original (HTML) format span or other suitable elements with IDs and then refer to these in a separate (script or other, format-dependent) element. That element then can contain as much enrichment as needed.

Yes, spans are fine, and one span can carry a lot of enrichment information. This is actually CASE 1 from the example.

Regarding the issue of "blowing up the inline content": another solution could be to store the enrichments in a triple store and refer to them by ID, so that they can be fetched on demand.