freme-project / e-Internationalization

Apache License 2.0

[test] how dc:identifier is used in roundtripping HTML-NIF-HTML #22

Open m1ci opened 9 years ago

m1ci commented 9 years ago

For this HTML document:

<html>
<head>
    <title>Roundtripping</title>
</head>
<body>
<p>Welcome to Dublin</p>
</body>
</html>

You create the following NIF:

@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif:   <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix dc:    <http://purl.org/dc/elements/1.1/> .

<http://freme-project.eu/#char=0,31>
        a               nif:RFC5147String , nif:Context , nif:String ;
        nif:beginIndex  "0"^^xsd:nonNegativeInteger ;
        nif:endIndex    "31"^^xsd:nonNegativeInteger ;
        nif:isString    "Roundtripping Welcome to Dublin"@en .

<http://freme-project.eu/#char=14,31>
        a                     nif:Phrase , nif:RFC5147String , nif:String ;
        nif:ReferenceContext  "http://freme-project.eu/#char=0,31" ;
        nif:anchorOf          "Welcome to Dublin"@en ;
        nif:beginIndex        "14"^^xsd:nonNegativeInteger ;
        nif:endIndex          "31"^^xsd:nonNegativeInteger ;
        dc:identifier         "2" .

<http://freme-project.eu/#char=0,13>
        a                     nif:Phrase , nif:RFC5147String , nif:String ;
        nif:ReferenceContext  "http://freme-project.eu/#char=0,31" ;
        nif:anchorOf          "Roundtripping"@en ;
        nif:beginIndex        "0"^^xsd:nonNegativeInteger ;
        nif:endIndex          "13"^^xsd:nonNegativeInteger ;
        dc:identifier         "1" .

borriellom commented 9 years ago

We thought about the dc:identifier property during a call about NIF conversion. We agreed that it could be useful for XLIFF roundtripping: it keeps track of the related translation unit. It has no relevant meaning when converting HTML files.

This content is converted back to HTML using the NIF file whose context contains the markup. This is the markup NIF file generated from that HTML (I added it to the documentation as well):

@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix nif:   <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix dc:    <http://purl.org/dc/elements/1.1/> .

<http://freme-project.eu/doc1/#char=0,121>
        a               nif:RFC5147String , nif:Context , nif:String ;
        nif:beginIndex  "0"^^xsd:nonNegativeInteger ;
        nif:endIndex    "121"^^xsd:nonNegativeInteger ;
        nif:isString    "<!DOCTYPE html>\r\n<html><head>\r\n\t<title>Roundtripping</title>\r\n</head>\r\n<body>\r\n<p>Welcome to Dublin</p>\r\n\r\n</body></html>"@en .

<http://freme-project.eu/#char=14,31>
        a                     nif:RFC5147String , nif:String ;
        nif:anchorOf          "Welcome to Dublin"@en ;
        nif:beginIndex        "14"^^xsd:nonNegativeInteger ;
        nif:endIndex          "31"^^xsd:nonNegativeInteger ;
        nif:referenceContext  <http://freme-project.eu/#char=0,31> ;
        nif:wasConvertedFrom  <http://freme-project.eu/doc1/#char=82,99> ;
        dc:identifier         "2" .

<http://freme-project.eu/#char=0,13>
        a                     nif:RFC5147String , nif:String ;
        nif:anchorOf          "Roundtripping"@en ;
        nif:beginIndex        "0"^^xsd:nonNegativeInteger ;
        nif:endIndex          "13"^^xsd:nonNegativeInteger ;
        nif:referenceContext  <http://freme-project.eu/#char=0,31> ;
        nif:wasConvertedFrom  <http://freme-project.eu/doc1/#char=39,52> ;
        dc:identifier         "1" .

<http://freme-project.eu/#char=0,31>
        a               nif:RFC5147String , nif:Context , nif:String ;
        nif:beginIndex  "0"^^xsd:nonNegativeInteger ;
        nif:endIndex    "31"^^xsd:nonNegativeInteger ;
        nif:isString    "Roundtripping Welcome to Dublin"@en .
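
How the service actually merges the two files is not shown in this thread. As a minimal sketch of what the nif:wasConvertedFrom offsets make possible, the following Python snippet checks that each plain-text span points at the same text inside the markup context and then writes a changed string back into the HTML. The function names `check_alignments` and `write_back` and the replacement text "Willkommen in Dublin" are assumptions made for illustration only.

```python
# Markup context, copied from the nif:isString value of the doc1 context above.
markup = ("<!DOCTYPE html>\r\n<html><head>\r\n\t<title>Roundtripping</title>\r\n"
          "</head>\r\n<body>\r\n<p>Welcome to Dublin</p>\r\n\r\n</body></html>")
# Plain-text context, copied from the second nif:Context above.
plain = "Roundtripping Welcome to Dublin"

# (plain-text span, markup span) pairs read off the nif:wasConvertedFrom triples.
alignments = [((0, 13), (39, 52)), ((14, 31), (82, 99))]


def check_alignments(plain, markup, alignments):
    """Return True if every plain-text span equals the markup span it points to."""
    return all(plain[pb:pe] == markup[mb:me] for (pb, pe), (mb, me) in alignments)


def write_back(markup, replacements):
    """Replace markup spans with new text.

    `replacements` maps (begin, end) offsets in the markup to the new string.
    Spans are processed from right to left so earlier offsets stay valid.
    """
    out = markup
    for (begin, end), text in sorted(replacements.items(), reverse=True):
        out = out[:begin] + text + out[end:]
    return out


# Hypothetical round trip: the second phrase was changed by some e-Service.
html = write_back(markup, {(82, 99): "Willkommen in Dublin"})
```

Replacing from the rightmost span first is a design choice that avoids recomputing offsets after each substitution.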

Could you explain what you mean by the last question, please?

m1ci commented 9 years ago

We thought about the dc:identifier property during a call about NIF conversion. We agreed that it could be useful for XLIFF roundtripping: it keeps track of the related translation unit. It has no relevant meaning when converting HTML files.

OK, thanks for the reminder.

This content is converted back to HTML using the NIF file whose context contains the markup.

The NIF context containing the source markup is not returned. Why?

Could you explain what you mean by the last question, please?

The same as the question above. Why is the NIF context containing the markup not returned?

jnehring commented 9 years ago

The same as the question above. Why is the NIF context containing the markup not returned?

We could include it in the NIF response. I thought we create two separate NIF documents, and that's why I did not merge the two NIF documents before returning them to the user. Also, it is unclear to me which URI we would use for this information.

borriellom commented 9 years ago

We agreed to produce two different NIF files because we needed two contexts: one including the markup and one containing only plain text. The reason was that FREME e-Services cannot deal with a NIF file that has two contexts. Moreover, since the context including the markup is only needed for performing the round-tripping (it is not relevant for the end user), it is not returned by the service and is temporarily saved on the local machine.

Regarding the URI, thank you for the reminder. We should think of a strategy for generating unique URIs, so that we are sure of merging the correct files when doing round-tripping. It's already possible to choose a URI from outside and pass it to the conversion method. Anyway, at the moment http://freme-project.eu/ is the base URI for the plain-text context, while http://freme-project.eu/doc1/ is the base URI for the markup context. This is a temporary solution and I think it should be changed.

m1ci commented 9 years ago

The reason was that FREME e-Services cannot deal with a NIF file that has two contexts. Moreover, since the context including the markup is only needed for performing the round-tripping (it is not relevant for the end user), it is not returned by the service and is temporarily saved on the local machine.

OK, makes sense.

Regarding the URI, thank you for the reminder. We should think of a strategy for generating unique URIs, so that we are sure of merging the correct files when doing round-tripping.

Indeed. I think we should use hash values generated from the content. In NIF, the URIs for Strings can be 1) "Offset Based Strings", which is what we are using now, or 2) "Context Hash Based Strings", which remain more robust when the document changes. See the guidelines for how they are constructed: http://jens-lehmann.org/files/2012/ekaw_nif.pdf (page 4).

It's already possible to choose a URI from outside and pass it to the conversion method.

In some scenarios, this can be an option. But for the round-tripping, it's maybe better if the URIs are generated on the server side. Let's see what others think.

jnehring commented 9 years ago

Regarding UUIDs in general

We could also use Java's unique ID generator, java.util.UUID (see the Java API Doc and a short tutorial). We use this technique to generate tokens. FREME tokens are actually Java UUIDs.

The advantage of UUIDs over hash values is that they are effectively unique. If someone sends plain text with the same content from two different sources to FREME, then with hash values both will get the same URLs, which is IMO problematic.
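
The difference can be illustrated with a short sketch. It is written in Python for brevity, with `uuid.uuid4()` standing in for Java's `UUID.randomUUID()`. The URI layouts produced by `hash_uri` and `uuid_uri` are hypothetical and are not what FREME generates; the point is only that a content hash is deterministic while a random UUID is not.

```python
import hashlib
import uuid

BASE = "http://freme-project.eu/"  # base URI mentioned earlier in this thread


def hash_uri(base, text):
    """Hypothetical content-derived URI: same text always gives the same URI."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return f"{base}#hash_{digest}"


def uuid_uri(base):
    """Hypothetical random URI: every call gives a fresh identifier."""
    return f"{base}{uuid.uuid4()}/"


text = "Roundtripping Welcome to Dublin"

# Two clients submit identical content: the hash-based URIs coincide ...
from_client_a = hash_uri(BASE, text)
from_client_b = hash_uri(BASE, text)

# ... while UUID-based URIs differ even for identical content.
uuid_a = uuid_uri(BASE)
uuid_b = uuid_uri(BASE)
```

Whether identical URIs for identical content are a problem or a feature depends on whether the URI is meant to identify the content or the submission.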

No matter if we use hash values or UUIDs we could use these unique URIs in two areas:

I suggest moving the discussion about unique URIs to a new issue in technical discussion and making it a feature of a future version of FREME, e.g. FREME 0.5. Or do you think this is a bug that needs to be fixed right now?

Regarding UUIDs for roundtripping

We should think of a strategy for generating unique URIs, so that we are sure of merging the correct files when doing round-tripping.

In the current implementation of roundtripping we do merge the correct files. We actually generate the URI http://freme-project.eu for all resources sent through e-Internationalization. Resources that do not belong together are kept apart not by their NIF URIs but by the fact that they are generated in different HTTP requests.

m1ci commented 9 years ago

Thanks for the proposal, Jan. Personally, I don't like the idea of using UUIDs, mainly because it is not compatible with the NIF spec. We should stick to the NIF spec.

I suggest using "hash based" URIs with a unique base prefix for the URI. Example: http://freme-project.eu/doc1/#hash_0_30_067e61623b6f4ae2a1712470b63dff00

Here http://freme-project.eu/doc1/ is the unique part proposed by the client or server, and #hash_0_30_067e61623b6f4ae2a1712470b63dff00 is the hash value representing the content. For more on constructing Context-Hash-based URIs see http://jens-lehmann.org/files/2012/ekaw_nif.pdf (page 4).
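
A rough sketch of this proposal in Python, for illustration only. It assumes the fragment has the shape `hash_<contextLength>_<stringLength>_<md5>` as in the example URI above, and that the digest is an MD5 over the left context, the addressed string in parentheses, and the right context. That recipe is a reading of the cited guidelines and should be checked against them; the guidelines also appear to append the first characters of the string, which is omitted here to match the example. The function name `context_hash_uri` is hypothetical.

```python
import hashlib


def context_hash_uri(base, text, begin, end, context_length=0):
    """Build a hash-based URI for text[begin:end] under a unique base prefix.

    Assumed recipe: MD5 over leftContext + "(" + string + ")" + rightContext,
    using up to `context_length` characters on each side.
    """
    left = text[max(0, begin - context_length):begin]
    right = text[end:end + context_length]
    message = f"{left}({text[begin:end]}){right}"
    digest = hashlib.md5(message.encode("utf-8")).hexdigest()
    return f"{base}#hash_{context_length}_{end - begin}_{digest}"


text = "Roundtripping Welcome to Dublin"

# Same content under two unique prefixes: the URIs differ, the hash part does not.
uri_doc1 = context_hash_uri("http://freme-project.eu/doc1/", text, 0, len(text))
uri_doc2 = context_hash_uri("http://freme-project.eu/doc2/", text, 0, len(text))

# Robustness to document changes: "Dublin" keeps its hash fragment even though
# its offsets shift when text is added in front of it.
short = "Welcome to Dublin"
b1 = short.index("Dublin")
b2 = text.index("Dublin")
frag_short = context_hash_uri("", short, b1, b1 + 6, context_length=3)
frag_long = context_hash_uri("", text, b2, b2 + 6, context_length=3)
```

The unique prefix addresses the concern about identical content from different sources, while the hash part stays derived from the content.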