freme-project / e-Internationalization

Apache License 2.0

XML processing: is it possible to select what parts should be enriched? #25

Open fsasaki opened 9 years ago

fsasaki commented 9 years ago

Is it possible to identify in an XML document the parts that should (not) be enriched? I am asking since many XML formats contain fields that are not suitable for enrichment. See attached ONIX file, the related request and output. onix-input.txt request3.txt out3.txt

borriellom commented 9 years ago

I'm not an expert on XML files. I think that if we agree on a special attribute, I can discard the text units embedded in a tag with that attribute. As a result, those text units would not be inserted into the final NIF file and therefore would not be enriched.

Anyway, I don't know whether defining such an attribute is feasible.
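
To make the attribute idea concrete, here is a minimal sketch in Python using only the standard library. The attribute name `enrich` and its value `no` are assumptions made up for illustration; nothing like that is defined in e-Internationalisation. The sketch walks the document and skips every element (and its subtree) that carries the marker.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker attribute; the name and value are assumptions.
SKIP_ATTR = "enrich"
SKIP_VALUE = "no"


def collect_text_units(elem, units=None):
    """Collect non-empty text units, skipping subtrees marked with the attribute."""
    if units is None:
        units = []
    if elem.get(SKIP_ATTR) == SKIP_VALUE:
        return units  # discard this element and everything inside it
    if elem.text and elem.text.strip():
        units.append(elem.text.strip())
    for child in elem:
        collect_text_units(child, units)
        # The tail text belongs to the parent, so it is kept even if the child is skipped.
        if child.tail and child.tail.strip():
            units.append(child.tail.strip())
    return units


doc = ET.fromstring(
    '<product><isbn enrich="no">9780000000002</isbn>'
    "<title>A History of Berlin</title></product>"
)
units = collect_text_units(doc)
print(units)
```

Only the title would go on to the NIF conversion; the ISBN field is dropped before enrichment.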

fsasaki commented 9 years ago

If that is feasible for your workflow, I would suggest the following: allow an XPath expression (including the setting of XML namespace prefixes) that selects which parts should be used for enrichment. The user would provide that XPath expression when calling e-Internationalisation. The approach of "text units embedded in a tag with such attribute" would require changing the XML content itself, which is probably not feasible.

The XPath expression is not needed for roundtripping, btw. And: this would be an extremely powerful feature IMO, since users can adapt "their" XML vocabulary for FREME on the fly.
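
A rough sketch of what such a selection could look like, written in Python with the standard library only. The function name, the `ex` prefix, the namespace URI and the sample document are all assumptions for illustration, not part of any FREME API. Note that `xml.etree.ElementTree` supports only a subset of XPath; a full implementation would need a complete XPath engine.

```python
import xml.etree.ElementTree as ET


def select_for_enrichment(xml_text, xpath, namespaces):
    """Return the text of every element matched by the user-supplied XPath.

    `namespaces` maps the prefixes used in `xpath` to namespace URIs.
    """
    root = ET.fromstring(xml_text)
    return ["".join(el.itertext()).strip() for el in root.findall(xpath, namespaces)]


# Made-up sample document and namespace (assumptions, not real ONIX).
SAMPLE = """<ex:Product xmlns:ex="http://example.org/ns">
  <ex:RecordReference>12345</ex:RecordReference>
  <ex:Text>A novel set in Vienna.</ex:Text>
</ex:Product>"""

selected = select_for_enrichment(
    SAMPLE,
    ".//ex:Text",                      # user-supplied XPath expression
    {"ex": "http://example.org/ns"},   # user-supplied prefix binding
)
print(selected)
```

The XML itself stays untouched: the record reference is simply never selected, so it never reaches the enrichment step.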

m1ci commented 8 years ago

@fsasaki addressing content with XPath is fine, but that would work only with XML and XML-based formats, not with RDF/NIF. What about introducing an ITS property and a corresponding NIF property for selecting the elements to process? Then it can be reused in XML/ITS but also in RDF/NIF. What do you think?

fsasaki commented 8 years ago

Hi Milan, on "What about introducing ITS and corresponding NIF property": this makes sense too, but for the XML people it is a different use case. E.g. in an XML file like this http://mariage.uvic.ca/anth_doc.xml?id=la_femme_battant you may want to process only the "p" elements with FREME. So it makes sense not to send the whole file to FREME, which would also slow down the FREME / NIF processing. Now, in a specific "p" element you may want to say "this substring should not be translated", which you can do with the ITS representation in RDF/NIF, and e-Translation allows that already, AFAIK. But you would still send the whole string to FREME (which would then benefit from the RDF/NIF + ITS information). Wrt XPath, it seems this will not end up in e-Internationalisation but in the batch processing tool.

fsasaki commented 8 years ago

Just to give an idea of how the pre-processing works using XPath via XSLT, see here http://fsasaki.github.io/stuff/freme-xml-processing-example/ The demo allows processing TEI (XML relevant for digital humanities) with FREME. The XML processing is done via an XSLT pre-processing step.
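
For readers without an XSLT setup, the same idea can be sketched in Python instead of XSLT (so this is not the demo's code). Only the TEI "p" elements are selected and passed to an enrichment step, and the result is written back into the document. `fake_enrich` is a hypothetical stand-in that just upper-cases text; a real workflow would call FREME at that point. For simplicity the sketch only handles paragraphs without child elements.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def fake_enrich(text):
    """Hypothetical stand-in for the enrichment call; not a FREME API."""
    return text.upper()


def process_paragraphs(xml_text, enrich):
    """Apply `enrich` to the text of each TEI p element, leaving the rest as-is."""
    root = ET.fromstring(xml_text)
    for p in root.findall(".//tei:p", TEI_NS):
        # Simplification: skip paragraphs with inline markup (child elements).
        if len(p) == 0 and p.text:
            p.text = enrich(p.text)
    return root


SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><title>Doc</title></teiHeader>"
    "<text><body><p>first paragraph</p><p>second paragraph</p></body></text>"
    "</TEI>"
)

result = process_paragraphs(SAMPLE, fake_enrich)
paragraphs = [p.text for p in result.findall(".//tei:p", TEI_NS)]
title = result.find(".//tei:title", TEI_NS).text
print(paragraphs, title)
```

The header content is never sent to the enrichment step, which is the point of selecting before calling the service.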

m1ci commented 8 years ago

Very nice demo! As for the new property proposal: if I understand the use case correctly, with XPath there is no need to intervene in the XML (i.e. to add instruction properties stating what to process and what not). Yes, this makes sense to me.

And yes, I agree, this is something for the batch processing tool.