browsermt / bergamot-translator

Cross-platform C++ library focusing on optimized machine translation on consumer-grade devices.
http://browser.mt
Mozilla Public License 2.0

Content types? #435

Open kpu opened 2 years ago

kpu commented 2 years ago

I have 3 MSc students adding Word support. The most logical way of handling this is to extend HTML support to OOXML, most of which is configuring the HTML Options object (though there are also differences, e.g. multiple spaces are semantically meaningful). Doing that is the students' responsibility. The question is how this should be exposed in the native interface. What we have right now is a boolean for HTML. Should it change to a content type: text/plain, text/html, and application/vnd.openxmlformats-officedocument.wordprocessingml.document?

jelmervdl commented 2 years ago

Using MIME types instead of a boolean sounds like a good way to indicate how the content should be handled. It is also pretty future-proof, especially if we can just associate a MIME type with a processing class somewhere in the code.
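As a rough illustration of associating a MIME type with a processing class, here is a minimal sketch. Everything in it is hypothetical: `ContentProcessor`, `ProcessorRegistry`, `defaultRegistry` and the three processor classes are made-up names for this example and are not part of bergamot-translator's current interface.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical interface: one implementation per supported content type.
class ContentProcessor {
 public:
  virtual ~ContentProcessor() = default;
  // Short label, used here only to show which class was selected.
  virtual std::string name() const = 0;
};

class PlainTextProcessor : public ContentProcessor {
 public:
  std::string name() const override { return "plain"; }
};

class HTMLProcessor : public ContentProcessor {
 public:
  std::string name() const override { return "html"; }
};

class OOXMLProcessor : public ContentProcessor {
 public:
  std::string name() const override { return "ooxml"; }
};

// Hypothetical registry mapping a MIME type string to a factory.
class ProcessorRegistry {
 public:
  using Factory = std::function<std::unique_ptr<ContentProcessor>()>;

  void add(const std::string &mimeType, Factory factory) {
    assert(factory && "factory must be callable");
    factories_[mimeType] = std::move(factory);
  }

  // Throws std::invalid_argument for content types nobody registered.
  std::unique_ptr<ContentProcessor> create(const std::string &mimeType) const {
    auto it = factories_.find(mimeType);
    if (it == factories_.end()) {
      throw std::invalid_argument("unsupported content type: " + mimeType);
    }
    return it->second();
  }

 private:
  std::map<std::string, Factory> factories_;
};

// Registers the three content types discussed in this issue.
inline ProcessorRegistry defaultRegistry() {
  ProcessorRegistry registry;
  registry.add("text/plain", [] { return std::make_unique<PlainTextProcessor>(); });
  registry.add("text/html", [] { return std::make_unique<HTMLProcessor>(); });
  registry.add(
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      [] { return std::make_unique<OOXMLProcessor>(); });
  return registry;
}
```

With something along these lines, the native interface would take a content type string where it now takes the HTML boolean, and supporting a new format would mean registering one more factory rather than adding another flag.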

I'd warn them about extending the current HTML object to support OOXML. That code has already become pretty complicated on its own, and is filled with assumptions about how HTML is used semantically. Some of that is encapsulated in the HTML Options object, but there are also assumptions in how tags that are inserted back in the element need to align with whitespace (i.e. it turns hello_<b>_world_</b>! into hello_ _<b>world</b>_!). It could be a source of frustration, and I would not be opposed to just copying the HTML class for the OOXML one and stripping out all the bits you don't need, just to avoid possible weird interactions.

The xh_scanner.{h,cpp} files also have hard-coded assumptions about HTML, e.g. which tags should never have a closing element, like <input>, and which tags should never have their contents parsed, e.g. <script>. For XML parsing those rules would need to be disabled, and support for CDATA would need to be added back. I'm not sure whether it's easier to add those back in through if-statements (and then also having to add support for these new sections to the HTML class) or to have a copy of parts of that code in a separate XML parsing class. The latter might be more maintainable.
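To make the trade-off concrete, here is a minimal sketch of a third possibility: moving the hard-coded rules into a small rules object, so that one scanner could serve both HTML and XML. This is an assumption for illustration only. `ScannerRules`, `htmlRules`, `xmlRules` and the helper functions are hypothetical names and do not reflect how xh_scanner is structured today; the tag lists are a small sample, not a complete set.

```cpp
#include <set>
#include <string>

// Hypothetical bundle of the rules that are currently hard-coded for HTML.
struct ScannerRules {
  std::set<std::string> voidTags;     // tags that never have a closing element
  std::set<std::string> rawTextTags;  // tags whose contents are never parsed
  bool parseCDATA = false;            // whether CDATA sections are recognised
};

// HTML: a few void and raw-text elements, no CDATA handling.
inline ScannerRules htmlRules() {
  ScannerRules rules;
  rules.voidTags = {"br", "hr", "img", "input"};
  rules.rawTextTags = {"script", "style"};
  rules.parseCDATA = false;
  return rules;
}

// XML (e.g. OOXML): no tag-specific special cases, CDATA enabled.
inline ScannerRules xmlRules() {
  ScannerRules rules;
  rules.parseCDATA = true;
  return rules;
}

inline bool isVoidTag(const ScannerRules &rules, const std::string &tag) {
  return rules.voidTags.count(tag) > 0;
}

inline bool isRawTextTag(const ScannerRules &rules, const std::string &tag) {
  return rules.rawTextTags.count(tag) > 0;
}
```

The scanner would then ask the rules object instead of consulting a fixed list. This avoids scattered if-statements, but the HTML class would still need to handle CDATA sections, so a separate XML parsing class may remain the more maintainable option.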