DITA XML format - Githubissues

MicrosoftTranslator / DocumentTranslation

Command Line tool and Windows application for document translation, a local interface to the Azure Document Translation service for Windows, macOS and Linux.

DITA XML format #111

Open tristanmaccana opened 1 year ago

tristanmaccana commented 1 year ago

Are there any plans to add DITA as a translatable format?

chriswendt1 commented 1 year ago

Before you sent this request: No. Do you have a hint on any libraries or existing processing logic? The key is to extract exactly the translatable elements, escape any sentence-internal tags, reinsert the internal tag values at the right places, and then put the translated segments back into the original markup. This is best done by proper DOM parsing. Do you have a set of DITA documents you want to translate and can share a link to? If there were an existing DITA-to-HTML and HTML-to-DITA converter, that could work without writing new code.
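The extract, translate, and reinsert cycle described above can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration under stated assumptions: the `BLOCK_TAGS` set is a hypothetical subset of translatable DITA elements, `translate` is a stand-in for whatever call sends a segment to the translation service, and inline tags are kept inside the segment as markup rather than being escaped.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumption: a small, illustrative subset of translatable DITA block elements.
BLOCK_TAGS = {"title", "shortdesc", "p", "li"}


def inner_xml(elem):
    """Return an element's content (text plus inline children) as a markup string."""
    parts = [escape(elem.text or "")]
    for child in elem:
        # ET.tostring() serializes the child together with its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


def set_inner_xml(elem, markup):
    """Replace an element's content with parsed markup, keeping its tag and attributes."""
    wrapper = ET.fromstring(f"<w>{markup}</w>")  # raises ParseError on broken markup
    for child in list(elem):
        elem.remove(child)
    elem.text = wrapper.text
    elem.extend(list(wrapper))


def is_leaf_block(elem):
    """True for a translatable block that contains no other translatable block."""
    return elem.tag in BLOCK_TAGS and not any(
        d is not elem and d.tag in BLOCK_TAGS for d in elem.iter()
    )


def translate_topic(xml_text, translate):
    """Apply translate() (a hypothetical callable) to every leaf block of a topic."""
    root = ET.fromstring(xml_text)
    # Collect the targets first so the tree is not modified while iterating.
    for elem in [e for e in root.iter() if is_leaf_block(e)]:
        set_inner_xml(elem, translate(inner_xml(elem)))
    return ET.tostring(root, encoding="unicode")
```

Note that `ElementTree` drops comments, processing instructions, and the DOCTYPE on parsing, so a real implementation would need a parser that preserves them.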

chriswendt1 commented 1 year ago

Hi Tristan, I think the appropriate process will be to use Fluenta (https://github.com/rmraya/Fluenta) to extract the translatable elements from DITA files referenced in a DITA map into XLIFF. You then use Document Translator to translate the XLIFF. Then use Fluenta again to re-insert the translated elements into the DITA files. Let us know if you used this successfully. For convenience, this could be arranged in a workflow controlled by Document Translator. Document Translator could invoke Fluenta as an external process before and after translating the XLIFF.

tristanmaccana commented 1 year ago

Hi Chris, thanks very much for getting back to me. Yes, we have a very large set of manuals, installer guides, etc. that we would like to translate from DITA using Microsoft Translator, so we would like to set up an efficient workflow. Using Fluenta, we had an issue with inline elements in the target. See the image: the ph ids are in random order and some are empty. Can I confirm whether you are using XLIFF 1.2 only? [Image: inline errors]

chriswendt1 commented 1 year ago

Hi Tristan, It seems to me as if Fluenta should not have processed the ph element as translatable. Can you teach Fluenta to exclude it from translation? The content of the ph element is not translatable. You may get fairly random output in translation.

tristanmaccana commented 1 year ago

Hi Chris, we nearly always have to translate inline elements such as uicontrol elements, which are localized. Perhaps we have to translate them in a separate process? Here is our XLIFF file after Translator: https://drive.google.com/file/d/1ZUqIoGAnOceNCN_Cubmxo5sPF_zrG2tS/view?usp=share_link I will try to share the source file with you later.

chriswendt1 commented 1 year ago

Thanks for sharing the sample. Looking at the first segment with <ph> elements inside, the segment to translate is this:

You can configure images and text for each space level in the <ph ctype="x-other" id="0">&lt;uicontrol class="+ topic/ph ui-d/uicontrol "&gt;</ph>Green Hub<ph id="1">&lt;/uicontrol&gt;</ph> section in the <ph ctype="x-other" id="2">&lt;ph keyref="brand" status="removeContent" class="- topic/ph "&gt;</ph><ph ctype="x-other" id="3">&lt;keyword class="- topic/keyword "&gt;</ph>OpenBlue<ph id="4">&lt;/keyword&gt;</ph><ph id="5">&lt;/ph&gt;</ph> Enterprise Manager <ph ctype="x-other" id="6">&lt;uicontrol class="+ topic/ph ui-d/uicontrol "&gt;</ph>Setup<ph id="7">&lt;/uicontrol&gt;</ph> page.

It's a horrible mess that won't make much sense to Translator. There is too much markup inside the translatable string, which is escaped via &lt; and &gt;. Have you tried unescaping this as if it were proper markup? Alternatively, you could compress the untranslatable spans to something like #markup1 and see if that makes it better.
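The placeholder idea could look roughly like the following Python sketch. It assumes the segment comes from an XLIFF 1.2 file in which the content of each <ph> element is entity-escaped, so a <ph> never literally contains another tag. The function names and the exact token shape are made up for illustration, and whether Translator leaves such tokens untouched is something to verify, not a given.

```python
import re

# Matches one complete <ph ...>...</ph> element whose content is entity-escaped.
PH_RE = re.compile(r"<ph\b[^>]*>.*?</ph>", re.DOTALL)
# Tokens are closed with a second "#" so that "#markup1#5" cannot be read as token 15.
TOKEN_RE = re.compile(r"#markup(\d+)#")


def compress(segment):
    """Replace every <ph> span with a short numbered token; return text and spans."""
    spans = []

    def stash(match):
        spans.append(match.group(0))
        return f"#markup{len(spans)}#"

    return PH_RE.sub(stash, segment), spans


def restore(translated, spans):
    """Put the original <ph> spans back wherever their tokens ended up."""

    def unstash(match):
        index = int(match.group(1)) - 1
        # Unknown token numbers are left in place instead of raising.
        return spans[index] if 0 <= index < len(spans) else match.group(0)

    return TOKEN_RE.sub(unstash, translated)


segment = (
    'in the <ph ctype="x-other" id="0">&lt;uicontrol class="+ topic/ph '
    'ui-d/uicontrol "&gt;</ph>Green Hub<ph id="1">&lt;/uicontrol&gt;</ph> section'
)
compact, spans = compress(segment)
# compact is now: in the #markup1#Green Hub#markup2# section
```

Because the tokens are numbered, `restore` still works if the translation reorders them, which matters for languages with different word order.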

chriswendt1 commented 1 year ago

I see what you mean. The <ph> is a legal XLIFF element and should have passed through as is. I tried unescaping the internal markup and it fails. No surprise, it wouldn't be legal XLIFF. I'll try a few more things.

tristanmaccana commented 1 year ago

Hi Chris, I'll share a link to the ditamap with you. I would appreciate your expert eye on this, and any insights on how to convert it correctly, if you get a chance: https://drive.google.com/file/d/1PDgIXgGq1qUjKbnMzovpC8flIiIYdkaT/view?usp=share_link Thanks. (PS: Just saw your earlier replies, thank you!)

chriswendt1 commented 1 year ago

Hi @tristanmaccana, I am currently working on adding the SRT and VTT file formats to Document Translation. The strategy is to transform client-side to a supported document format, packing the additional information into a comment, translating the supported format as usual, and then unpacking at the client. MD is a candidate, because it is so simple to parse. HTML could work as well. Maybe something like this could help here too, packing the additional information into the untranslatable XLIFF markup.
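As a rough illustration of the pack and unpack strategy, the Python sketch below converts SRT cues to HTML paragraphs and stores each cue's number and timing in an HTML comment. It is not the implementation in Document Translation: the function names and the comment layout are invented here, and it assumes the translation step returns the comments unchanged.

```python
import html
import re

TIMING_RE = re.compile(r"(\S+)\s+-->\s+(\S+)")
CUE_RE = re.compile(r"<!-- cue (\d+) (\S+) (\S+) -->\s*<p>(.*?)</p>", re.DOTALL)


def srt_to_html(srt_text):
    """Pack SRT cues into HTML: timing goes into a comment, text into a <p>."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        # The SRT arrow "-->" would end an HTML comment early, so only the
        # two timestamps are stored and the arrow is re-created on unpacking.
        start, end = TIMING_RE.match(lines[1]).groups()
        body = "<br>".join(html.escape(line) for line in lines[2:])
        paragraphs.append(
            f"<!-- cue {lines[0].strip()} {start} {end} -->\n<p>{body}</p>"
        )
    return "\n".join(paragraphs)


def html_to_srt(html_text):
    """Unpack (translated) HTML back into SRT using the comment metadata."""
    cues = []
    for number, start, end, body in CUE_RE.findall(html_text):
        text = "\n".join(html.unescape(part) for part in body.split("<br>"))
        cues.append(f"{number}\n{start} --> {end}\n{text}")
    return "\n\n".join(cues) + "\n"
```

The same pattern would carry over to the DITA case if the non-translatable inline markup were parked in a place the service passes through untouched.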