MicrosoftTranslator / DocumentTranslator-Legacy

Microsoft Document Translator (Archive) - Replaced by the MicrosoftTranslator/DocumentTranslation project in this repository.

PPTX files not returning generalnn translation #44

Open holmbergius opened 6 years ago

holmbergius commented 6 years ago

We're not receiving generalnn category NMT translations through Document Translator for PowerPoint. We might be able to help fix. Are you able to identify where the bug might be occurring?

Thank you! Dell EMC G11N Team

chriswendt1 commented 6 years ago

That would probably be in https://github.com/MicrosoftTranslator/DocumentTranslator/blob/master/TranslationAssistant.Business/DocumentTranslationManager.cs.

I think there are two issues involved:
a) As you observed, the generalnn category is lost.
b) Formatting of runs in the text inserts harmful tags.

For a) I have no clue where that gets lost. For b) probably best to remove tagging from inside individual elements, or certainly from within sentences. May use the BreakSentences method to determine sentence breaks.

Note that I recently added format cleanup logic for Word docs; maybe something similar applies to PPTX.

chriswendt1 commented 6 years ago

If you didn't find the solution yet, I'll have a look on the weekend.

holmbergius commented 6 years ago

Thanks! I have been looking but haven't found code that would send PPTX files down a different submission route than DOCX for example. Your help would be awesome. Thank you!

holmbergius commented 6 years ago

P.S. I'm testing on 1.4.2U.

chriswendt1 commented 6 years ago

It is using the neural translation system just fine. The problem is that individual bullets are split into multiple segments by markup that PowerPoint applies more or less randomly. As a result, various substrings of a sentence are translated individually, which breaks the coherence of the sentence. Unfortunately the OpenXML markup simplifier I used for Word documents doesn't work on PowerPoint docs. That'll be an undertaking.
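
To illustrate the splitting described above, here is a minimal Python sketch (not code from Document Translator) that merges adjacent DrawingML runs in one bullet when their formatting is effectively the same. The sample XML, the function names, and the choice of which `a:rPr` attributes to ignore (`dirty`, `err`, `smtClean`, `smtId`) are assumptions made for the illustration; a real cleanup would need to be checked against actual PPTX files.

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A_NS}

# Assumption: these run-property attributes have no visible effect on the text.
IGNORED_ATTRS = {"dirty", "err", "smtClean", "smtId"}

# Hypothetical bullet: one sentence split into four runs.
SAMPLE = f"""
<a:p xmlns:a="{A_NS}">
  <a:r><a:rPr lang="en-US" dirty="0"/><a:t>The quick </a:t></a:r>
  <a:r><a:rPr lang="en-US" dirty="0" err="1"/><a:t>brown</a:t></a:r>
  <a:r><a:rPr lang="en-US" b="1"/><a:t> fox</a:t></a:r>
  <a:r><a:rPr lang="en-US"/><a:t>.</a:t></a:r>
</a:p>
"""


def format_key(run):
    """Return a comparable summary of a run's visible formatting."""
    rpr = run.find("a:rPr", NS)
    if rpr is None:
        return ((), ())
    attrs = tuple(sorted(
        (k, v) for k, v in rpr.attrib.items() if k not in IGNORED_ATTRS
    ))
    children = tuple(ET.tostring(child) for child in rpr)
    return (attrs, children)


def merge_runs(paragraph):
    """Merge neighbouring <a:r> runs whose format_key is identical."""
    prev = None
    for child in list(paragraph):
        if child.tag != f"{{{A_NS}}}r":
            prev = None  # line breaks, fields, etc. end a mergeable stretch
            continue
        prev_t = prev.find("a:t", NS) if prev is not None else None
        cur_t = child.find("a:t", NS)
        if (prev_t is not None and cur_t is not None
                and format_key(prev) == format_key(child)):
            prev_t.text = (prev_t.text or "") + (cur_t.text or "")
            paragraph.remove(child)
        else:
            prev = child
    return paragraph


def run_texts(paragraph):
    """List the text of each remaining run, in order."""
    return [r.findtext("a:t", default="", namespaces=NS)
            for r in paragraph.findall("a:r", NS)]
```

In this sample the first two runs differ only in ignored attributes and are merged, while the bold run still forms its own segment, so merging alone does not make the whole bullet one translation unit.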

holmbergius commented 6 years ago

Anything we can help with?

holmbergius commented 6 years ago

Hi Chris, any chance of a fix soon...or could we collaborate on one? This is critical for us, so we'd love to get or help develop a solution.

Thank you! Jason

georgkirchner commented 6 years ago

Do you have an update for us on this one, Chris?

holmbergius commented 6 years ago

Hi Chris, any chance Microsoft can fix this one? Thank you! :)

holmbergius commented 6 years ago

Just checking in: PowerPoint is a great Microsoft document format. Can we work with you to get the bullet points translating correctly with NMT through Document Translator?

Thank you!

chriswendt1 commented 6 years ago

Sorry for the lack of an answer. I do not have a brilliant idea here, other than adopting the markup simplification strategy that the OpenXML markup simplifier offers for Word docs. In my observation it worked well there, but losing markup in PowerPoint is quite a bit worse than losing it in Word, in terms of the effect the end user perceives. I see these options:

1) Translate the PowerPoint OpenXML by hand. The Translator API preserves HTML tags natively. The tags must be sentence-internal HTML tags to avoid breaking up the bullet. May need some tag renaming logic back and forth.
2) Drill into the OpenXML simplification toolkit to add PowerPoint as a format suitable for simplification. That probably comes at the cost of losing something, so it'll be more work with a less usable outcome.
3) In a preprocessor, parse the original OpenXML and simplify the bullet markup by hand. Remember the simplifications. In a post-processor, re-apply the markup, using word alignment information from the Translator API as a helper. This is the intellectually more complex task, but it avoids digging deep into the otherwise nicely working toolkits, and promises the best results.
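
As a rough illustration of the tag renaming in option 1, the Python sketch below turns the runs of one bullet into a single HTML string with sentence-internal `<span>` tags, and maps a translated string back to runs. The `(format_id, text)` representation, the function names, and the `<span id="N">` convention are all assumptions for the example; no Translator API call is made, and the sketch assumes the translation service returns every span intact with all text inside a span.

```python
import html
import re

# Matches the placeholder spans produced by runs_to_html.
SPAN_RE = re.compile(r'<span id="(\d+)">(.*?)</span>', re.DOTALL)


def runs_to_html(runs):
    """Encode [(format_id, text), ...] as one HTML string for the whole bullet.

    The span id is the index of the run, so the formatting can be looked up
    again after translation.
    """
    return "".join(
        f'<span id="{i}">{html.escape(text, quote=False)}</span>'
        for i, (_, text) in enumerate(runs)
    )


def html_to_runs(translated, original_runs):
    """Decode translated HTML back into [(format_id, text), ...].

    Spans may come back reordered; each one keeps the formatting of the
    original run it is numbered after. Text outside any span is dropped,
    which is a limitation of this sketch.
    """
    result = []
    for match in SPAN_RE.finditer(translated):
        index = int(match.group(1))
        format_id = original_runs[index][0]
        result.append((format_id, html.unescape(match.group(2))))
    return result
```

Usage would be: encode the runs, send the single string for translation as HTML, then decode the result and write the runs back into the slide.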

I currently do not have time to go deeper on any of these, and of course would welcome any help.

holmbergius commented 6 years ago

Hi Chris, are you aware of Translator Tools' Document Cleaner? We have been pre-processing our Word documents (using Tag Cleaner feature for tag reduction) before using Document Translator, and the quality improvements from NMT are HUGE with some short pre-processing. Unfortunately, it is a plugin and not a standalone set of routines, so it's a manual step and we would love it to be automated.

I would highly recommend looking at the steps it takes for Word and PowerPoint at least. Document Translator would be significantly stronger if additional tag reduction were incorporated.

More info here:

http://www.translatortools.net/word-doccleaner.html

Tag Cleaner – this tool re-formats the document in order to minimize tags displayed when you import the document into your CAT tool. Tags are a result of complex formatting applied by OCR and PDF conversion tools in order to reproduce the appearance of the source PDF / scanned document. 99% of the time this complex formatting is not necessary, as it was never used in the original document.

Tag Cleaner performs the following (optional) operations to minimize tags and make the document more user-friendly:

- fixes invisible formatting problems,
- resets uneven character spacing,
- removes text and paragraph shading,
- turns ‘black’ font color into ‘automatic’ color,
- removes text highlighting,
- removes manual hyphenation,
- removes character styles, leaving direct formatting only,
- removes some types of unnecessary bookmarks,
- normalizes (makes standard) font colors inside each paragraph,
- normalizes (makes standard) font sizes inside each paragraph,
- normalizes (makes standard) fonts inside each paragraph, and
- finds specific symbols from symbol fonts and converts them to corresponding readable characters from standard fonts.
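
To make one of these operations concrete for PowerPoint, here is a small Python sketch of font size normalization inside a single DrawingML paragraph. It is not Tag Cleaner's implementation, which is a Word plugin; the function name, the sample XML, and the rule of picking the size that covers the most characters are assumptions for illustration only.

```python
from collections import Counter
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A_NS}

# Hypothetical paragraph where one run has a slightly different size.
SIZE_SAMPLE = f"""
<a:p xmlns:a="{A_NS}">
  <a:r><a:rPr lang="en-US" sz="1800"/><a:t>The quick brown</a:t></a:r>
  <a:r><a:rPr lang="en-US" sz="1790"/><a:t> fox</a:t></a:r>
</a:p>
"""


def normalize_font_size(paragraph):
    """Set every explicit run size to the paragraph's dominant size.

    The dominant size is the `sz` value covering the most characters
    (an assumed heuristic). Runs without an explicit `sz` are left alone.
    Returns the chosen size, or None if no run declares one.
    """
    weights = Counter()
    sized = []
    for run in paragraph.findall("a:r", NS):
        rpr = run.find("a:rPr", NS)
        if rpr is None or "sz" not in rpr.attrib:
            continue
        text = run.findtext("a:t", default="", namespaces=NS)
        weights[rpr.get("sz")] += len(text)
        sized.append(rpr)
    if not weights:
        return None
    dominant = weights.most_common(1)[0][0]
    for rpr in sized:
        rpr.set("sz", dominant)
    return dominant
```

Once sizes (and, similarly, fonts and colors) agree, neighbouring runs no longer differ in formatting and can be merged into one segment before translation.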

chriswendt1 commented 5 years ago

Try Word again with version 2.1.0. The update makes OpenXmlPowerTools work again without destroying the XML well-formedness, thanks to the tip from @jsypkens. That does not help with PowerPoint, because OpenXmlPowerTools has no simplifier for PowerPoint.

chriswendt1 commented 5 years ago

Still needs work on PowerPoint.