MicrosoftTranslator / DocumentTranslator-Legacy

Microsoft Document Translator (Archive) - Replaced by the MicrosoftTranslator/DocumentTranslation project in this repository.

failed to translate html file with MS document translator #156

Closed laf-1226 closed 3 years ago

laf-1226 commented 3 years ago

We translated some HTML files with MS Document Translator. All of the HTML files are well-formed, but translation of some of them failed with the error message: Error while processing document: xxxx.html Object reference not set to an instance of an object.

Here is the example file that failed to translate. The sample file is a little complicated, so we created a small one to reproduce the error:

    <ul><li>xxxxxx</li></ul>

If the <li> element is really big (for example, text length > 5000 characters), the file fails with MS Document Translator, whereas if we put some text between <ul> and <li>, the file is translated successfully.

We don't know why, but we wonder whether something is wrong with the code below:

        private static void AddNodes(HtmlNode rootnode, ref List<HtmlNode> nodes)
        {
            string[] DNTList = { "script", "#text", "code", "col", "colgroup", "embed", "em", "#comment", "image", "map", "media", "meta", "source", "xml"};  //DNT - Do Not Translate - these nodes are skipped.
            HtmlNode child = rootnode;
            while (child != rootnode.LastChild)
            {
                if (!DNTList.Contains(child.Name.ToLowerInvariant())) {
                    if (child.InnerHtml.Length > maxRequestSize)
                    {
                        AddNodes(child.FirstChild, ref nodes);
                    }
                    else
                    {
                        if (child.InnerHtml.Trim().Length != 0) nodes.Add(child);
                    }
                }
                child = child.NextSibling;
            }
        }

Sorry, I was unable to upload the sample files in either .html or .docx format.
Has anyone else run into a similar issue with MS Document Translator, and does anyone know how to fix it? Many thanks!

chriswendt1 commented 3 years ago

Hi @laf-1226, thank you for reporting the error. There is a newer, and I think better, document translation utility in the /DocumentTranslation project, in this same repository. Can you give it a try and see whether it works for your files? Please let me know if you cannot migrate to the newer utility. If that also fails, please attach a sample document. Maybe try renaming the file to .txt and then attaching it.

laf-1226 commented 3 years ago

Hi Chris,

Thanks a lot for your reply. We confirmed that we are using Document Translator version 2.9.3, which seems to be the latest version. The code in this ticket is also from version 2.9.3. If there is a newer version, could you let us know where to get the better document translation utility?

I attached the sample files as you suggested: File_BSN_AP2.txt new_failed.txt new_works.txt. You can see the error message with the first two files; the third one is translated successfully.

Thanks again, aifang

chriswendt1 commented 3 years ago

Thanks, aifang, for attaching the files. I will look at that tomorrow. By the newer utility I meant this: https://github.com/MicrosoftTranslator/DocumentTranslation.

laf-1226 commented 3 years ago

Thank you, Chris! We made a change to the following code: in private static void AddNodes(HtmlNode rootnode, ref List<HtmlNode> nodes), we changed while (child != rootnode.LastChild) to while (child != null).

After rebuilding, Document Translator can handle the HTML files that failed before (e.g. the two sample files). Although it works for those files, we are not sure whether this change is correct or whether it will introduce other issues. Could you help check it? Many thanks!
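
The effect of the two loop conditions can be illustrated with a small stand-alone sketch. It assumes a hypothetical `Node` class that models only the `Name`, `NextSibling`, `FirstChild`, and `LastChild` members used by the loop; it is not HtmlAgilityPack's `HtmlNode`, and `WalkOriginal` / `WalkPatched` are illustrative names, not functions from Document Translator.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for HtmlNode; only the members the loop touches are modelled.
class Node
{
    public string Name;
    public Node NextSibling;
    public Node FirstChild;
    public Node LastChild;

    public Node(string name, params Node[] children)
    {
        Name = name;
        // Link the children into a sibling chain.
        for (int i = 0; i + 1 < children.Length; i++)
            children[i].NextSibling = children[i + 1];
        if (children.Length > 0)
        {
            FirstChild = children[0];
            LastChild = children[children.Length - 1];
        }
    }
}

static class Demo
{
    // Same condition as the quoted 2.9.3 loop: siblings are compared
    // against the LastChild of the node the walk started from.
    static List<string> WalkOriginal(Node first)
    {
        var names = new List<string>();
        Node child = first;
        while (child != first.LastChild)
        {
            names.Add(child.Name);      // child is null once the siblings run out
            child = child.NextSibling;
        }
        return names;
    }

    // Condition proposed in this thread: stop when the siblings run out.
    static List<string> WalkPatched(Node first)
    {
        var names = new List<string>();
        Node child = first;
        while (child != null)
        {
            names.Add(child.Name);
            child = child.NextSibling;
        }
        return names;
    }

    static void Main()
    {
        // Models <ul><li>text</li><li>text</li></ul>: the first <li> has a child of its own.
        var ul = new Node("ul",
            new Node("li", new Node("#text")),
            new Node("li", new Node("#text")));

        try { WalkOriginal(ul.FirstChild); }
        catch (NullReferenceException) { Console.WriteLine("original: NullReferenceException"); }

        Console.WriteLine("patched: " + string.Join(",", WalkPatched(ul.FirstChild)));
    }
}
```

In this model, when the walk starts from a first child that has children of its own, `first.LastChild` is never one of the siblings, so the original condition lets `child` become null before the loop ends. If the first child is a childless text node, `first.LastChild` is null and the original loop stops normally. That would be consistent with the report that adding text between `<ul>` and `<li>` avoids the error, though how the real parser builds text nodes is not verified here.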

chriswendt1 commented 3 years ago

The code was not at all prepared for a single element being larger than maxRequestSize. In my test, your change would not get the element translated either; it would just avoid the failure. In version 2.9.4 I made changes to this function as well as to BreakSentences. You may want to take both. I also updated the binaries.
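
For readers who want a picture of what handling an oversized element could involve, here is a minimal sketch. It is not the 2.9.4 change, which is not shown in this thread. The `Node` type, its `Size` property (a rough stand-in for `InnerHtml.Length`), and the `Collect` function are all hypothetical, and the DNT list is left out for brevity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal node type; it is not HtmlAgilityPack's HtmlNode.
class Node
{
    public string Name;
    public string Text = "";                       // only set on leaf nodes in this sketch
    public List<Node> Children = new List<Node>();

    // Rough stand-in for InnerHtml.Length: total text length at and below this node.
    public int Size
    {
        get
        {
            int n = Text.Length;
            foreach (var c in Children) n += c.Size;
            return n;
        }
    }
}

static class Collector
{
    // Nodes that fit in one request are collected whole. Nodes that are too
    // large are descended into. A leaf that is still too large is reported
    // separately so the caller can split it another way (for example, by sentence).
    public static void Collect(Node node, int maxRequestSize,
                               List<Node> fits, List<Node> oversizedLeaves)
    {
        if (node.Size == 0) return;                                   // nothing to translate
        if (node.Size <= maxRequestSize) { fits.Add(node); return; }
        if (node.Children.Count == 0) { oversizedLeaves.Add(node); return; }
        foreach (var child in node.Children)
            Collect(child, maxRequestSize, fits, oversizedLeaves);
    }

    static void Main()
    {
        var small = new Node { Name = "li", Text = new string('a', 10) };
        var big = new Node { Name = "li", Text = new string('b', 6000) };
        var ul = new Node { Name = "ul" };
        ul.Children.Add(small);
        ul.Children.Add(big);

        var fits = new List<Node>();
        var oversized = new List<Node>();
        Collect(ul, 5000, fits, oversized);

        Console.WriteLine(fits.Count + " node(s) fit, " + oversized.Count + " oversized leaf node(s)");
    }
}
```

Iterating over a parent's child list, rather than starting from `FirstChild` and following `NextSibling`, removes the need for any end-of-siblings comparison. Reporting oversized leaves separately makes explicit the case described above, where avoiding the exception alone would still leave the element untranslated.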