MicrosoftTranslator / DocumentTranslator-Legacy

Microsoft Document Translator (Archive) - Replaced by the MicrosoftTranslator/DocumentTranslation project in this repository.
Other
410 stars 154 forks source link

Infinite loop in SplitList #58

Closed jsypkens closed 5 years ago

jsypkens commented 6 years ago

In DocumentTranslationManager, the method "SplitList" has an infinite loop possibility. When an individual element exceeds the maxSize, it will do the following:

  1. Attempt to add UP TO groupSize elements to the list
  2. Remove them, one by one, as they all exceed the maxSize
  3. Once it gets to the last item in the list, it also exceeds maxSize, so it gets removed
  4. An empty list is added to the "result" list.
  5. Loop - start again at 1 above. Continually adding to the result list.

This is more apparent with Text Translator v3 API, as the element and list sizes are smaller. One contiguous paragraph doesn't need to be very long to cause this behavior.

chriswendt1 commented 6 years ago

Thanks for the tip. I'll have a look.

chriswendt1 commented 6 years ago

@jsypkens, did you have a fix for this already? I'll try to look at it next weekend, but if you have a fix, that would save me.

jsypkens commented 6 years ago

I made a few other changes to the method as well, and the implementation changed a little bit, so I'm not sure if my fix would be what you're looking for.

I separated out the "maxSize" into two parameters - maxListSize and maxElementSize, then looped through each element in values, checking if it would exceed the group or the element size. I also changed the return type to List<List<<Tuple<int,string>>, for two main reasons:

  1. It will track the source line number of each string element, so I can translate all the elements from multiple batches in parallel and then re-assemble the translated elements in the right order
  2. The v3 API calls fail when translating empty strings or whitespace (which is common when extracting text out of documents), so I'm removing those when translating, but re-inserting the white space in the correct place after translation returns

`public static List<List<Tuple<int, string>>> SplitList2(IEnumerable values, int groupSize, int maxListSize, int maxElementSize) { List<List<Tuple<int, string>>> result = new List<List<Tuple<int, string>>>(); int currentLineNumber = 0; int currentListSize = 0;

        List<Tuple<int, string>> newList = new List<Tuple<int, string>>();

        foreach (var value in values)
        {
            if (newList.Count > groupSize)
            {
                result.Add(newList);
                newList = new List<Tuple<int, string>>();
                currentListSize = 0;
            }

            if (value.Length > maxElementSize) //current item is longer than item max, split it on sentence/paragraph
            {
                if (value.Length > maxListSize)
                {
                    //Split this item up! by itself, it's too long for the list. Use BreakSentence API to split on sentences.                        
                    BreakSentenceResponse[] BRR = TranslationServiceFacade.BreakSentences(new string[] { value });
                    List<string> sentences = new List<string>();
                    int currentPosition = 0;
                    foreach (var sentSplitLocation in BRR[0].sentLen) //BRR[0]: we're only passing in one value, we should only get one out
                    {
                        string currentSentence = value.Substring(currentPosition, sentSplitLocation);
                        if (sentences.Count > 0)
                        {
                            //Check to see if we will exceed the maxElementSize if we added this sentence
                            if (sentences.Last().Length + currentSentence.Length + 1 > maxElementSize)
                                sentences.Add(currentSentence); //would exceed the max, so just add it to the list separately
                            else
                                sentences[sentences.Count() - 1] += " " + currentSentence; //wouldn't exceed the max, append to the current list
                        }
                        else //first sentence, just add it
                            sentences.Add(currentSentence);
                        currentPosition += sentSplitLocation;
                    }

                    //loop through sentences, and add them to the list to be translated.
                    foreach (var sent in sentences) //make sure extractedSentence is under maxElementSize
                    {
                        if (currentListSize + sent.Length >= maxListSize || newList.Count >= groupSize)
                        {
                            result.Add(newList);
                            newList = new List<Tuple<int, string>>();
                            currentListSize = 0;
                        }
                        newList.Add(new Tuple<int, string>(currentLineNumber, sent));
                    }

                }
                else  //the element isn't too big for the list by itself...
                {
                    if (currentListSize + value.Length >= maxListSize) //next item will put the list over the max, start a new list
                    {
                        result.Add(newList);
                        newList = new List<Tuple<int, string>>();
                        currentListSize = 0;
                    }

                    BreakSentenceResponse[] BRR = TranslationServiceFacade.BreakSentences(new string[] { value });
                    List<string> sentences = new List<string>();
                    int currentPosition = 0;
                    foreach (var sentSplitLocation in BRR[0].sentLen) //we're only passing in one value, we should only get one out
                    {
                        string currentSentence = value.Substring(currentPosition, sentSplitLocation);
                        if (sentences.Count > 0)
                        {
                            if (sentences.Last().Length + currentSentence.Length + 1 > maxElementSize)
                                sentences.Add(currentSentence);
                            else
                                sentences[sentences.Count() - 1] += " " + currentSentence;
                        }
                        else
                            sentences.Add(currentSentence);
                        currentPosition += sentSplitLocation;
                    }

                    foreach (var sent in sentences) //make sure extractedSentence is under maxElementSize
                    {
                        if (currentListSize + sent.Length >= maxListSize || newList.Count >= groupSize)
                        {
                            result.Add(newList);
                            newList = new List<Tuple<int, string>>();
                            currentListSize = 0;
                        }
                        newList.Add(new Tuple<int, string>(currentLineNumber, sent));
                        currentListSize += sent.Length;
                    }
                }

            }
            else // element does not exceed the max element size
            {
                if (currentListSize + value.Length >= maxListSize) //make sure it won't exceed the current list size
                {
                    result.Add(newList);
                    newList = new List<Tuple<int, string>>();
                    currentListSize = 0;
                }

                newList.Add(new Tuple<int, string>(currentLineNumber, value));
                currentListSize += value.Length;
            }
            currentLineNumber++;
        }
        if (newList.Count > 0)
            result.Add(newList);

        foreach (var list in result)
        {

            int totalLength = 0;
            foreach (var item in list)
            {
                totalLength += item.Item2.Length;
            }
            if (totalLength > maxListSize)
            {
                logger.Error("list exceeds max. why?");
            }
        }

        return result;
    }

`

chriswendt1 commented 6 years ago

@jsypkens, thanks a lot for posting your code. I'll have a look over the weekend. Good use of BreakSentences. That'll be required.

chriswendt1 commented 5 years ago

Thanks, and finally fixed in the 2.3.0 release. I ended up not using your code. I left the SplitList functin mostly intact, added an exit criterion, and allowed segments that are larger than 5000 characters to be translated relardless, by inserting a splitter function that uses the server's BreakSentence method. I would love if you can give that a try and see whether it works for you.