Wrong sourcepage, when section include text from two pages

jomieljaniuk commented 1 year ago

Please provide us with the following information:

This issue is for a: (mark with an `x`)

- [x ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Run app with some pdf data that will be splited in pages After indexing pages to Azure Search, find section that include text from the ending of x page and beginning of x+1 page Ask chatbot about info related to that section from page x+1 Chat will respond correctly but in citation will print pdf from page x, but our info is in page x+1

Expected/desired behavior

Dividing pdf (our data) into sections to be indexed in Azure Search in prepdocs.py file should consider end of page. Information from page x and x+1 should be in separate sections.

Mention any other details that might be useful

Function find_page in file prepdocs.py is looking for page when is the beginng of the section, but do not consider that section can end in next page.

theafgov commented 1 year ago

Did you find an answer to this? I'm having the same issue

jomieljaniuk commented 1 year ago

You have to change prepdocs.py script to split text in range of one page:

# make split_text for each page separately
def split_text(page_map):
    SENTENCE_ENDINGS = [".", "!", "?"]
    WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]
    if args.verbose: print(f"Splitting '{filename}' into sections")

    # make split_text for each page separately
    for (page_num, _, page_text) in page_map:
        length = len(page_text)
        start = 0
        end = length
        while start + SECTION_OVERLAP < length:

Then, each text section as a source will be come from only one page.

hello-lui commented 1 year ago

You have to change prepdocs.py script to split text in range of one page:

# make split_text for each page separately
def split_text(page_map):
    SENTENCE_ENDINGS = [".", "!", "?"]
    WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]
    if args.verbose: print(f"Splitting '{filename}' into sections")

    # make split_text for each page separately
    for (page_num, _, page_text) in page_map:
        length = len(page_text)
        start = 0
        end = length
        while start + SECTION_OVERLAP < length:

Then, each text section as a source will be come from only one page.

@jomieljaniuk I'm having the exact same problem too. Does the modified source work well? Can I see the full contents of the modified prepdocs.py file?

Thank you in advance. ^^

igforce commented 1 year ago

Some additional comments :

The page seems to start at page 0 and so is always off by one [Not a deal breaker]
Using the localpdfparser seems to resolve the issue for the most part where the index content follows the actual page content. But it does happen that it is off on some pages. Not sure whether we get better result with the document intelligence formrecognizer service vs the localpdfparser. I would assume formrecognizer gives better result..
Is there a full PR fix for this ? @pamelafox appreciate if you can have a peek at this issue. Thanks

ivanvaccarics commented 1 year ago

Hi @hello-lui, I revised the code to create sections only for single page. Moreover, I removed the OVERLAP by setting it to 0. Here the code for people interested on this topic.

def split_text(page_map):
    SENTENCE_ENDINGS = [".", "!", "?"]
    WORDS_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]

    # make split_text for each page separately
    for (page_num, _, page_text) in page_map:
        length = len(page_text)
        start = 0
        end = length
        all_text = page_text
        while start + SECTION_OVERLAP < length:
            last_word = -1
            end = start + MAX_SECTION_LENGTH

            if end > length:
                end = length
            else:
                # Try to find the end of the sentence
                while (
                    end < length
                    and (end - start - MAX_SECTION_LENGTH) < SENTENCE_SEARCH_LIMIT
                    and all_text[end] not in SENTENCE_ENDINGS
                ):
                    if all_text[end] in WORDS_BREAKS:
                        last_word = end
                    end += 1
                if end < length and all_text[end] not in SENTENCE_ENDINGS and last_word > 0:
                    end = last_word  # Fall back to at least keeping a whole word
            if end < length:
                end += 1

            # Try to find the start of the sentence or at least a whole word boundary
            last_word = -1
            while (
                start > 0
                and start > end - MAX_SECTION_LENGTH - 2 * SENTENCE_SEARCH_LIMIT
                and all_text[start] not in SENTENCE_ENDINGS
            ):
                if all_text[start] in WORDS_BREAKS:
                    last_word = start
                start -= 1
            if all_text[start] not in SENTENCE_ENDINGS and last_word > 0:
                start = last_word
            if start > 0:
                start += 1

            section_text = all_text[start:end]
            yield (section_text, page_num)

            last_table_start = section_text.rfind("<table")
            if last_table_start > 2 * SENTENCE_SEARCH_LIMIT and last_table_start > section_text.rfind("</table"):
                # If the section ends with an unclosed table, we need to start the next section with the table.
                # If table starts inside SENTENCE_SEARCH_LIMIT, we ignore it, as that will cause an infinite loop for tables longer than MAX_SECTION_LENGTH
                # If last table starts inside SECTION_OVERLAP, keep overlapping
                start = min(end - SECTION_OVERLAP, start + last_table_start)
            else:
                start = end - SECTION_OVERLAP

        if start + SECTION_OVERLAP < end:
            yield (all_text[start:end], page_num)

elhele commented 12 months ago

ivanvaccarics

Thank you for the code! It seems to be working well to get the correct citation, but doesn't it also remove the continuity in the text? If I have some text that starts on page x and ends on page x+1, with the updated code I'm only getting the information from page x. With the initial code the answer was complete.

I mean something like this: Page x: .... Here is the list of programming languages:

Python
Java

Page x+1:

C++
Go ...

With a prompt "Please show me the programming languages" I was getting "Python, Java, C++, Go" with the initial code and only "Python, Java" with this version

ivanvaccarics commented 12 months ago

Hi @elhele yes you are right but i think the approach to maintain continuity or not dependes by the use case. In my use case, by maintaining continuity, the text was not correctly parsed and also it was difficult to the page the correct page for the citation (the file contains a lot of table and HTML tags). You should decide the approach based on the use case you are working on.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.

Azure-Samples / azure-search-openai-demo