Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Our citation code assumes PNGs are based on PDFs #1539

Open pamelafox opened 5 months ago

pamelafox commented 5 months ago

If you upload a PNG, which the new Document Intelligence can OCR just fine, and then ask a question about it, you'll see this error:

Traceback (most recent call last):
  File "/workspaces/azure-search-openai-demo/app/backend/app.py", line 180, in format_as_ndjson
    async for event in r:
  File "/workspaces/azure-search-openai-demo/app/backend/approaches/chatapproach.py", line 152, in run_with_streaming
    extra_info, chat_coroutine = await self.run_until_final_call(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/app/backend/approaches/chatreadretrieveread.py", line 168, in run_until_final_call
    sources_content = self.get_sources_content(results, use_semantic_captions, use_image_citation=False)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/app/backend/approaches/approach.py", line 201, in get_sources_content
    return [
           ^
  File "/workspaces/azure-search-openai-demo/app/backend/approaches/approach.py", line 202, in <listcomp>
    (self.get_citation((doc.sourcepage or ""), use_image_citation)) + ": " + nonewlines(doc.content or "")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/azure-search-openai-demo/app/backend/approaches/approach.py", line 213, in get_citation
    page_number = int(path[page_idx + 1 :])
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: 'pane'

That's due to this code:

def get_citation(self, sourcepage: str, use_image_citation: bool) -> str:
    if use_image_citation:
        return sourcepage
    else:
        path, ext = os.path.splitext(sourcepage)
        if ext.lower() == ".png":
            page_idx = path.rfind("-")
            page_number = int(path[page_idx + 1 :])
            return f"{path[:page_idx]}.pdf#page={page_number}"

        return sourcepage

That made sense when we only supported PDFs and all PNGs were PNGified versions of PDF pages, but it's now incompatible with anyone who just wants to upload plain PNGs.

The solution might be to pass in sourcefile as well, as I think that might still be PDF in the case of vision? Needs some experimentation.
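A minimal sketch of that idea, written as a standalone function rather than the actual method. It assumes a hypothetical `sourcefile` argument is available alongside `sourcepage`, and that it ends in `.pdf` for PNGs generated from PDF pages; as noted above, that part is unverified. The PDF rewrite only happens when the suffix after the last `-` is a number and the source file is a PDF, so a name like `my-pane.png` falls through untouched instead of raising `ValueError`.

```python
import os
from typing import Optional


def get_citation(sourcepage: str, sourcefile: Optional[str] = None, use_image_citation: bool = False) -> str:
    """Return a citation string; rewrite to a PDF page link only for PDF-derived PNGs."""
    if use_image_citation:
        return sourcepage

    path, ext = os.path.splitext(sourcepage)
    if ext.lower() != ".png":
        return sourcepage

    # Split on the last "-"; sep is "" when the name contains no "-" at all.
    base, sep, suffix = path.rpartition("-")

    # Assumption: sourcefile (hypothetical parameter) is the original upload,
    # and is a PDF when the PNG was generated from a PDF page.
    source_is_pdf = sourcefile is not None and sourcefile.lower().endswith(".pdf")

    if not sep or not suffix.isdecimal() or not source_is_pdf:
        # A plain uploaded PNG: cite the image itself.
        return sourcepage

    # Same page number as the existing code (no offset applied here).
    return f"{base}.pdf#page={int(suffix)}"
```

With this shape, `get_citation("my-pane.png", "my-pane.png")` returns the PNG name unchanged, while `get_citation("Benefit_Options-2.png", "Benefit_Options.pdf")` still produces a PDF page link.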

bnodir commented 5 months ago

page_number = int(path[page_idx + 1 :])

It jumps to the correct page when I change the line above to the following:

page_number = int(path[page_idx + 1 :]) + 1

Is it because the PNGified files are numbered starting from 0, like this? Sorry, this is unrelated to the PNG error, though.

Uploaded file name
Benefit_Options-0.png
Benefit_Options-1.png
Benefit_Options-2.png
Benefit_Options-3.png
Benefit_Options.pdf
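For reference, a small sketch of the off-by-one described here. The `#page=` fragment in a PDF URL counts pages from 1, so if the PNG suffixes really are 0-based (as the file names above suggest), the suffix needs a `+ 1` before it goes into the link. The helper name `png_to_pdf_citation` is hypothetical and not part of the repo, and it assumes the input is always a PDF-derived name of the form `<name>-<index>.png`.

```python
import os


def png_to_pdf_citation(png_name: str) -> str:
    """Map '<name>-<index>.png' (assumed 0-based index) to '<name>.pdf#page=<index + 1>'."""
    path, _ext = os.path.splitext(png_name)
    base, _sep, index = path.rpartition("-")
    # Assumption: index is the 0-based page index; #page= is 1-based.
    return f"{base}.pdf#page={int(index) + 1}"


for name in ["Benefit_Options-0.png", "Benefit_Options-3.png"]:
    print(name, "->", png_to_pdf_citation(name))
```

Under that assumption, `Benefit_Options-0.png` maps to page 1 of `Benefit_Options.pdf`, which matches the behavior seen after adding `+ 1`.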