daptatea opened this issue 7 months ago
cc @srbalakr @mattgotteiner I think I saw this as well- I was noticing my citation filenames were missing page numbers and wondered where they went to.
Also seeing this. It looks to me like it stems from a difference in how the skillset's document chunker maps fields.

Integrated vectorization maps `sourcepage` to the filename of the blob:
```python
index_projections = SearchIndexerIndexProjections(
    selectors=[
        SearchIndexerIndexProjectionSelector(
            target_index_name=index_name,
            parent_key_field_name="parent_id",
            source_context="/document/pages/*",
            mappings=[
                InputFieldMappingEntry(name="content", source="/document/pages/*"),
                InputFieldMappingEntry(name="embedding", source="/document/pages/*/vector"),
                InputFieldMappingEntry(name="sourcepage", source="/document/metadata_storage_name"),
            ],
        ),
    ],
)
```
The original (non-integrated) approach maps the source page to the exact chunk within the source file:

```python
"sourcepage": (
    BlobManager.blob_image_name_from_file_page(
        filename=section.content.filename(),
        page=section.split_page.page_num,
    )
    if image_embeddings
    else BlobManager.sourcepage_from_file_page(
        filename=section.content.filename(),
        page=section.split_page.page_num,
    )
),
```
In the index, integrated vectorization leaves it looking like this:

```json
"sourcepage": "file.pdf",
"sourcefile": null,
```

Whereas the non-integrated path in searchmanager.py leaves it looking like this:

```json
"sourcepage": "file-4.pdf",
"sourcefile": "file.pdf"
```
Hi there @pamelafox @mattgotteiner, I'm looking for a solution to this issue. Is there a way to get the chunk's page using integrated vectorization? Specifically, I'm trying to ensure that source page numbers are included in the index.
Any guidance or suggestions on how to achieve this with the integrated vectorization approach would be greatly appreciated.
Found these possible solutions, but both feel suboptimal to me
I asked the AI Search team about this and got a few suggestions:
Are you planning on implementing it?
I am currently exploring the first suggestion @pamelafox provided and trying to merge it into this project to enrich the pipeline with the page number. Unfortunately, this approach adds a lot of overhead and complexity.
Is there any more information on when the page number will be available to extract with integrated vectorization in this solution, @pamelafox?
@DiPbas @casperdamen123 https://github.com/Azure-Samples/azure-search-openai-demo/issues/1287#issuecomment-2394960531
"Each index projection document contains a unique identifying key that the indexer generates in order to ensure uniqueness and allow for change and deletion tracking to work correctly. This key his key contains the following segments: A random hash to guarantee uniqueness. This hash changes if the parent document is updated across indexer runs. The parent document's key. The enrichment annotation path that identifies the context that that document was generated from.
For example, if you split a parent document with key value "123" into four pages, and then each of those pages is projected as its own document via index projections, the key for the third page of text would look something like "01f07abfe7ed_123_pages_2".
If the parent document is then updated to add a fifth page, the new key for the third page might, for example, be "9d800bdacc0e_123_pages_2", since the random hash value changes between indexer runs even though the rest of the projection data didn't change."
Maybe we can use the Microsoft.Skills.Util.ShaperSkill to extract "sourcepage" from "sourcefile" by string pattern `_pages_(\d+)`, or use input sourcing from `/document/content/pages/*/number` after the splitter?
That's not true. The "_pages_X" segment only identifies the chunk number; it doesn't specify the actual page number. I solved it by creating a custom skillset that does the chunking (you can use a custom text splitter or simply langchain/llama-index for this) and also stores the page_num. The function returns the chunks and the page_num for each chunk, so I can store the actual page_num.
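A minimal sketch of that custom-skill approach might look like the following. Everything here is illustrative: the record shape follows the general custom Web API skill request/response convention (`values` with `recordId` and `data`), but the field names (`pages`, `chunks`, `page_num`) and the trivial one-chunk-per-page splitter are assumptions; a real implementation would plug in langchain/llama-index and split long pages further:

```python
def make_skill_response(records: list[dict]) -> dict:
    """Chunk each input record's per-page text and keep the real page
    number on every chunk, so the indexer can map it to 'sourcepage'."""
    values = []
    for record in records:
        pages = record["data"]["pages"]  # assume per-page text is supplied
        chunks = []
        for page_num, page_text in enumerate(pages, start=1):
            # A real splitter would break long pages into overlapping
            # chunks; each chunk still carries its source page number.
            chunks.append({"content": page_text, "page_num": page_num})
        values.append(
            {"recordId": record["recordId"], "data": {"chunks": chunks}}
        )
    return {"values": values}
```

The key design point is simply that the chunker emits `page_num` alongside each chunk, so the page number survives into the index instead of being lost at projection time.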
Do you mean by "it's not true" that the MS docs are buggy? https://learn.microsoft.com/en-us/azure/search/index-projections-concept-intro?tabs=kstore-rest#projected-key-value
I meant that it is not true that you can solve it this way, because what they mean by "pages" here is "chunks", not actual pages of a PDF.