daptatea opened this issue 7 months ago
cc @srbalakr @mattgotteiner I think I saw this as well- I was noticing my citation filenames were missing page numbers and wondered where they went to.
Also seeing this. It looks to me like it stems from a difference in how the skillset's document chunker maps fields.

Integrated vectorization maps `sourcepage` to the filename of the blob:
```python
index_projections = SearchIndexerIndexProjections(
    selectors=[
        SearchIndexerIndexProjectionSelector(
            target_index_name=index_name,
            parent_key_field_name="parent_id",
            source_context="/document/pages/*",
            mappings=[
                InputFieldMappingEntry(name="content", source="/document/pages/*"),
                InputFieldMappingEntry(name="embedding", source="/document/pages/*/vector"),
                InputFieldMappingEntry(name="sourcepage", source="/document/metadata_storage_name"),
            ],
        ),
    ],
)
```
The original (non-integrated) approach maps the source page to the exact chunk within the source file:

```python
"sourcepage": (
    BlobManager.blob_image_name_from_file_page(
        filename=section.content.filename(),
        page=section.split_page.page_num,
    )
    if image_embeddings
    else BlobManager.sourcepage_from_file_page(
        filename=section.content.filename(),
        page=section.split_page.page_num,
    )
),
```
In the index, integrated vectorization leaves it looking like this:

```json
"sourcepage": "file.pdf",
"sourcefile": null,
```

Whereas the non-integrated path in searchmanager.py leaves it looking like this:

```json
"sourcepage": "file-4.pdf",
"sourcefile": "file.pdf"
```
Hi there @pamelafox @mattgotteiner, I'm looking for a solution to this issue. Is there a way to get the chunk's page using integrated vectorization? Specifically, I'm trying to ensure that source page numbers are included in the index.
Any guidance or suggestions on how to achieve this with the integrated vectorization approach would be greatly appreciated.
Found these possible solutions, but both feel suboptimal to me
I asked the AI Search team about this and got a few suggestions:
Are you planning on implementing it?
I am currently exploring the first suggestion @pamelafox provided and trying to merge it into this project to enrich the pipeline with the page number. Unfortunately, this approach adds a lot of overhead and complexity.
Is there any more information on when the page number will be available to extract with integrated vectorization in this solution, @pamelafox?
@DiPbas @casperdamen123 https://github.com/Azure-Samples/azure-search-openai-demo/issues/1287#issuecomment-2394960531
"Each index projection document contains a unique identifying key that the indexer generates in order to ensure uniqueness and allow for change and deletion tracking to work correctly. This key his key contains the following segments: A random hash to guarantee uniqueness. This hash changes if the parent document is updated across indexer runs. The parent document's key. The enrichment annotation path that identifies the context that that document was generated from.
For example, if you split a parent document with key value "123" into four pages, and then each of those pages is projected as its own document via index projections, the key for the third page of text would look something like "01f07abfe7ed_123_pages_2".
If the parent document is then updated to add a fifth page, the new key for the third page might, for example, be "9d800bdacc0e_123_pages_2", since the random hash value changes between indexer runs even though the rest of the projection data didn't change."
Maybe we can use the Microsoft.Skills.Util.ShaperSkill to extract "sourcepage" from "sourcefile" by string pattern `_pages_(\d+)`, or use input sourcing from `/document/content/pages/*/number` after the splitter?
That's not true. The "_pages_X" segment only identifies the chunk number; it doesn't specify the actual page number. I solved it by creating a custom skillset that does the chunking (you can use a custom text splitter or simply langchain/llama-index for this) and also stores the page_num. The function returns the chunks and the page_num for each chunk, so I can store the actual page_num.
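A minimal sketch of that custom-skill approach might look like the following. Everything here is illustrative: the record shape follows the general custom Web API skill request/response convention (`values` with `recordId` and `data`), but the field names (`pages`, `chunks`, `page_num`) and the trivial one-chunk-per-page splitter are assumptions; a real implementation would plug in langchain/llama-index and split long pages further:

```python
def make_skill_response(records: list[dict]) -> dict:
    """Chunk each input record's per-page text and keep the real page
    number on every chunk, so the indexer can map it to 'sourcepage'."""
    values = []
    for record in records:
        pages = record["data"]["pages"]  # assume per-page text is supplied
        chunks = []
        for page_num, page_text in enumerate(pages, start=1):
            # A real splitter would break long pages into overlapping
            # chunks; each chunk still carries its source page number.
            chunks.append({"content": page_text, "page_num": page_num})
        values.append(
            {"recordId": record["recordId"], "data": {"chunks": chunks}}
        )
    return {"values": values}
```

The key design point is simply that the chunker emits `page_num` alongside each chunk, so the page number survives into the index instead of being lost at projection time.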
Do you mean by "it's not true" that the MS docs are buggy? https://learn.microsoft.com/en-us/azure/search/index-projections-concept-intro?tabs=kstore-rest#projected-key-value
I meant that it is not true that you can solve it this way, because what they mean by "pages" here is "chunks", not actual pages of a PDF.