Closed: nekozen closed this issue 1 year ago.
Can you please take a look at this sample, which utilizes the power skill and writes chunks + embeddings to a different index: https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb
In particular, look for the section "Utilities to manage the "chunked" document index - with vector embeddings".
The chunked content is written into a field called "text", and the embeddings are a top-level vector representation.
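A minimal sketch of what a chunk index along those lines might look like through the REST API, assuming the preview API version in use at the time; the index name, the id/parent_id fields, and the 1536 dimensions are illustrative assumptions, and the notebook's own definition may differ:

```python
import requests

# Hypothetical service details - substitute your own.
SERVICE = "https://<your-service>.search.windows.net"
API_KEY = "<admin-api-key>"
API_VERSION = "2023-07-01-Preview"  # vector fields were preview-only at the time

# Chunk text lives in a "text" field; the embedding is a sibling
# top-level vector field rather than being nested under the chunk.
chunk_index = {
    "name": "chunk-index",  # assumed name
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "parent_id", "type": "Edm.String", "filterable": True},
        {"name": "text", "type": "Edm.String", "searchable": True},
        {
            "name": "embedding",
            "type": "Collection(Edm.Single)",
            "searchable": True,
            "dimensions": 1536,  # e.g. text-embedding-ada-002
            "vectorSearchConfiguration": "vector-config",
        },
    ],
    "vectorSearch": {
        "algorithmConfigurations": [{"name": "vector-config", "kind": "hnsw"}]
    },
}

resp = requests.put(
    f"{SERVICE}/indexes/{chunk_index['name']}?api-version={API_VERSION}",
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json=chunk_index,
)
resp.raise_for_status()
```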
Thank you for the reply. Having duplicate indexes seems useless for my use case, so I'm going to think about chunking the data source's content without using a custom skill.
@arv100kri, the notebook notes that this will not scale up for large datasets, and I am hitting a timeout issue in Azure Cognitive Search. There's a note saying this can be solved by running the indexing in parallel. I tried that, but it's still timing out after 2 hours.
For context, my data is 2000+ PDF files that vary in pages and content.
Any advice on how I can move forward and speed up this process?
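One way to parallelize blob indexing, sketched under assumptions: partition the container into virtual folders and give each partition its own data source and indexer, so several indexers run concurrently. All names and the folder layout below are hypothetical, and concurrency is ultimately capped by the service's search units:

```python
import requests

SERVICE = "https://<your-service>.search.windows.net"
API_KEY = "<admin-api-key>"
API_VERSION = "2023-07-01-Preview"
HEADERS = {"api-key": API_KEY, "Content-Type": "application/json"}
CONN_STR = "<blob-connection-string>"

# Assumes the PDFs have been distributed across virtual folders
# part0/ ... part3/ inside one container; each partition gets its
# own data source + indexer so the partitions index concurrently.
for i in range(4):
    ds = {
        "name": f"pdf-ds-part{i}",
        "type": "azureblob",
        "credentials": {"connectionString": CONN_STR},
        "container": {"name": "pdfs", "query": f"part{i}"},  # folder filter
    }
    requests.put(
        f"{SERVICE}/datasources/{ds['name']}?api-version={API_VERSION}",
        headers=HEADERS, json=ds,
    ).raise_for_status()

    indexer = {
        "name": f"pdf-indexer-part{i}",
        "dataSourceName": ds["name"],
        "targetIndexName": "source-index",    # same target index for all
        "skillsetName": "chunking-skillset",  # assumed skillset name
    }
    requests.put(
        f"{SERVICE}/indexers/{indexer['name']}?api-version={API_VERSION}",
        headers=HEADERS, json=indexer,
    ).raise_for_status()
```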
You can also do it with just one index if you can combine the schemas of the "source" and "chunk" indexes - I didn't include that in the notebook, but you can effectively have just the one index.
Have you tried setting a schedule for the indexer? If the timeout is hit after processing a few documents, a scheduled indexer would help make continuous forward progress... the 2-hour limit is unfortunately the case when using skills.
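For reference, a schedule is just an ISO-8601 interval on the indexer definition (PT5M is the minimum). A rough sketch, with all names assumed; since blob indexers track change state (last-modified high-water mark), each scheduled run skips documents that were already indexed and picks up where the previous run left off:

```python
import requests

SERVICE = "https://<your-service>.search.windows.net"
API_KEY = "<admin-api-key>"
API_VERSION = "2023-07-01-Preview"

# PUT is create-or-update, so resubmit the full indexer definition
# with a schedule attached; it then re-runs every 2 hours.
indexer = {
    "name": "pdf-indexer",           # assumed indexer name
    "dataSourceName": "pdf-ds",      # assumed data source
    "targetIndexName": "source-index",
    "skillsetName": "chunking-skillset",
    "schedule": {"interval": "PT2H"},  # ISO-8601 duration, min PT5M
}

resp = requests.put(
    f"{SERVICE}/indexers/{indexer['name']}?api-version={API_VERSION}",
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json=indexer,
)
resp.raise_for_status()
```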
@arv100kri, thank you for your reply, but I don't understand how to combine the indexes.
I found the following issue; collections of vector fields may be supported in the future. If that feature is supported, the custom skill could just add a vector list field to the document, I think.
https://github.com/Azure/cognitive-search-vector-pr/issues/22
@arv100kri, when a schedule is defined, do you know if it "restarts" from the previous run, or does it index documents from the start again?
Hi - could you help me with this guidance? I'm trying to colocate the indexes using this very helpful notebook. I created a complex collection and tried to put the chunks into a collection with the embedding inside. When you say combine the schemas: one index is at the document level and one is at the chunk level, and they have different IDs, so how do you colocate them? Especially when a document is deleted, I want to make sure its chunks also get deleted.
What @arv100kri was suggesting was that your parent documents and chunked documents would live in the same index. This is assuming that they would have different key values (since each chunked document is a more specific version of each parent document, it could just be the parent key + some extra unique identifier for the chunk) so they wouldn't conflict with one another. You would still need two indexers and the intermediary knowledge store in order for this to work, as we don't currently support a way to go from one document during skillset execution to multiple documents in the search index.
I personally would recommend just using the notebook as is, but if you don't care about the parent/source documents, just ignore that index. You can even delete it once you are done indexing all of your data if you want; it just needs to exist so that the first indexer, which creates the embeddings and writes them to the knowledge store, can function correctly.
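A small sketch of the key scheme described above (parent key plus a chunk suffix). Index document keys may only contain letters, digits, dashes, underscores, and equal signs, so arbitrary parent IDs are often base64-encoded first; the field names and the doc_type discriminator below are assumptions:

```python
import base64

def chunk_key(parent_key: str, chunk_ordinal: int) -> str:
    # URL-safe base64 yields only key-safe characters (A-Z, a-z,
    # 0-9, "-", "_", "="), so any parent id becomes a valid key
    # prefix before the chunk suffix is appended.
    safe = base64.urlsafe_b64encode(parent_key.encode()).decode()
    return f"{safe}-chunk-{chunk_ordinal}"

# Hypothetical combined-index documents: one parent record plus its
# chunk records, all in the same index with non-conflicting keys.
parent_doc = {"id": "doc-0001", "doc_type": "parent", "title": "report.pdf"}
chunks = ["first chunk of text...", "second chunk of text..."]
chunk_docs = [
    {
        "id": chunk_key("doc-0001", i),
        "doc_type": "chunk",      # lets you filter parents vs. chunks
        "parent_id": "doc-0001",  # back-reference for lookups/deletes
        "text": text,
    }
    for i, text in enumerate(chunks)
]
```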
Very helpful - thanks for clarifying. I think the notebook as is is actually great, with just the issue of deleting. I guess there's a world where you can delete from the knowledge store based on the parent index deleting a document, and then this all works. Is that a supported pattern?
Yes, the knowledge store supports automatic deletion if anything changes in the parent document, so long as your initial indexer is still running. It can also support deletion of all projections for a particular document that may have been completely removed if the data source has a deletion detection policy set. Note that you will also need to make sure that the second indexer's datasource (the one that looks at the projections from the knowledge store) has a deletion detection policy set if you want the projections that get deleted by the first indexer to be reflected in the chunked index.
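For illustration, a deletion detection policy is set on the data source definition. A sketch assuming the knowledge store projections live in blob storage with soft delete enabled on the storage account; the names and container are hypothetical, and if your projections are table projections a soft-delete-column policy would be used instead:

```python
import requests

SERVICE = "https://<your-service>.search.windows.net"
API_KEY = "<admin-api-key>"
API_VERSION = "2023-07-01-Preview"

# Second data source: reads the knowledge store projections, with
# native blob soft-delete detection so projections removed by the
# first indexer are also removed from the chunked index.
ds = {
    "name": "chunk-projections-ds",
    "type": "azureblob",
    "credentials": {"connectionString": "<storage-connection-string>"},
    "container": {"name": "knowledge-store-projections"},  # assumed
    "dataDeletionDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
    },
}

resp = requests.put(
    f"{SERVICE}/datasources/{ds['name']}?api-version={API_VERSION}",
    headers={"api-key": API_KEY, "Content-Type": "application/json"},
    json=ds,
)
resp.raise_for_status()
```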
All that being said, we definitely know that the current method of handling documents that need to be chunked is cumbersome. We are working on ways to make this easier, so please stay tuned for upcoming feature announcements!
This issue is for a: (mark with an x)

Minimal steps to reproduce
-> cannot save

Any log messages given by the failure

Expected/desired behavior
EmbeddingGenerator: https://github.com/Azure-Samples/azure-search-power-skills/tree/main/Vector/EmbeddingGenerator

OS and Version?

Versions

Mention any other details that might be useful