Azure-Samples / azure-search-power-skills

A collection of useful functions to be deployed as custom skills for Azure Cognitive Search
MIT License

How to define "chunks" field of index for EmbeddingGenerator #119

Closed - nekozen closed this issue 1 year ago

nekozen commented 1 year ago

Please provide us with the following information:

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [x] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

  1. Select an index of Cognitive Search in the Azure portal.
  2. Select the "Edit JSON" button.
  3. Add the following field to the "fields" array and save:

    {
        "name": "chunks",
        "type": "Collection(Edm.ComplexType)",
        "analyzer": null,
        "synonymMaps": [],
        "fields": [
            {
                "name": "contentVector",
                "type": "Collection(Edm.Single)",
                "searchable": true,
                "retrievable": true,
                "dimensions": 1536,
                "vectorSearchConfiguration": "vectorConfig"
            }
        ]
    }

    -> The index cannot be saved.

Any log messages given by the failure

Error message : "InvalidField: The field 'chunks/contentVector' cannot be a vector field. Only a top-level field of the index can be a vector field. Parameters: definition"

Expected/desired behavior

Please let me know how to define the "chunks" field of the index for EmbeddingGenerator. Maybe the definition above is wrong. I think a "chunks" field needs to be added to the index for EmbeddingGenerator.

EmbeddingGenerator: https://github.com/Azure-Samples/azure-search-power-skills/tree/main/Vector/EmbeddingGenerator

OS and Version?

Azure portal on Windows, using Edge.

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

arv100kri commented 1 year ago

Can you please take a look at this sample, which utilizes the power skill and writes chunks + embeddings to a different index?

https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb

In particular, look for the section "Utilities to manage the "chunked" document index - with vector embeddings".

The chunked content is written into a field called "text", and the embeddings are stored in a top-level vector field.
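For reference, here is a minimal sketch of what the fields of such a chunk index might look like, reusing the 1536-dimension "vectorConfig" setup from the snippet above; the "id" field name is illustrative and the exact schema in the notebook may differ:

    "fields": [
        { "name": "id", "type": "Edm.String", "key": true, "filterable": true },
        { "name": "text", "type": "Edm.String", "searchable": true, "retrievable": true },
        {
            "name": "contentVector",
            "type": "Collection(Edm.Single)",
            "searchable": true,
            "retrievable": true,
            "dimensions": 1536,
            "vectorSearchConfiguration": "vectorConfig"
        }
    ]

The key point relative to the error above is that the vector field sits at the top level of the index rather than inside a complex "chunks" collection.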

nekozen commented 1 year ago

Thank you for the reply. Having duplicate indexes doesn't seem useful for my use case, so I'm going to look into chunking the data in the data source without using a custom skill.

> Can you please take a look at this sample, which utilizes the power skill and writes chunks + embeddings to a different index?
>
> https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb
>
> In particular, look for the section "Utilities to manage the "chunked" document index - with vector embeddings".
>
> The chunked content is written into a field called "text", and the embeddings are stored in a top-level vector field.

miggytrinidad commented 1 year ago

@arv100kri, the notebook says that this approach will not scale for large datasets. I am facing a timeout issue in Azure Cognitive Search. There's a note saying this can be addressed by running the indexing in parallel; I tried that, but it's still timing out after 2 hours.

For context, my data is 2000+ PDF files that vary in page count and content.

Any advice on how I can move forward and speed up this process?

> Can you please take a look at this sample, which utilizes the power skill and writes chunks + embeddings to a different index?
>
> https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb
>
> In particular, look for the section "Utilities to manage the "chunked" document index - with vector embeddings".
>
> The chunked content is written into a field called "text", and the embeddings are stored in a top-level vector field.

arv100kri commented 1 year ago

> Thank you for the reply. Having duplicate indexes doesn't seem useful for my use case, so I'm going to look into chunking the data in the data source without using a custom skill.
>
> > Can you please take a look at this sample, which utilizes the power skill and writes chunks + embeddings to a different index? https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb In particular, look for the section "Utilities to manage the "chunked" document index - with vector embeddings". The chunked content is written into a field called "text", and the embeddings are stored in a top-level vector field.

You can also do it with just one index if you can combine the schemas of the "source" and "chunk" indexes. I didn't include that in the notebook, but you can effectively have just the one index.
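For illustration, a rough sketch of what a combined schema's fields might look like; the field names below (parent_id, title, etc.) are hypothetical and not taken from the notebook. Source-level and chunk-level fields sit side by side at the top level, and chunk documents point back to their parent through a filterable parent_id:

    "fields": [
        { "name": "id", "type": "Edm.String", "key": true },
        { "name": "parent_id", "type": "Edm.String", "filterable": true },
        { "name": "title", "type": "Edm.String", "searchable": true },
        { "name": "text", "type": "Edm.String", "searchable": true },
        {
            "name": "contentVector",
            "type": "Collection(Edm.Single)",
            "searchable": true,
            "dimensions": 1536,
            "vectorSearchConfiguration": "vectorConfig"
        }
    ]

In such a layout, source documents would populate fields like "title" and leave the chunk fields empty, while chunk documents would populate "text", "contentVector", and "parent_id".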

arv100kri commented 1 year ago

> @arv100kri, the notebook says that this approach will not scale for large datasets. I am facing a timeout issue in Azure Cognitive Search. There's a note saying this can be addressed by running the indexing in parallel; I tried that, but it's still timing out after 2 hours.
>
> For context, my data is 2000+ PDF files that vary in page count and content.
>
> Any advice on how I can move forward and speed up this process?
>
> > Can you please take a look at this sample, which utilizes the power skill and writes chunks + embeddings to a different index? https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb In particular, look for the section "Utilities to manage the "chunked" document index - with vector embeddings". The chunked content is written into a field called "text", and the embeddings are stored in a top-level vector field.

Have you tried setting a schedule for the indexer? If the timeout is hit after processing a few documents, a scheduled indexer lets you make continuous forward progress... the 2-hour limit is unfortunately the case when using skills.
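As a sketch, a schedule goes on the indexer definition itself; the indexer, data source, skillset, and index names below are hypothetical, and the 2-hour interval is only an example (the minimum allowed interval is 5 minutes):

    {
        "name": "chunk-indexer",
        "dataSourceName": "chunk-datasource",
        "targetIndexName": "chunk-index",
        "skillsetName": "embedding-skillset",
        "schedule": {
            "interval": "PT2H",
            "startTime": "2023-01-01T00:00:00Z"
        }
    }

For data sources that support change detection, subsequent scheduled runs skip documents that were already processed successfully, so the indexer makes forward progress run over run.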

nekozen commented 1 year ago

@arv100kri, thank you for your reply, but I don't understand how to combine the indexes.

> You can also do it with just one index if you can combine the schemas of the "source" and "chunk" indexes. I didn't include that in the notebook, but you can effectively have just the one index.

I found the following issue, and it sounds like collections of vector fields may be supported in the future. If that feature is supported, the custom skill should just need to add a vector list field to the document, I think.

https://github.com/Azure/cognitive-search-vector-pr/issues/22

miggytrinidad commented 1 year ago

@arv100kri, when a schedule is defined, do you know if it "restarts" from the previous run, or does it index the documents from the start again?

marcrousset-ow commented 1 year ago

> You can also do it with just one index if you can combine the schemas of the "source" and "chunk" indexes. I didn't include that in the notebook, but you can effectively have just the one index.

Hi - could you help me with this guidance? I'm trying to co-locate the indexes using this very helpful notebook. I created a complex collection and tried to put the chunks in it as a collection with the embedding inside. When you say combine the schemas... one index is at the document level and one is at the chunk level, and they have different IDs - how do you co-locate them? Especially when a document is deleted, I want to make sure its chunks also get deleted.

Careyjmac commented 1 year ago

> > You can also do it with just one index if you can combine the schemas of the "source" and "chunk" indexes. I didn't include that in the notebook, but you can effectively have just the one index.
>
> Hi - could you help me with this guidance? I'm trying to co-locate the indexes using this very helpful notebook. I created a complex collection and tried to put the chunks in it as a collection with the embedding inside. When you say combine the schemas... one index is at the document level and one is at the chunk level, and they have different IDs - how do you co-locate them? Especially when a document is deleted, I want to make sure its chunks also get deleted.

What @arv100kri was suggesting is that your parent documents and chunked documents would live in the same index. This assumes they have different key values (since each chunked document is a more specific version of its parent document, the key could just be the parent key + some extra unique identifier for the chunk), so they wouldn't conflict with one another. You would still need two indexers and the intermediary knowledge store in order for this to work, as we don't currently support a way to go from one document during skillset execution to multiple documents in the search index.

I personally would recommend just using the notebook as is; if you don't care about the parent/source documents, just ignore that index. You can even delete it once you are done indexing all of your data if you want - it just needs to exist so that the first indexer, which creates the embeddings and writes them to the knowledge store, can function correctly.
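Purely as an illustration of the key scheme described above (the exact key format is up to you, and real vectors would have 1536 dimensions), a parent document and its chunks could coexist in a single index like this:

    [
        { "id": "doc42", "title": "Example source document" },
        { "id": "doc42_chunk_0", "parent_id": "doc42", "text": "First chunk of text...", "contentVector": [0.01, -0.02, 0.03] },
        { "id": "doc42_chunk_1", "parent_id": "doc42", "text": "Second chunk of text...", "contentVector": [0.04, 0.05, -0.06] }
    ]

Because the chunk keys embed the parent key, a filter such as parent_id eq 'doc42' (or a key prefix convention) makes it straightforward to find and delete all chunks belonging to a deleted parent.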

marcrousset-ow commented 1 year ago

> > > You can also do it with just one index if you can combine the schemas of the "source" and "chunk" indexes. I didn't include that in the notebook, but you can effectively have just the one index.
> >
> > Hi - could you help me with this guidance? I'm trying to co-locate the indexes using this very helpful notebook. I created a complex collection and tried to put the chunks in it as a collection with the embedding inside. When you say combine the schemas... one index is at the document level and one is at the chunk level, and they have different IDs - how do you co-locate them? Especially when a document is deleted, I want to make sure its chunks also get deleted.
>
> What @arv100kri was suggesting is that your parent documents and chunked documents would live in the same index. This assumes they have different key values (since each chunked document is a more specific version of its parent document, the key could just be the parent key + some extra unique identifier for the chunk), so they wouldn't conflict with one another. You would still need two indexers and the intermediary knowledge store in order for this to work, as we don't currently support a way to go from one document during skillset execution to multiple documents in the search index.
>
> I personally would recommend just using the notebook as is; if you don't care about the parent/source documents, just ignore that index. You can even delete it once you are done indexing all of your data if you want - it just needs to exist so that the first indexer, which creates the embeddings and writes them to the knowledge store, can function correctly.

Very helpful - thanks for clarifying. I think the notebook as is is actually great, with just the issue of deleting. I guess there's a world where you can delete from the knowledge store when the parent index deletes a document, and then this all works. Is that a pattern that exists?

Careyjmac commented 1 year ago

Yes, the knowledge store supports automatic deletion of projections if anything changes in the parent document, so long as your initial indexer is still running. It can also delete all projections for a document that has been completely removed, provided the data source has a deletion detection policy set. Note that you will also need to make sure that the second indexer's data source (the one that reads the projections from the knowledge store) has a deletion detection policy set if you want the projections deleted by the first indexer to be reflected in the chunked index.
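As a sketch of that second piece, the deletion detection policy lives on the data source definition that the second indexer reads from. The names below are hypothetical, and the right policy type depends on where the knowledge store projections are written (a table, blob container, etc.); the soft-delete column variant shown here is just one example of the shape:

    {
        "name": "chunk-projections-datasource",
        "type": "azuretable",
        "credentials": { "connectionString": "<knowledge store storage connection string>" },
        "container": { "name": "chunkProjections" },
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
            "softDeleteColumnName": "IsDeleted",
            "softDeleteMarkerValue": "true"
        }
    }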

All that being said, we definitely know that the current method of handling documents that need to be chunked is cumbersome. We are working on ways to make this easier, so please stay tuned for upcoming feature announcements!