Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Error on integration with SharePoint online document library #304

Open nitedan opened 1 year ago

nitedan commented 1 year ago

Hello,

I tried to integrate with a SharePoint Online document library by following https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online. The deployment completed without errors, but when I try to access the chat, the backend page fails with the error "error source page".


Index structure

```json
{
  "name": "sharepoint-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
    { "name": "metadata_spo_item_name", "type": "Edm.String", "key": false, "searchable": true, "filterable": false, "sortable": false, "facetable": false },
    { "name": "metadata_spo_item_path", "type": "Edm.String", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
    { "name": "metadata_spo_item_content_type", "type": "Edm.String", "key": false, "searchable": false, "filterable": true, "sortable": false, "facetable": true },
    { "name": "metadata_spo_item_last_modified", "type": "Edm.DateTimeOffset", "key": false, "searchable": false, "filterable": false, "sortable": true, "facetable": false },
    { "name": "metadata_spo_item_size", "type": "Edm.Int64", "key": false, "searchable": false, "filterable": false, "sortable": false, "facetable": false },
    { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
  ]
}
```



This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [ ] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Minimal steps to reproduce

Any log messages given by the failure

Expected/desired behavior

OS and Version?

Windows 7, 8 or 10. Linux (which distribution). macOS (Yosemite? El Capitan? Sierra?)

azd version?

run azd version and copy paste here.

Versions

Mention any other details that might be useful


Thanks! We'll be in touch soon.

nitedan commented 1 year ago

Hello,

Has anyone else managed to integrate with a SharePoint Online document library?

rene-haskia commented 1 year ago

I got this to work. You need to replace the environment variables with the fields provided by your SharePoint index. Have a look at my examples:

KB_FIELDS_CONTENT="content"
KB_FIELDS_CATEGORY="metadata_spo_item_content_type"
KB_FIELDS_SOURCEPAGE="metadata_spo_item_weburi"
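For context, the idea behind these variables can be sketched like this: the backend reads the field names from the environment and falls back to the defaults used by the built-in gptkbindex. This is a hypothetical sketch, not the repo's actual code; the fallback names and the `doc_to_citation` helper are illustrative assumptions.

```python
import os

# Illustrative sketch: resolve search-index field names from environment
# variables, falling back to the defaults of the sample's gptkbindex.
KB_FIELDS_CONTENT = os.environ.get("KB_FIELDS_CONTENT", "content")
KB_FIELDS_CATEGORY = os.environ.get("KB_FIELDS_CATEGORY", "category")
KB_FIELDS_SOURCEPAGE = os.environ.get("KB_FIELDS_SOURCEPAGE", "sourcepage")

def doc_to_citation(doc: dict) -> str:
    """Build a citation string from a search result using the configured fields.

    Hypothetical helper for illustration only.
    """
    source = doc.get(KB_FIELDS_SOURCEPAGE, "unknown")
    content = doc.get(KB_FIELDS_CONTENT, "")
    return f"{source}: {content}"
```

With the SharePoint variables above set, the same code would read `metadata_spo_item_weburi` and `content` from each hit instead of the blob-based defaults.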

I am still tweaking the semantic fields and configuration for better performance and output. When the content has many line breaks, it is often too big and triggers the token-limit error.

Feel free to connect when you get it to work. I'd love to learn more about best practices for connecting to SharePoint. 👉https://www.linkedin.com/in/ren%C3%A9-haskia-1381b7208

SiPearson commented 1 year ago

> KB_FIELDS_CONTENT="content" KB_FIELDS_CATEGORY="metadata_spo_item_content_type" KB_FIELDS_SOURCEPAGE="metadata_spo_item_weburi"

Hi, I'm getting the error 'sourcepage' after creating a SharePoint data source and indexer and adding a semantic configuration. Where do I add the environment variables?

thank you :)

rene-haskia commented 1 year ago

Try adding the environment variables in Azure Portal and restart the web app.

Go here: Your App Service > Configuration > Application Settings
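As an alternative to the Portal, the same settings can be applied from the command line with the Azure CLI. This is a sketch; the resource group and app names are placeholders you would replace with your own.

```shell
# Set the field overrides on the backend App Service (placeholder names).
az webapp config appsettings set \
  --resource-group my-rg \
  --name my-backend-app \
  --settings \
    KB_FIELDS_CONTENT="content" \
    KB_FIELDS_CATEGORY="metadata_spo_item_content_type" \
    KB_FIELDS_SOURCEPAGE="metadata_spo_item_weburi"

# Restart so the app picks up the new settings.
az webapp restart --resource-group my-rg --name my-backend-app
```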

dGanguly1 commented 1 year ago

Hi. I'm trying to implement the same, but I have a follow-up question to this. My cognitive-search service has multiple indexes (one for sharepoint, one for azure data lake, etc). How (or if) can I modify the code to query against multiple indexes instead of just using gptkbindex?

rene-haskia commented 1 year ago

> Hi. I'm trying to implement the same, but I have a follow-up question to this. My cognitive-search service has multiple indexes (one for sharepoint, one for azure data lake, etc). How (or if) can I modify the code to query against multiple indexes instead of just using gptkbindex?

I have not tested this myself, but I think this repo is a good starting point for adding multiple indexes/knowledge bases: https://github.com/Azure-Samples/openai/tree/main/End_to_end_Solutions/AOAISearchDemo
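One simple way to think about multi-index querying is to fan the query out to each index and merge the hits by score. This is a hypothetical sketch assuming the `azure-search-documents` package; `merge_hits` and `search_all` are illustrative names, not part of the repo.

```python
def merge_hits(per_index_hits, top=3):
    """Merge (score, doc) pairs gathered per index, best score first.

    per_index_hits: dict mapping index name -> list of (score, doc) tuples.
    Returns a list of (score, index_name, doc) tuples, truncated to `top`.
    """
    merged = []
    for index_name, hits in per_index_hits.items():
        for score, doc in hits:
            merged.append((score, index_name, doc))
    merged.sort(key=lambda h: h[0], reverse=True)
    return merged[:top]

def search_all(query, index_names, endpoint, key, top=3):
    """Query several indexes on one search service and merge the results."""
    # azure-search-documents is only needed for the live query path.
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    per_index = {}
    for name in index_names:
        client = SearchClient(endpoint, name, AzureKeyCredential(key))
        per_index[name] = [(d["@search.score"], d) for d in client.search(query, top=top)]
    return merge_hits(per_index, top)
```

Note this naively compares scores across indexes, which is only a rough heuristic; scores from different indexes (or different semantic configurations) are not strictly comparable.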

SiPearson commented 1 year ago

> When content has many line breaks it is often too big and causes the token limit message.

I'm seeing this a lot:

Error: This model's maximum context length is 8193 tokens, however you requested 10996 tokens (9972 in your prompt; 1024 for the completion). Please reduce your prompt; or completion length.

Is that what you're talking about?

I can ask the same question and one time it works perfectly; the next time I see this error. Is that what you're seeing too? This happens when using SharePoint as the source for the index; if I put the same documents in a folder and process them the usual way, the error doesn't occur. It seems to me that the documents need to be broken down into chunks rather than indexing whole PDFs.
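The chunking idea can be sketched as follows: split the extracted text into overlapping windows before indexing, so no single search hit exceeds the prompt budget. This is an illustrative sketch (the function name and sizes are assumptions); it uses rough character counts rather than exact token counts.

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200):
    """Split text into overlapping chunks of at most max_chars characters.

    The overlap keeps sentences that straddle a boundary retrievable from
    both neighboring chunks.
    """
    assert overlap < max_chars, "overlap must be smaller than the chunk size"
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so consecutive chunks share text
    return chunks
```

Each chunk would then be indexed as its own document, carrying the parent file's source field so citations still work.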

PfisterAn commented 1 year ago

Having the same challenge; I was looking into the concept of skillsets in the search service, where a split-text skill could help. But essentially the content may get garbled; I'm not sure there is a good way to address it. Perhaps the app could also create chunks and feed the information in separately.
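For reference, the split-text skill mentioned above does exist in Azure AI Search skillsets as `Microsoft.Skills.Text.SplitSkill`. A minimal sketch of such a skill definition might look like the following (the page length and source paths are assumptions to adapt to your pipeline):

```json
{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "context": "/document",
  "textSplitMode": "pages",
  "maximumPageLength": 2000,
  "inputs": [
    { "name": "text", "source": "/document/content" }
  ],
  "outputs": [
    { "name": "textItems", "targetName": "pages" }
  ]
}
```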

10k tokens would not be a big deal; it may be a good idea to switch to a model that can handle more tokens (up to 32k should be possible with gpt-4-32k). In addition, I was thinking of fine-tuning the model on the additional content (if that is even feasible or possible), but with that approach the ability to reference the sources would disappear.

rene-haskia commented 1 year ago

I just wanted to share that gpt-4-32k returns better results - and using skillsets to create page keywords for the semantic configuration helped a bit. However, I have the feeling that this solution does not match the results which come from files in Blob Storage - now that you can create embeddings on the fly, too. Also, different languages are not working well.

Maybe the approach on this page does the trick, when you combine it with a SharePoint Indexer: https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.