Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Indexing pdf documents which are not in English #689

Open suma-sai-paluri opened 12 months ago

suma-sai-paluri commented 12 months ago

Minimal steps to reproduce

I have a few questions that are not about bugs. I have PDF documents in Italian and want to use this repository to query them. After deploying this repo, I found the results satisfactory with the semantic ranker, fine with vectors, but unsatisfactory with text-only search. My goal is to get good results using only embeddings (vectors) and decent results with text, since the number of semantic searches available is currently limited. I'd appreciate help with the following questions:

1) What is the optimal method for indexing? Should I opt for a language-specific analyzer (such as Lucene Italian) or a language-agnostic one (like standard Lucene)?

2) Would it be advisable to modify the prompt instructions, specifying not to translate the queries into English, given that my data is in Italian?

3) My documents also contain a large number of tables and images in Italian. Will Form Recognizer be effective enough at extracting information from them?

Please let me know if you have any other suggestions which you think might be helpful in my case.

Mention any other details that might be useful

If anyone has experience using a data repository with content in a language other than English and with a significant number of images and tables, it would be very helpful to hear about your experience or any modifications that yielded good results. Your insights would be invaluable.

srbalakr commented 12 months ago

Hi @suma-sai-paluri, did you try querying with the Vector + Text (Hybrid) option together with the semantic ranker?

Regarding your questions:

  1. I suggest trying both and seeing which one better fits your needs. Standard Lucene is the default and should work well too.
  2. The prompt asks for translation to cover the case where the data is in English and the user's query is not. You can modify the prompt and see whether it yields a better search query.
  3. It should be able to process them. @pamelafox for more info.

suma-sai-paluri commented 11 months ago

[Image: Screenshot 2023-10-04 092554]

Hi @srbalakr, thank you for your suggestions. I am experimenting with the ideas above and will post an update once I get decent results. However, with one of the latest commits, 22a45dc, I get no responses in the Chat scenario, although the Ask scenario still returns answers. Is commit 22a45dc deployable in the West Europe location? I ask because the location list in the main.bicep file does not contain WestEurope, so I added it to that list myself. Thank you

ks6088ts commented 10 months ago

Hi @suma-sai-paluri, I've just added a feature to support non-English languages (ref #780). You can configure the language settings via:

# To use a non-English language such as Japanese, replace the placeholders below
azd env set AZURE_SEARCH_QUERY_LANGUAGE {Name of query language}
azd env set AZURE_SEARCH_QUERY_SPELLER none
azd env set AZURE_SEARCH_ANALYZER_NAME {Name of analyzer}
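For the Italian documents in the original question, the placeholders above might be filled in roughly as follows. This is only a sketch. The values `it-it` and `it.lucene` are assumptions based on Azure AI Search's naming for query languages and Lucene language analyzers, and they are not confirmed anywhere in this thread. The `run` helper and the `DRY_RUN` switch are hypothetical conveniences for previewing the commands and are not part of `azd`.

```shell
#!/usr/bin/env bash
# Sketch: language settings for Italian content. All values are assumptions.
set -eu

QUERY_LANGUAGE="it-it"     # assumed query language code for Italian
QUERY_SPELLER="none"       # speller disabled, as in the snippet above
ANALYZER_NAME="it.lucene"  # assumed Lucene Italian analyzer name

# Hypothetical helper: print each command by default; set DRY_RUN=0 to run it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

run azd env set AZURE_SEARCH_QUERY_LANGUAGE "$QUERY_LANGUAGE"
run azd env set AZURE_SEARCH_QUERY_SPELLER "$QUERY_SPELLER"
run azd env set AZURE_SEARCH_ANALYZER_NAME "$ANALYZER_NAME"
```

Run as-is, the script only prints the three commands. Running it with `DRY_RUN=0` applies them to the current azd environment. An analyzer change generally only affects documents indexed after the index is rebuilt, so the PDFs would likely need to be re-indexed afterwards.
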

I hope it helps :)

suma-sai-paluri commented 10 months ago

Hi @ks6088ts thank you for adding this feature. I am excited to work with it. Thank you

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this issue will be closed.