Azure-Samples / azure-search-openai-demo-csharp

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure Cognitive Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
MIT License

Ensure chunked PDF documents are never bigger than 500 tokens, support CJK and fix bug with tiny documents #303

Open tonybaloney opened 6 months ago

tonybaloney commented 6 months ago

Purpose

  1. Better support for CJK documents that use ideographic and full-width Unicode punctuation marks.
  2. Implement a recursive character splitting algorithm to make sure that all sections are < 500 tokens (the limit for Azure AI Search for this model).
  3. Also fixes #304.

Both changes are based on improvements made to the Python sample.
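
As a rough illustration of the first two points, here is a minimal Python sketch of recursive splitting that also recognises CJK sentence endings. It is not the code from this PR or from the Python sample. The names `split_recursively`, `estimate_tokens`, and `SENTENCE_ENDINGS` are hypothetical, the token counter is a crude stand-in for a real tokenizer, and the limit is treated here as "at most 500".

```python
from typing import Callable, List

# Assumption: an illustrative set of sentence endings, including the
# ideographic full stop and full-width marks found in CJK text.
SENTENCE_ENDINGS = {".", "!", "?", "。", "！", "？"}
MAX_TOKENS = 500


def estimate_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer.

    Counts each ASCII word as one token and each non-ASCII character
    (for example a CJK ideograph) as one token.
    """
    ascii_words = "".join(ch if ch.isascii() else " " for ch in text).split()
    non_ascii_chars = sum(1 for ch in text if not ch.isascii())
    return len(ascii_words) + non_ascii_chars


def split_recursively(
    text: str,
    max_tokens: int = MAX_TOKENS,
    count_tokens: Callable[[str], int] = estimate_tokens,
) -> List[str]:
    """Split text into chunks whose estimated token count is <= max_tokens."""
    text = text.strip()
    if not text:
        return []
    # Small enough (or a single character that cannot be split further).
    if count_tokens(text) <= max_tokens or len(text) == 1:
        return [text]

    mid = len(text) // 2
    split_at = None
    # Search outwards from the midpoint for the nearest sentence ending,
    # keeping both halves non-empty so the recursion always terminates.
    for offset in range(mid):
        for pos in (mid - offset, mid + offset):
            if pos < len(text) - 1 and text[pos] in SENTENCE_ENDINGS:
                split_at = pos + 1  # keep the punctuation with the left half
                break
        if split_at is not None:
            break
    if split_at is None:
        split_at = mid  # no sentence boundary found: fall back to the midpoint

    return (
        split_recursively(text[:split_at], max_tokens, count_tokens)
        + split_recursively(text[split_at:], max_tokens, count_tokens)
    )
```

Calling `split_recursively(page_text)` on a long page returns a list of chunks that each fit the limit under the estimated count, while a very short input comes back as a single chunk. In practice `count_tokens` would be replaced by the tokenizer that matches the embedding model.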

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

git clone [repo-address]
cd [repo-name]
git checkout [branch-name]
npm install

What to Check

Verify that the following are valid

Other Information

tonybaloney commented 6 months ago

The https://github.com/Azure-Samples/azure-search-openai-demo-csharp/pull/303/commits/e734ef135bccf89c5bcb11ab155646869f124a88 commit should fail; I added a test to prove #304.

luisquintanilla commented 6 months ago

The changes in this PR are being introduced in SK as part of microsoft/semantic-kernel#5489

Once that's merged, this PR will be updated to reflect those changes.

cc: @tonybaloney

LittleLittleCloud commented 6 months ago

@tonybaloney Looks like the SK PR has been merged. Are you still planning to update this PR to reflect that change?

tonybaloney commented 6 months ago

Yes, I'll wait for a new release of SK so I can test the changes.