Azure-Samples / azure-search-openai-demo-java

This repo is the Java version of Microsoft's sample app for ChatGPT + Enterprise data.
MIT License

Text splitting doesn't work with tiny documents #78

Closed tonybaloney closed 3 months ago

tonybaloney commented 3 months ago

When an indexed document has fewer than 1000 characters, the text splitter does not yield any pages, so nothing is sent to the search service.

https://github.com/Azure-Samples/azure-search-openai-demo-java/blob/166ffda8f86e67830292724f4bf2322e26d9cb8f/app/indexer/core/src/main/java/com/microsoft/openai/samples/indexer/parser/TextSplitter.java#L59

This is the fix for the Python sample on which this function was based: https://github.com/Azure-Samples/azure-search-openai-demo/commit/e835da37aead8add52d210a7593663ce3c928229

dantelmomsft commented 3 months ago

@tonybaloney I tried it with a PDF containing 754 characters and it worked. Can you share the PDF document you used?

tonybaloney commented 3 months ago

I've submitted a failing test that reproduces this in #81, and I'll add a patch as well. I don't think the tests are being run as part of the GitHub Actions workflow for PRs.

dantelmomsft commented 3 months ago

@tonybaloney I've figured out the problem: it happens for documents with fewer than 100 characters (due to the overlap parameter). I've reviewed your PR and merged it.
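
For readers following along, here is a minimal sketch of how an overlap parameter can cause this kind of failure. It is not the actual `TextSplitter.java` code: the class name, method names, and the constants (1000 and 100, taken from the numbers mentioned in this thread) are assumptions for illustration only. The idea is that a loop guarded by `start + overlap < length` never runs when the whole text is no longer than the overlap, so no section is emitted. Returning short text as a single section is one possible guard, not necessarily the one used in the merged PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified splitter used only to illustrate the failure mode.
final class TinyDocSplitSketch {
    // Assumed values, chosen to match the numbers discussed in this thread.
    static final int MAX_SECTION_LENGTH = 1000;
    static final int SECTION_OVERLAP = 100;

    // Without a guard: text no longer than the overlap never enters the loop,
    // so the returned list is empty.
    static List<String> splitWithoutGuard(String text) {
        List<String> sections = new ArrayList<>();
        int length = text.length();
        int start = 0;
        while (start + SECTION_OVERLAP < length) {
            int end = Math.min(start + MAX_SECTION_LENGTH, length);
            sections.add(text.substring(start, end));
            if (end == length) {
                return sections; // reached the end of the text
            }
            // Step back by the overlap so consecutive sections share context.
            start = end - SECTION_OVERLAP;
        }
        return sections;
    }

    // With a guard: any non-empty text that fits in one section is returned whole.
    static List<String> splitWithGuard(String text) {
        if (text.isEmpty()) {
            return List.of();
        }
        if (text.length() <= MAX_SECTION_LENGTH) {
            return List.of(text);
        }
        return splitWithoutGuard(text);
    }
}
```

In this sketch a 754-character text still produces one section even without the guard, which would be consistent with the earlier 754-character PDF working, while a 50-character text produces none.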

tonybaloney commented 3 months ago

Thanks! I couldn't figure out in the PR how to get the tests to run from the CI workflow. Could you please take a look? I can run them locally.