TanGentleman / Augmenta

Automate RAG-powered workflows
MIT License

Add semantic chunking #21

Open TanGentleman opened 2 months ago

TanGentleman commented 2 months ago

I want to do this alongside the migration to Unstructured. I'll figure out how much of a difference there is between my current implementation and something like spaCy for, say, a long speech in a .txt file.
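For reference, here's a rough sketch of the kind of sentence-aware chunking being compared here. It isn't the current Augmenta implementation, and it uses a naive regex splitter as a stand-in for spaCy's sentence segmentation so it runs with only the standard library. The names `split_sentences`, `chunk_sentences`, and `max_chars` are made up for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter (assumption: a stand-in for a real segmenter like spaCy)."""
    # Split on whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_sentences(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        # Account for the joining space when the chunk already has content.
        added = len(sentence) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(" ".join(current))
            current = [sentence]
            current_len = len(sentence)
        else:
            current.append(sentence)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


speech = "One two. Three four! Five six? Seven."
print(chunk_sentences(split_sentences(speech), max_chars=20))
```

The point of packing whole sentences is that no chunk ends mid-sentence, which is the main thing a fixed-size character splitter gets wrong on something like a speech transcript.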

TanGentleman commented 1 month ago

Playing around with UnstructuredFileLoader, which partitions the PDF into various elements, is probably the best way to get really precise with it. For now, I'm not sure it'll affect the quality of my outputs all that much, but I'll do some more testing with loading docs using different Unstructured loaders/params.
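To make the element idea concrete, here's a small sketch of grouping partitioned elements into section-level chunks. It doesn't call Unstructured at all: `Element` is a hypothetical stand-in that assumes each partitioned element carries a category label (such as "Title" or "NarrativeText") and its text, and `group_by_title` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str  # assumed labels, e.g. "Title", "NarrativeText", "ListItem"
    text: str


def group_by_title(elements: list[Element]) -> list[str]:
    """Start a new chunk at every Title element; keep the body text with its title."""
    sections: list[str] = []
    current: list[str] = []
    for el in elements:
        # A title closes the previous section, if one is in progress.
        if el.category == "Title" and current:
            sections.append("\n".join(current))
            current = []
        current.append(el.text)
    if current:
        sections.append("\n".join(current))
    return sections


doc = [
    Element("Title", "Intro"),
    Element("NarrativeText", "Hello."),
    Element("Title", "Methods"),
    Element("NarrativeText", "We did X."),
    Element("ListItem", "step 1"),
]
for section in group_by_title(doc):
    print(repr(section))
```

Chunking on element boundaries like this keeps headings attached to the text they introduce, which is where I'd expect it to beat plain character-based splitting, if it helps at all.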

TanGentleman commented 1 month ago

TBH, I haven't been loving it. High-quality document processing seems like something I would rather handle with external APIs, so I'm shelving this for now. If a really vital use case comes up where it has to be done locally, I'll get back to it then.