Open TanGentleman opened 2 months ago
Playing around with UnstructuredFileLoader, where it partitions the PDF into various elements, is probably the best way to really get precise with it. For now, I'm not sure it'll affect the quality of my outputs all that much, but I'll do some more testing with loading docs using different Unstructured loaders/params.
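For anyone following along, here's a rough sketch of what I mean by partitioning into elements. It assumes LangChain's `UnstructuredFileLoader` is importable from `langchain_community.document_loaders` (the import path depends on your LangChain version) and that each element's type shows up under the `category` metadata key. `load_elements` and `count_categories` are just hypothetical helper names for illustration, not anything from my repo.

```python
from collections import Counter


def load_elements(path):
    """Load a file as one Document per partitioned element (title, text, table, ...)."""
    # Imported lazily so count_categories() can be used without the dependency.
    # Assumption: this import path matches the installed LangChain version.
    from langchain_community.document_loaders import UnstructuredFileLoader

    # mode="elements" keeps each partitioned element as a separate Document
    # instead of joining everything into a single one.
    loader = UnstructuredFileLoader(path, mode="elements")
    return loader.load()


def count_categories(docs):
    """Count how many elements of each category were produced."""
    # Assumption: the element type is stored in metadata["category"].
    return Counter(doc.metadata.get("category", "Unknown") for doc in docs)
```

Running `count_categories(load_elements("some.pdf"))` with different loader params is a quick way to see whether the partitioning actually changes between settings.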
TBH, I haven't been loving it. High-quality document processing seems like something I'd rather handle with external APIs. Unless a really vital use case comes up where this has to be done locally, I'll leave it for now and get back to it then.
I want to do this alongside the migration to Unstructured. I'll figure out how much of a difference there is between my current implementation and something like spaCy for, say, a long speech in a .txt file.
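A rough sketch of how that comparison could look, assuming the thing being compared is sentence splitting. The regex splitter is only a stand-in baseline, not my current implementation, and `naive_sentences` / `spacy_sentences` are hypothetical names. The spaCy side uses a blank English pipeline with the rule-based `sentencizer` so no model download is needed; a full trained pipeline would be a different comparison.

```python
import re


def naive_sentences(text):
    """Baseline: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def spacy_sentences(text):
    """Split with spaCy's rule-based sentencizer (requires spaCy v3+)."""
    # Imported lazily so the baseline works without spaCy installed.
    import spacy

    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer")
    return [sent.text.strip() for sent in nlp(text).sents]


def compare(text):
    """Return both splits side by side for manual inspection."""
    return {"naive": naive_sentences(text), "spacy": spacy_sentences(text)}
```

Feeding the contents of a long speech .txt into `compare` and diffing the two lists should show where the splits disagree, which is probably more informative than just comparing counts.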