Unstructured-IO / unstructured-ingest

Apache License 2.0

Unnecessary requirements for pdf? #93

Open davidgilbertson opened 2 months ago

davidgilbertson commented 2 months ago

I'm trying the Serverless API because I couldn't get unstructured[pdf] to install (package clashes caused by it installing an old version of PyTorch).

The docs say that to use the API I should use unstructured-ingest, and this page says that if I want to convert a PDF I should run pip install "unstructured-ingest[pdf]". Half-expecting this to download the wrong PyTorch again (which takes ages, followed by more ages reinstalling the new one), I thought I'd check the requirements:

https://github.com/Unstructured-IO/unstructured-ingest/blob/main/requirements/local_partition/pdf.in

And it looks like that's just going to install unstructured[pdf], the thing I'm trying to avoid!
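For anyone who wants to make the same check from Python rather than by reading the .in file, here is a minimal sketch. It assumes the base package is already installed (without the extra) and reads its declared requirements through the standard library's importlib.metadata. The helper names are hypothetical, and the marker matching is a simple regex, not a full parser for dependency markers, so treat the result as a rough guide.

```python
import re
from importlib.metadata import PackageNotFoundError, requires

# Matches markers such as: extra == "pdf"  or  extra == 'pdf'
_EXTRA_MARKER = re.compile(r"""extra\s*==\s*["']([^"']+)["']""")


def requirements_for_extra(requirement_lines, extra):
    """Return the requirement strings that are gated behind the given extra."""
    selected = []
    for line in requirement_lines:
        # A requirement line looks like "name[opts]>=1.0; marker expression".
        requirement, _, marker = line.partition(";")
        match = _EXTRA_MARKER.search(marker)
        if match and match.group(1) == extra:
            selected.append(requirement.strip())
    return selected


def installed_extra_requirements(distribution, extra):
    """Look up an installed distribution and list what one of its extras adds."""
    try:
        lines = requires(distribution) or []
    except PackageNotFoundError:
        # The distribution is not installed, so there is nothing to inspect.
        return []
    return requirements_for_extra(lines, extra)


# Example call (output depends on the installed version, so none is shown):
# print(installed_extra_requirements("unstructured-ingest", "pdf"))
```

The sketch compares extra names literally, so a name written with an underscore will not match one written with a hyphen.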

So my question: why does this client library, which just calls APIs, need to install the whole gigantic unstructured package?

I tried the sample code without running this install (which breaks my whole environment) and it seems to work.

Some friendly new-user feedback: this is all very difficult! I have a funny feeling that the results are going to be impressive, but my gosh the developer experience is terrible so far.

potter-potter commented 2 months ago

@davidgilbertson Sorry you are having a tough time with Unstructured.

If you are using the Serverless API, you shouldn't need to run pip install "unstructured-ingest[pdf]", since you won't actually be processing those files locally.

Please try this Python code here and point it to your API key, API URL, local documents (.pdf) folder, and output directory. (You don't necessarily have to use the environment variables; you can just fill in the values directly to keep it simpler.)
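The linked sample is not reproduced in this thread. As a rough illustration of why no local PDF dependencies are needed, here is a standard-library-only sketch that uploads one PDF over HTTP and reads back JSON. It is not the unstructured-ingest pipeline code. The header name unstructured-api-key, the form field name files, and the environment variable names in the usage comment are assumptions based on my understanding of the hosted API, so check them against the current API docs before relying on them.

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_partition_request(pdf_path, api_url, api_key):
    """Build (but do not send) a multipart POST that carries a single PDF."""
    path = Path(pdf_path)
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumed form field name: "files".
        f'Content-Disposition: form-data; name="files"; filename="{path.name}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        path.read_bytes(),
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        # Assumed authentication header name.
        "unstructured-api-key": api_key,
        "Accept": "application/json",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(api_url, data=body, headers=headers, method="POST")


def partition_pdf(pdf_path, api_url, api_key):
    """Send the PDF to the remote service and return the parsed JSON response."""
    request = build_partition_request(pdf_path, api_url, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (environment variable names are assumptions):
# import os
# elements = partition_pdf("example.pdf",
#                          os.environ["UNSTRUCTURED_API_URL"],
#                          os.environ["UNSTRUCTURED_API_KEY"])
```

The only thing that happens locally is reading the file's bytes; all parsing is done by the remote service, which is why the heavy PDF extras are not required on the client.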

Feel free to tag me here if you still have an issue.