deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License
3.83k stars 585 forks

Extract information from bytes #300

Open asciidiego opened 4 years ago

asciidiego commented 4 years ago

I have downloaded a PDF, but it is not saved as a file yet. How can I use textract to extract the text without actually saving the file?

jpweytjens commented 4 years ago

What do you mean by "downloaded, but not saved as a file yet"?

Textract requires that you specify the path to the PDF file. So far I have only parsed files that have been saved locally. You might try some of the ideas here, but I don't completely understand what you're trying to do.

asciidiego commented 4 years ago

I get the PDFs from an HTTP response. So, with the body (as bytes), I should be able to extract the text from the bytes alone. I don't think it should be necessary to save the PDF as a file, parse it to extract the text, and then delete the created file, when the content was already in memory as a Python variable.

jpweytjens commented 4 years ago

Currently, textract does not support streams. See also #85, #97 and #99. Perhaps this might help you while we work on support for streams.

multinucliated commented 3 years ago

Any progress on byte streams (file.read())? Or can you suggest any other way around this?

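shzy2012 commented 3 years ago

import tempfile
import textract

# f is an open binary file-like object holding the PDF bytes
with tempfile.NamedTemporaryFile(delete=True) as temp:
    temp.write(f.read())
    temp.flush()  # make sure the bytes are on disk before textract reads the file
    context = textract.process(temp.name, encoding='utf-8', extension=".pdf")
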
uxtt2000 commented 1 year ago

That's the solution. Works like a charm, and it works in the cloud in a stateless function without any persistent filesystem access (only a writable temp directory is needed)! Thanks @shzy2012! @jpweytjens: Maybe put this workaround in the docs while streams are not yet supported, as it's really useful for cloud-based usage. Thanks.
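
For anyone who wants the temp-file workaround as a reusable helper, here is a minimal sketch. The function name `extract_text_from_bytes` and the `extract` parameter are hypothetical and not part of textract; `extract` only exists so a different extractor (or a stub) can be swapped in. The sketch assumes textract picks its parser from the file suffix, and it uses `delete=False` with manual cleanup so the file is closed before the extractor opens it, which may matter on Windows.

```python
import os
import tempfile


def extract_text_from_bytes(data, suffix=".pdf", extract=None):
    """Write `data` to a temporary file and run an extractor on its path.

    `extract` is a hypothetical hook: any callable that takes a file path
    and returns the extracted text. When omitted, textract.process is used.
    """
    if extract is None:
        import textract  # imported lazily so the helper loads without it
        extract = textract.process

    # The suffix gives the temp file an extension the extractor can recognise.
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(data)
        tmp.close()  # flush and close before the extractor reads the file
        return extract(tmp.name)
    finally:
        tmp.close()  # safe to call twice
        os.unlink(tmp.name)  # always remove the temporary file
```

With an HTTP body already in memory (for example a `bytes` variable named `body`, assumed here), usage would be `text = extract_text_from_bytes(body)`. A writable temp directory is still required, so this avoids persistent storage rather than the filesystem altogether.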