Unstructured-IO / unstructured-api

Apache License 2.0

PdfStreamError: Stream has ended unexpectedly #266

Closed: sentry-io[bot] closed this issue 1 year ago

sentry-io[bot] commented 1 year ago

API users are hitting this error on certain files.

PdfStreamError: Stream has ended unexpectedly
  File "prepline_general/api/general.py", line 686, in pipeline_1
    list(response_generator(is_multipart=False))[0] if len(files) == 1 else join_responses(list(response_generator(is_multipart=False)))
  File "prepline_general/api/general.py", line 607, in response_generator
    response = pipeline_api(
  File "prepline_general/api/general.py", line 278, in pipeline_api
    pdf = PdfReader(file)
  File "pypdf/_reader.py", line 332, in __init__
    self.read(stream)
  File "pypdf/_reader.py", line 1554, in read
    self._find_eof_marker(stream)
  File "pypdf/_reader.py", line 1625, in _find_eof_marker
    line = read_previous_line(stream)
  File "pypdf/_utils.py", line 268, in read_previous_line
    raise PdfStreamError(STREAM_TRUNCATED_PREMATURELY)

awalker4 commented 1 year ago

I verified that this error can happen when a non-PDF file is sent with its content type set to application/pdf. We don't verify the file type when a content type is provided, so PdfReader() fails on the non-PDF bytes.

import requests

filename = "/path/to/jpeg/file"

# Upload a JPEG but declare it as a PDF to reproduce the error.
with open(filename, "rb") as f:
    res = requests.post(
        "http://localhost:8000/general/v0/general",
        files={"files": (filename, f, "application/pdf")},
    )

print(res.text)

# {"detail":"Stream has ended unexpectedly"}

pypdf logs these warnings:

WARNING:pypdf._reader:invalid pdf header: b'\xff\xd8\xff\xe1\x9b'
WARNING:pypdf._reader:EOF marker not found
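
The header bytes in the first warning (`\xff\xd8\xff`) are the JPEG start-of-image signature, not the `%PDF-` marker a PDF begins with. One way to confirm the file type before calling PdfReader() is a magic-byte check, sketched below. `looks_like_pdf` and the 1024-byte search window are assumptions made for illustration; neither is part of unstructured-api or pypdf.

```python
import io

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(stream, window=1024):
    """Return True if the PDF header marker appears near the start of stream.

    Hypothetical helper. The window is an assumption: some PDFs carry a few
    junk bytes before the header, so this searches the first `window` bytes
    rather than requiring the marker at offset 0.
    """
    pos = stream.tell()
    try:
        head = stream.read(window)
    finally:
        # Restore the position so the caller can still parse the stream.
        stream.seek(pos)
    return PDF_MAGIC in head


# A JPEG declared as application/pdf would be rejected up front:
jpeg_upload = io.BytesIO(b"\xff\xd8\xff\xe1\x9b" + b"\x00" * 32)
pdf_upload = io.BytesIO(b"%PDF-1.7\n" + b"\x00" * 32)
print(looks_like_pdf(jpeg_upload), looks_like_pdf(pdf_upload))
```

A check like this only filters out obviously wrong uploads. A truncated or corrupt file that does start with `%PDF-` would still reach the parser, so the parser's errors need handling as well.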

cragwolfe commented 1 year ago

So the right thing to do here is to catch that error and return a 400 with a friendly message?

awalker4 commented 1 year ago

Yep! The bug squash is off to a good start!
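
The fix agreed on above could look roughly like the following framework-agnostic sketch. It is not the change that was actually merged. `BadRequest` is a hypothetical stand-in for the web framework's own HTTP 400 exception, and `open_pdf_or_400` is an illustrative name. The reader and the error types are passed in as arguments, so the sketch runs without pypdf installed.

```python
import io


class BadRequest(Exception):
    """Hypothetical stand-in for a framework's HTTP 400 exception."""

    status_code = 400

    def __init__(self, detail):
        super().__init__(detail)
        self.detail = detail


def open_pdf_or_400(file, reader, read_errors, filename="file"):
    """Call reader(file); turn parse failures into a 400 with a clear message.

    reader      -- callable that parses the stream (PdfReader in the API)
    read_errors -- tuple of exception types that mean "not a readable PDF"
    """
    try:
        return reader(file)
    except read_errors as exc:
        # Chain the original exception so it still shows up in server logs.
        raise BadRequest(f"{filename} does not appear to be a valid PDF") from exc


# Demo with a fake reader that fails the way the traceback above does.
class FakeStreamError(Exception):
    pass


def fake_reader(stream):
    raise FakeStreamError("Stream has ended unexpectedly")


try:
    open_pdf_or_400(io.BytesIO(b"\xff\xd8\xff"), fake_reader, (FakeStreamError,), "photo.jpg")
except BadRequest as err:
    print(err.status_code, err.detail)
```

In the API itself, `reader` would be pypdf's `PdfReader` and `read_errors` would be `(pypdf.errors.PdfReadError,)`, assuming `PdfStreamError` derives from `PdfReadError` as it does in current pypdf releases.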