Unstructured-IO / unstructured-api

Apache License 2.0

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte #267

Open: sentry-io[bot] opened this issue 1 year ago

sentry-io[bot] commented 1 year ago
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte
(23 additional frame(s) were not displayed)
...
  File "prepline_general/api/general.py", line 686, in pipeline_1
    list(response_generator(is_multipart=False))[0] if len(files) == 1 else join_responses(list(response_generator(is_multipart=False)))
  File "prepline_general/api/general.py", line 607, in response_generator
    response = pipeline_api(
  File "prepline_general/api/general.py", line 418, in pipeline_api
    raise e
  File "prepline_general/api/general.py", line 396, in pipeline_api
    elements = partition(
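
The message in this traceback can be reproduced without the API. The sketch below is only an illustration under an assumption: the sample bytes are made up (Latin-1 text with "Ä" at offset 10), not the file that triggered the Sentry report. The helper name `first_decode_error` is hypothetical.

```python
from typing import Optional, Tuple


def first_decode_error(data: bytes) -> Optional[Tuple[int, int, str]]:
    """Return (byte value, position, reason) for the first UTF-8 decode error, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return data[err.start], err.start, err.reason
    return None


# Assumed sample: "Ä" is the single byte 0xC4 in Latin-1. UTF-8 reads 0xC4 as the
# start of a two-byte sequence, and the ASCII byte after it is not a continuation byte.
sample = "0123456789Äpfel".encode("latin-1")

bad_byte, position, reason = first_decode_error(sample)
print(f"byte {bad_byte:#x} in position {position}: {reason}")
# prints: byte 0xc4 in position 10: invalid continuation byte
```

If this matches what happens on the server, the input was most likely text in a single-byte encoding (or a damaged file) that was decoded as UTF-8.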

Krishna2709 commented 3 months ago

Hi, I am facing the same error. Please let me know if you resolved it.

awalker4 commented 2 months ago

Hi there! Do you have a file that reproduces the issue that you're able to share?

andrePankraz commented 2 months ago

Same problem via unstructured-python-client:

Failed to process a request due to API server error with status code 500. Attempting retry number 1 after sleep.
unstructured-client: 36 - log_retries()] Server message - {"detail":"'utf-8' codec can't decode byte 0xff in position 0: invalid start byte"}

If I try to send a file with, e.g., UTF-16 encoding, it does not work. The encoding parameter is set correctly, as can be seen here in unstructured-client/general.py:

req = client.prepare_request(requests_http.Request('POST', url, params=query_params, data=data, files=form, headers=headers))

I'm not sure if the issue is with the unstructured-python-client not encoding the form-post correctly or setting the accept header correctly, or if it's a problem with the server API.
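
For context, the "byte 0xff in position 0: invalid start byte" detail is what Python reports when bytes that begin with a UTF-16 little-endian byte order mark are decoded as UTF-8. The sketch below shows this with made-up CSV content (an assumption, not the file from this comment). The helper name `describe_utf8_failure` is hypothetical.

```python
import codecs

# Hypothetical stand-in for a UTF-16 CSV file: a little-endian BOM (0xFF 0xFE)
# followed by the UTF-16-LE encoded text.
text = "name;city\nJürgen;Köln\n"
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def describe_utf8_failure(data: bytes) -> str:
    """Describe why `data` fails to decode as UTF-8, or report success."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return f"byte {data[err.start]:#x} in position {err.start}: {err.reason}"
    return "decodes as utf-8"


print(describe_utf8_failure(utf16_bytes))
# prints: byte 0xff in position 0: invalid start byte

# The same bytes decode cleanly once the right codec is used; the BOM is consumed.
assert utf16_bytes.decode("utf-16") == text
```

This suggests that the error comes from a UTF-8 decode of the raw upload, though it does not say which component performed that decode.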

awalker4 commented 2 months ago

Hi @andrePankraz, can you clarify how you're making the API call? The server does take an encoding param (shown in the table here) that defaults to utf-8. I suspect this file will work if you send encoding='utf-16'.

andrePankraz commented 2 months ago

Have you really tested it with a UTF-16 file?

curl -X 'POST' \
    'http://ai1.dev.init:8004/general/v0/general' \
    -H 'accept: application/json' \
    -H 'Content-Type: multipart/form-data' \
    -F 'files=@data/documents/CSV_UTF_16.csv' \
    -F 'strategy=hi_res' \
    -F 'languages=deu' \
    -F 'encoding=utf-16'

{"detail":"'utf-8' codec can't decode byte 0xff in position 0: invalid start byte"}
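
One way to separate a bad file from a server that ignores the parameter is to check locally that the bytes decode with the encoding being sent. This is a minimal sketch, not part of the API or the client. The helper name `decodes_with` is hypothetical, and the in-memory sample stands in for data/documents/CSV_UTF_16.csv.

```python
def decodes_with(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes without error using `encoding`."""
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers an encoding name that Python does not know.
        return False
    return True


# Assumed sample content; with a real file, read it in binary mode instead:
#   data = open("data/documents/CSV_UTF_16.csv", "rb").read()
data = "Größe;Wert\n1;2\n".encode("utf-16")

for candidate in ("utf-8", "utf-16"):
    print(candidate, decodes_with(data, candidate))
```

If the file decodes with utf-16 locally while the response still names the 'utf-8' codec, that would suggest the encoding value is not reaching the decode step on the server for this request.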

Krishna2709 commented 2 months ago

> Hi there! Do you have a file that reproduces the issue that you can share?

Hey @awalker4, my file was corrupted while I was formatting it. There's no issue with the library.