freelawproject / doctor

A microservice for document conversion at scale
https://free.law/projects/doctor
BSD 2-Clause "Simplified" License

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 8811: invalid start byte #181

Open sentry-io[bot] opened 7 months ago

sentry-io[bot] commented 7 months ago

Sentry Issue: DOCTOR-E

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 8811: invalid start byte
  File "doctor/views.py", line 101, in extract_doc_content
    content, err, returncode = extract_from_html(fp)
  File "doctor/tasks.py", line 339, in extract_from_html
    content = f.read()

This is linked to the courtlistener Sentry issue https://freelawproject.sentry.io/issues/5017932231/?project=5257254; the events were registered at almost the same time.

Also, this error is one of the causes of freelawproject/courtlistener#3811.
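
For context, the traceback above suggests the HTML file is being decoded as UTF-8 while it contains a byte (0x97) that is not valid there. In Windows-1252, 0x97 is an em dash, so the upload may be a cp1252-encoded page, though that is an assumption and not confirmed by the report. A minimal sketch of one possible mitigation follows: read the raw bytes and try a short list of encodings before falling back to lossy replacement. The helper name `decode_with_fallback` and the encoding list are hypothetical and are not part of doctor's code.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")) -> str:
    """Hypothetical helper: decode bytes by trying several encodings in order.

    Assumption: inputs are mostly UTF-8, with occasional Windows-1252 pages.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            # This encoding cannot represent the bytes; try the next one.
            continue
    # Last resort: never raise, replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # 0x97 is an invalid start byte in UTF-8 but an em dash in cp1252.
    sample = b"<p>Smith \x97 Jones</p>"
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("strict UTF-8 failed:", exc)
    print(decode_with_fallback(sample))
```

With something like this, the file would be opened in binary mode (`open(path, "rb")`) and the result of `f.read()` passed to the helper. Detecting the encoding from the HTML `<meta charset>` declaration or a detection library would be more accurate; the fixed list here only illustrates the idea.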

Filed by @grossir