ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

not indexing any PDF files (PDFStreamEngine stream of error) #1591

Open jafooool opened 1 month ago

jafooool commented 1 month ago

Describe the bug

Added a trove of PDF files ... launched indexing ... and got only:

Error writing: org.apache.tika.sax.TaggedSAXException: Error writing: org.xml.sax.SAXException: Error writing: java.io.IOException: Read end dead
2024-10-08 10:35:10,354 [Apache Tika: XXXXX.pdf] WARN PDFStreamEngine - org.apache.tika.sax.TaggedSAXException: Error writing: org.apache.tika.sax.TaggedSAXException: Error writing: org.xml.sax.SAXException: Error writing: java.io.IOException: Read end dead

etc. No PDF files get indexed.

Desktop (please complete the following information):

Latest available version of Datashare

bamthomas commented 1 month ago

Which datashare version?

I tried with 18.3.0 (the latest, released yesterday) and it works fine with PDFs (with and without OCR).

I've already seen this kind of error when there is a low-level issue with file access or when the PDF files are badly encoded.

Are you sure that your PDF files are not corrupted? Or that the access to the filesystem is OK?
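
One rough way to check both questions before re-running the indexing is a small script like the one below. It is only a sketch and not part of Datashare: the function names check_pdf and scan are made up for illustration, and the test is a heuristic (the file is readable, starts with a %PDF- header and ends with a %%EOF marker). A file that passes can still be malformed inside, and a file that fails is merely suspicious.

```python
import os
from pathlib import Path


def check_pdf(path):
    """Return a list of problems found for one file (empty list: none found).

    Hypothetical helper, heuristic only: it does not parse the PDF.
    """
    p = Path(path)
    if not p.is_file():
        return ["not a regular file"]
    if not os.access(p, os.R_OK):
        return ["not readable by the current user"]
    try:
        with open(p, "rb") as f:
            head = f.read(1024)           # the header should be near the start
            f.seek(0, os.SEEK_END)
            size = f.tell()
            f.seek(max(0, size - 1024))   # the EOF marker should be near the end
            tail = f.read()
    except OSError as exc:
        return [f"read error: {exc}"]

    problems = []
    if size == 0:
        problems.append("empty file")
    if b"%PDF-" not in head:
        problems.append("missing %PDF- header")
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF trailer")
    return problems


def scan(root):
    """Map each suspicious *.pdf under root to its list of problems."""
    report = {}
    for p in sorted(Path(root).rglob("*.pdf")):
        problems = check_pdf(p)
        if problems:
            report[str(p)] = problems
    return report
```

Calling scan() on the folder that was added to Datashare returns only the files that look suspicious. An empty result rules out the most obvious access and truncation problems, though not every form of corruption. Note that the *.pdf pattern is case-sensitive on Linux, so files named .PDF would need a second pass.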