ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

not indexing any PDF files (PDFStreamEngine stream of error) #1591

Open jafooool opened 2 weeks ago

jafooool commented 2 weeks ago

Describe the bug: I added a trove of PDF files, launched indexing, and got only errors like these:

Error writing: org.apache.tika.sax.TaggedSAXException: Error writing: org.xml.sax.SAXException: Error writing: java.io.IOException: Read end dead
2024-10-08 10:35:10,354 [Apache Tika: XXXXX.pdf] WARN PDFStreamEngine - org.apache.tika.sax.TaggedSAXException: Error writing: org.apache.tika.sax.TaggedSAXException: Error writing: org.xml.sax.SAXException: Error writing: java.io.IOException: Read end dead

and so on. No PDF files get indexed.

Desktop (please complete the following information):

Latest available version of Datashare

bamthomas commented 1 week ago

Which datashare version?

I tried with 18.3.0 (the latest, released yesterday) and it works fine with PDFs, both with and without OCR.

I've already seen this kind of error when there is a low-level issue with file access, or when the PDF files are badly encoded.

Are you sure that your PDF files are not corrupted, and that access to the filesystem is OK?
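
One rough way to check both points before re-running the indexing is a small script like the one below. It is only a sketch and not part of Datashare or Tika: the `check_pdf` and `scan` names are hypothetical, and the checks are simple heuristics (the file can be opened and read, it is not empty, it has a `%PDF-` marker near the start and a `%%EOF` marker near the end). A file that passes can still be damaged internally, and a file that fails may still open in a tolerant viewer.

```python
import os
from pathlib import Path


def check_pdf(path):
    """Return 'ok' or a short description of the first problem found.

    Heuristic only: checks readability plus the PDF start and end markers.
    """
    try:
        with open(path, "rb") as f:
            head = f.read(1024)              # header should appear early
            f.seek(0, os.SEEK_END)
            size = f.tell()
            f.seek(max(0, size - 1024))      # trailer should appear late
            tail = f.read()
    except OSError as e:
        # Covers permission errors, broken mounts, vanished files, etc.
        return f"unreadable: {e}"
    if size == 0:
        return "empty file"
    if b"%PDF-" not in head:
        return "missing %PDF- header"
    if b"%%EOF" not in tail:
        return "missing %%EOF marker (possibly truncated)"
    return "ok"


def scan(root):
    """Check every *.pdf under root; return {path: status}."""
    results = {}
    for p in sorted(Path(root).rglob("*")):
        if p.is_file() and p.suffix.lower() == ".pdf":
            results[str(p)] = check_pdf(p)
    return results
```

Calling `scan()` on the folder that Datashare indexes and printing every entry whose status is not `ok` shows whether the problem affects all files (which points toward filesystem access) or only some of them (which points toward the files themselves).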