chrismattmann / imagecat

ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest tens of millions of files (images, but could be extended to other files) in place, and to extract metadata and OCR information from those files/images using Tika and Tesseract OCR.

Some Images fail #8

Closed quasiben closed 9 years ago

quasiben commented 9 years ago

Images are partially processed but not parsed correctly:

INFO: on.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.jpeg.JpegParser@5c0bae4a
OUTPUT:         at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
OUTPUT:         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
OUTPUT:         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
OUTPUT:         at org.apache.solr.core.RequestHandler
Apr 15, 2015 9:18:29 PM org.apache.oodt.commons.io.LoggerOutputStream flush

Should we warn users that this happens, or fix it?

chrismattmann commented 9 years ago

I don't have a good answer to this one, @quasiben. This is an issue in the underlying Tika parser, specifically the JpegParser. Can you report it upstream to Tika along with an example of the files it fails on? Or is it a questionable image?
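
One rough way to triage whether a failing file is a "questionable" image before reporting it upstream is to check the JPEG start-of-image (`FF D8`) and end-of-image (`FF D9`) markers. The sketch below uses only the JDK and does not call Tika or Solr. The class name `JpegTriage` and the helper `looksLikeCompleteJpeg` are hypothetical names introduced for illustration. It is a heuristic only: some valid JPEGs carry trailing bytes after the end marker, and a file that passes may still fail to parse.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical triage helper, not part of ImageCat, Tika, or Solr.
class JpegTriage {

    // Returns true when the bytes start with the JPEG SOI marker (FF D8)
    // and end with the EOI marker (FF D9). A false result suggests a
    // truncated or non-JPEG file, but is not proof of corruption.
    static boolean looksLikeCompleteJpeg(byte[] data) {
        if (data == null || data.length < 4) {
            return false;
        }
        int n = data.length;
        boolean startsWithSoi = (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xD8;
        boolean endsWithEoi = (data[n - 2] & 0xFF) == 0xFF && (data[n - 1] & 0xFF) == 0xD9;
        return startsWithSoi && endsWithEoi;
    }

    // Prints one line per file path given on the command line.
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            byte[] data = Files.readAllBytes(Paths.get(arg));
            String verdict = looksLikeCompleteJpeg(data) ? "markers ok" : "questionable";
            System.out.println(arg + "\t" + verdict);
        }
    }
}
```

Running `java JpegTriage` with the paths of the failing files as arguments would separate files that look truncated from files whose markers look intact. The latter are probably the more useful examples to attach to an upstream Tika report.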