buda-base / drs-deposit

Harvard DRS Deposit base

Tika Errors building batch #25

jimk-bdrc closed 6 years ago

jimk-bdrc commented 6 years ago

Many errors building batch W1KG17209:

    6572 [Tika] ERROR edu.harvard.hul.ois.fits.tools.ToolBase - Caught exception running tool: TikaTool
    edu.harvard.hul.ois.fits.exceptions.FitsToolException: Exception reading metadata (Error on line 1: An invalid XML character (Unicode: 0x1e) was found in the element content of the document.)
        at edu.harvard.hul.ois.fits.tools.tika.TikaTool.buildRawData(TikaTool.java:525)
        at edu.harvard.hul.ois.fits.tools.tika.TikaTool.extractInfo(TikaTool.java:448)
        at edu.harvard.hul.ois.fits.tools.ToolBase.run(ToolBase.java:268)
        at java.lang.Thread.run(Thread.java:748)
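
For context on the message itself: XML 1.0 does not allow control characters other than tab, line feed, and carriage return, so a 0x1e (record separator) anywhere in the text Tika emits will fail parsing. The sketch below is not part of FITS or Tika; it only illustrates the rule. The function names are hypothetical, and the idea that the character comes from embedded image metadata is an assumption, not something the trace confirms.

```python
import re

# Matches anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Return (offset, hex code point) for each character XML 1.0 forbids."""
    return [(m.start(), hex(ord(m.group()))) for m in _INVALID_XML_CHARS.finditer(text)]


def strip_invalid_xml_chars(text):
    """Return text with the forbidden characters removed."""
    return _INVALID_XML_CHARS.sub("", text)


# Hypothetical metadata value containing a record separator (0x1e).
sample = "Caption\x1ewith a record separator"
print(find_invalid_xml_chars(sample))
print(strip_invalid_xml_chars(sample))
```

Running something like `find_invalid_xml_chars` over extracted metadata could help confirm which field carries the bad character before deciding whether to exclude the file type.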

First, these errors are non-fatal. The batch is still produced when they occur.

I tried Vitaly's strategy of excluding the jpg and jpeg file types from the Tika tool in fits.xml, and that:

  1. Removed the errors
  2. Produced a descriptor.xml for the images that was identical to the one produced when Tika examined the files and threw the fault.
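
A minimal sketch of both steps, for anyone repeating the check. It assumes the Tika tool is declared in fits.xml as a `tool` element with a comma-separated `exclude-exts` attribute; verify that against your own fits.xml before relying on it. The helper names and the inline sample configuration are hypothetical, and `same_xml` is just one way to compare two descriptor.xml files while ignoring whitespace differences.

```python
import xml.etree.ElementTree as ET

TIKA_CLASS = "edu.harvard.hul.ois.fits.tools.tika.TikaTool"


def exclude_extensions(fits_xml_text, tool_class, extensions):
    """Add extensions to the exclude-exts list of the matching tool element."""
    root = ET.fromstring(fits_xml_text)
    for tool in root.iter("tool"):
        if tool.get("class") == tool_class:
            current = [e for e in tool.get("exclude-exts", "").split(",") if e]
            for ext in extensions:
                if ext not in current:
                    current.append(ext)
            tool.set("exclude-exts", ",".join(current))
    return ET.tostring(root, encoding="unicode")


def same_xml(a_text, b_text):
    """Compare two XML documents in canonical form (requires Python 3.8+)."""
    return (ET.canonicalize(a_text, strip_text=True)
            == ET.canonicalize(b_text, strip_text=True))


# Hypothetical, trimmed-down stand-in for a real fits.xml.
SAMPLE = """<fits_configuration>
  <tools>
    <tool class="edu.harvard.hul.ois.fits.tools.tika.TikaTool" exclude-exts="zip" />
  </tools>
</fits_configuration>"""

updated = exclude_extensions(SAMPLE, TIKA_CLASS, ["jpg", "jpeg"])
print(updated)
```

To check the second point, read the descriptor.xml from each run and pass both strings to `same_xml`; a plain byte comparison works too if the files are expected to match exactly.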

jimk-bdrc commented 6 years ago

16 Jan: Asked Vitaly whether this was OK.