asepaprianto / crawler4j

Automatically exported from code.google.com/p/crawler4j

TikaException is thrown while crawling several PDFs in a row #279

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Add this seed to the controller:
        controller.addSeed("http://www.cso.ie/en/releasesandpublications/statisticalyearbookofireland/statisticalyearbookofireland2013edition/");

Enable binary content parsing in the crawl configuration.
Do not filter out PDF links in the crawler.
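
For the "do not filter PDFs" step, a minimal sketch of what such a filter might look like, assuming the crawler rejects URLs by file extension with a regular expression. The class name `PdfFilterSketch`, the method `shouldVisit(String)`, and the extension list are hypothetical and not taken from crawler4j; the only point is that `pdf` is absent from the skipped extensions, so PDF links reach the binary parser.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j: it mirrors the kind of
// extension filter a crawler's visit check often applies.
class PdfFilterSketch {

    // Extensions to skip (an assumed list). "pdf" is deliberately absent,
    // so PDF links are not filtered out.
    private static final Pattern SKIPPED_EXTENSIONS = Pattern.compile(
            ".*\\.(css|js|gif|jpe?g|png|mp3|mp4|zip|gz)$");

    // Returns true when the URL does not end in one of the skipped extensions.
    static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        return !SKIPPED_EXTENSIONS.matcher(href).matches();
    }

    public static void main(String[] args) {
        // The PDF from the report below is allowed through by this filter.
        String pdf = "http://www.cso.ie/en/media/csoie/releasespublications/"
                + "documents/statisticalyearbook/2013/tableofcontents.pdf";
        System.out.println(pdf + " -> " + shouldVisit(pdf));
    }
}
```

With a filter like this in place, PDF links found under the seed are fetched and handed to the binary parser, which is where the exception shows up.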

You will see the following exception:
WARN [Crawler 1] Parsing error of: http://www.cso.ie/en/media/csoie/releasespublications/documents/statisticalyearbook/2013/tableofcontents.pdf
ERROR [Crawler 1] Error parsing file
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1bd14960
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) ~[tika-core-1.5.jar:na]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ~[tika-core-1.5.jar:na]
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) ~[tika-core-1.5.jar:na]
    at edu.uci.ics.crawler4j.parser.BinaryParseData.setBinaryContent(BinaryParseData.java:64) ~[classes/:na]
    at edu.uci.ics.crawler4j.parser.Parser.parse(Parser.java:62) [classes/:na]
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:354) [classes/:na]
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:255) [classes/:na]
    at java.lang.Thread.run(Thread.java:722) [na:1.7.0_05]
Caused by: org.apache.tika.metadata.PropertyTypeException: Composite Properties must not include other Composite Properties as either Primary or Secondary
    at org.apache.tika.metadata.Metadata.add(Metadata.java:336) ~[tika-core-1.5.jar:na]
    at org.apache.tika.parser.pdf.PDFParser.addMetadata(PDFParser.java:199) ~[tika-parsers-1.5.jar:na]
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:160) ~[tika-parsers-1.5.jar:na]
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:142) ~[tika-parsers-1.5.jar:na]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242) ~[tika-core-1.5.jar:na]

Original issue reported on code.google.com by avrah...@gmail.com on 17 Aug 2014 at 4:08

GoogleCodeExporter commented 9 years ago
Fixed in revision 9a9b6846f3a5.

Original comment by avrah...@gmail.com on 17 Aug 2014 at 4:12