jetnet closed this issue 8 years ago
The issue is as you suspected: there are too many nested elements for the maximum supported by Tika. I increased the maximum in a copy of the faulty class until a more formal fix is provided by the Tika team (Tika currently does not make this maximum configurable).
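For anyone curious what such a nesting limit looks like in practice, here is a minimal sketch of a SAX handler that aborts parsing once elements nest too deeply. It uses only the JDK's SAX API; the class name `NestingDepthGuard`, the `withinLimit` helper, and the limits used in `main` are hypothetical and are not taken from Tika or the Importer code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: not Tika's or Norconex's actual class.
public class NestingDepthGuard extends DefaultHandler {
    private final int maxDepth; // assumed limit, chosen by the caller
    private int depth = 0;      // current element nesting depth

    public NestingDepthGuard(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        depth++;
        if (depth > maxDepth) {
            // Abort the parse as soon as the limit is exceeded.
            throw new SAXException(
                "Nesting depth " + depth + " exceeds limit " + maxDepth);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    /** Returns true if the XML parses and never nests deeper than maxDepth. */
    public static boolean withinLimit(String xml, int maxDepth) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new NestingDepthGuard(maxDepth));
            return true;
        } catch (SAXException e) {
            // Thrown by the guard above, or by the parser on malformed XML.
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><b><c/></b></a>"; // three levels of nesting
        System.out.println(withinLimit(xml, 3)); // true
        System.out.println(withinLimit(xml, 2)); // false
    }
}
```

With a hard-coded limit, a document that legitimately nests deeper fails the same way a malicious one would, which is why raising the maximum was the workaround here.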
To try the fix, can you replace the norconex-importer-[VERSION].jar in your installation with the one from the latest Importer snapshot release?
I updated the corresponding Tika ticket with this issue: https://issues.apache.org/jira/browse/TIKA-741
I think this is an issue with Norconex's EnhancedPDF2XHTML class. See TIKA-741 for the recommended modification. Give that a try with the max set to 100 and let us know if you're good to go.
Yes, I just tested removing those lines from our code, and I hit the zip bomb exception.
That did it, thanks! I committed the fix in our Importer module and will create a new release of Importer a bit later.
I just made a new HTTP Collector snapshot release with the updated importer. @jetnet, please give it a try and confirm.
Seems to be working with norconex-collector-http-2.4.0-20160223.174916-28.zip
Thanks!
I'm going to start a full crawl to check whether all PDFs can now be parsed.
2.4.0 has now been officially released with this fix. Please create a new ticket if the issue persists.
hi! some PDFs still cannot be parsed:
Is it something that can be configured? The number of nested XML elements, or the ratio of input to output bytes? Thank you!
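To make the second limit in the question concrete, here is a minimal sketch of an input/output ratio check of the kind used to flag "zip bomb" style documents. The class name, method name, and the threshold values in `main` are hypothetical illustrations; this is not Tika's actual implementation and says nothing about whether these limits are configurable in a given release.

```java
// Hypothetical illustration only: not Tika's actual implementation.
public class ExpansionRatioGuard {

    /**
     * Flags output as suspicious when it is both larger than a minimum
     * threshold and more than maxRatio times the size of the input.
     *
     * @param inputBytes  bytes read from the source document
     * @param outputChars characters produced by the parser so far
     * @param threshold   assumed minimum output size before the ratio applies
     * @param maxRatio    assumed maximum allowed output/input ratio
     */
    public static boolean looksLikeBomb(long inputBytes, long outputChars,
                                        long threshold, long maxRatio) {
        return outputChars > threshold
            && outputChars > inputBytes * maxRatio;
    }

    public static void main(String[] args) {
        // 1 KB of input expanding to 2M characters: flagged.
        System.out.println(looksLikeBomb(1_000, 2_000_000, 1_000_000, 100)); // true
        // 1 MB of input expanding to 2M characters: within the ratio.
        System.out.println(looksLikeBomb(1_000_000, 2_000_000, 1_000_000, 100)); // false
    }
}
```

The threshold keeps small documents from tripping the check, since a tiny input can expand by a large factor and still produce harmless output.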