Norconex / importer

Norconex Importer is a Java library and command-line application that parses files of any format (HTML, PDF, Word, etc.) and extracts their content as plain text. It also lets you manipulate the extracted text in any way before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Bump tika.version from 1.22 to 1.23 in /norconex-importer #107

Closed dependabot[bot] closed 4 years ago

dependabot[bot] commented 4 years ago

Bumps tika.version from 1.22 to 1.23.
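For context, a bump titled after `tika.version` usually means a single Maven property is shared by several dependencies. The fragment below is a minimal sketch of what such a change might look like in `norconex-importer/pom.xml`. The layout is an assumption for illustration, not copied from the repository; only the property name, the three artifact names, and the version numbers come from this PR.

```xml
<!-- Hypothetical excerpt of norconex-importer/pom.xml (layout assumed). -->
<properties>
  <!-- The one-line change this PR proposes: 1.22 becomes 1.23. -->
  <tika.version>1.23</tika.version>
</properties>

<dependencies>
  <!-- Each Tika artifact reads the shared property, so all three move together. -->
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>${tika.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>${tika.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-translate</artifactId>
    <version>${tika.version}</version>
  </dependency>
</dependencies>
```

If the project is set up this way, it would explain why the PR lists three updates with the same changelog and commit range.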

Updates tika-core from 1.22 to 1.23

Changelog

*Sourced from [tika-core's changelog](https://github.com/apache/tika/blob/master/CHANGES.txt).*

> Release 2.0.0 - ???
>
> BREAKING CHANGES in 2.0.0
>
> * Remove deprecated Metadata keys/properties (TIKA-1974).
>
> Other changes
>
> Release 1.24 - ???
>
> * Fix bug in ASM parser configuration (TIKA-2992).
> * Upgrade to java-libpst 0.9.3 (TIKA-2546).
> * Fixed XLIFF12Parser failures with ToXMLHandler (TIKA-3014).
>
> Release 1.23 - 12/02/2019
>
> * NOTE: The PDFParser now relies on OCRDPI to render page images when users configure OCR on rendered page images. This will have the effect of increasing rendered image size (TIKA-2624).
> * NOTE: tika-server no longer returns 415 for file types for which there is no parser.
> * Fix bug in AUTO OCR strategy in the PDFParser (TIKA-3002).
> * Fix incorrect height and width metadata extraction from JPEG images (TIKA-2630).
> * Upgrade to POI 4.1.1 (TIKA-2851).
> * Upgrade to PDFBox 2.0.17 (TIKA-2951).
> * Ensure that the PDFParser respects custom configuration of Tesseract from tika-config.xml via Eric Pugh (TIKA-2970).
> * Add parser for XLIFF v1.2 files (TIKA-2975).
> * Add mime type detection support for WebAssembly (TIKA-2894), HEIF / HEIC images (TIKA-2942), Digilite FDF (TIKA-2988); and xml-root detection for XFDF (TIKA-2990) and XDP (TIKA-2989).
> * Add an XLZ Parser (TIKA-2976).
> * Fix deadlock with ForkParser when InputStream throws IOException (TIKA-2892).
>
> Release 1.22 - 07/29/2019
>
> * NOTE: tika-server no longer hard-codes the HtmlParser to handle XML files (TIKA-2910). Users must now configure that behavior via a tika-config.xml file.
>
> ... (truncated)
Commits

- [`79f6c6c`](https://github.com/apache/tika/commit/79f6c6c604d06780bd4d0c9fc4a8a95bee261a82) [maven-release-plugin] prepare release 1.23-rc2
- [`a8d20dd`](https://github.com/apache/tika/commit/a8d20dd33ead3eb37447f20dc27b5a4810db6010) roll back in preparation for 1.23-rc2
- [`4eb73d0`](https://github.com/apache/tika/commit/4eb73d02888fa36d00305f47d3c70f8d9f3d9b48) TIKA-2630 -- add defensive null check and fix "if (...width)" to "if (...heig...
- [`3c3214d`](https://github.com/apache/tika/commit/3c3214d05fcb190c169c2d5424c6f9f1d9b1266c) update changes for 1.23-rc2
- [`5920689`](https://github.com/apache/tika/commit/5920689ce217083e3e51becee4899e1c1059e030) improve logging and error handling in TikaServerIntegrationTest
- [`42676a6`](https://github.com/apache/tika/commit/42676a6b2e73306458643603381c46743801a544) improve logging and error reporting in TikaServerIntegrationTest
- [`85cd36d`](https://github.com/apache/tika/commit/85cd36d5f16e27bd5ec222ff18dcb3035c5f502d) TIKA-2925 -- improve documentation to explain decision not
- [`f67a834`](https://github.com/apache/tika/commit/f67a83444036d4fb5b23e9000f06434bfb58eefc) TIKA-3002 -- fix bug in OCR AUTO mode
- [`b86eb05`](https://github.com/apache/tika/commit/b86eb05a2e379fa61e7bb46e0a81bbd60b262eb5) TIKA-2630 -- cleanup unit test
- [`90880e1`](https://github.com/apache/tika/commit/90880e179f98c06ca948b8fed65e902f5b520e6b) TIKA-2630: Wrong height and width metadata for JPEG images ([#255](https://github-redirect.dependabot.com/apache/tika/issues/255))
- Additional commits viewable in [compare view](https://github.com/apache/tika/compare/1.22...1.23)


Updates tika-parsers from 1.22 to 1.23

Changelog

*Sourced from [tika-parsers's changelog](https://github.com/apache/tika/blob/master/CHANGES.txt).* The entries are identical to those listed for tika-core above.

Commits

Identical to the commits listed for tika-core above; see the [compare view](https://github.com/apache/tika/compare/1.22...1.23).


Updates tika-translate from 1.22 to 1.23

Changelog

*Sourced from [tika-translate's changelog](https://github.com/apache/tika/blob/master/CHANGES.txt).* The entries are identical to those listed for tika-core above.

Commits

Identical to the commits listed for tika-core above; see the [compare view](https://github.com/apache/tika/compare/1.22...1.23).


Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot ignore this [patch|minor|major] version` will close this PR and stop Dependabot creating any more for this minor/major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
- `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
- `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
- `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/Norconex/importer/network/alerts).