apache / camel-quarkus

Apache Camel Quarkus
https://camel.apache.org
Apache License 2.0

Tika pdf failure after upgrade to pdfBox in Camel, requires new quarkiverse-tika #5234

Open JiriOndrusek opened 1 year ago

JiriOndrusek commented 1 year ago

Bug description

Camel upgraded pdfbox to 3.x (https://issues.apache.org/jira/browse/CAMEL-19796).

Pdfbox 3.x is not backward compatible with 2.x, so the quarkiverse-tika used by the tika extension fails with the new pdfbox.

Apache tika is aware of the new version of pdfbox and the upgrade ticket is already in progress - see https://issues.apache.org/jira/browse/TIKA-3347

As soon as a new version of Apache Tika is released, it has to be adopted by quarkiverse-tika, and that new quarkiverse-tika version then has to be adopted by camel-quarkus.

I'm disabling the tika tests that use pdfbox until the problem is solved (on camel-main).

JiriOndrusek commented 1 year ago

@ppalaga @jamesnetherton We have to wait for the new release of quarkiverse-tika that supports the new pdfbox. There are probably not many other options: pdfbox is involved in several extensions (e.g. fop, tika, pdf), so keeping pdfbox on 2.x may be complicated. WDYT?

jamesnetherton commented 1 year ago

We have to wait for the new release quarkiverse-tika

Or we propose to revert the upgrade in Camel. I assume the same issue exists there if you try to bring the tika & pdf components together in the same app?

JiriOndrusek commented 1 year ago

We have to wait for the new release quarkiverse-tika

Or we propose to revert the upgrade in Camel. I assume the same issue exists there if you try to bring the tika & pdf components together in the same app?

The error only occurs when tika parses a pdf file. Other tika functionality should work even when the pdf extension is present in the same app.

Reverting the Camel change and postponing it until quarkiverse-tika supports pdfbox 3.x would be an easy solution.

oscerd commented 1 year ago

There is no point in postponing the change in Camel. Not for special reasons, but just because we cannot base the core development on what Quarkus/Quarkiverse does. It's dangerous and not healthy.

oscerd commented 1 year ago

If errors appear in Tika on plain Camel, then it makes sense to wait for a Tika release supporting pdfbox 3.x, but this is not evident from the tests.

JiriOndrusek commented 1 year ago

@oscerd I'll add some tests covering pdf to the tika component, because I cannot see any pdf file there - https://github.com/apache/camel/tree/main/components/camel-tika/src/test/resources

If any trouble emerges, we can discuss what to do next. Does that sound ok?

oscerd commented 1 year ago

Yes, it does.

oscerd commented 1 year ago

Also if you check in the SBOM for camel, pdfbox is used explicitly only in camel-pdf and camel-fop.

Camel-tika is using only:

    {
      "ref" : "pkg:maven/org.apache.camel/camel-tika@4.1.0-SNAPSHOT?type=jar",
      "dependsOn" : [
        "pkg:maven/org.apache.camel/camel-support@4.1.0-SNAPSHOT?type=jar",
        "pkg:maven/org.apache.tika/tika-core@2.8.0?type=jar",
        "pkg:maven/org.apache.tika/tika-parser-html-commons@2.8.0?type=jar",
        "pkg:maven/org.apache.tika/tika-parser-text-module@2.8.0?type=jar"
      ]
    }

Neither the tika html-commons module nor the tika text module uses pdfbox. It's only something related to the quarkiverse extension: https://github.com/quarkiverse/quarkus-tika/blob/main/pom.xml#L63
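The same check can be done mechanically on the `dependsOn` list. Below is a minimal sketch, assuming the four package URLs are copied by hand from the SBOM fragment above; the class and method names (`SbomPdfBoxCheck`, `pdfboxRefs`) are hypothetical. It only looks at the dependencies listed directly in that fragment, not at anything they pull in transitively.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SbomPdfBoxCheck {

    // Hypothetical helper: keeps only package URLs in the org.apache.pdfbox Maven group.
    static List<String> pdfboxRefs(List<String> dependsOn) {
        return dependsOn.stream()
                .filter(purl -> purl.startsWith("pkg:maven/org.apache.pdfbox/"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Entries copied from the camel-tika "dependsOn" array quoted above.
        List<String> camelTikaDeps = Arrays.asList(
                "pkg:maven/org.apache.camel/camel-support@4.1.0-SNAPSHOT?type=jar",
                "pkg:maven/org.apache.tika/tika-core@2.8.0?type=jar",
                "pkg:maven/org.apache.tika/tika-parser-html-commons@2.8.0?type=jar",
                "pkg:maven/org.apache.tika/tika-parser-text-module@2.8.0?type=jar");

        // An empty result means none of the listed dependencies is a pdfbox artifact.
        System.out.println("pdfbox references: " + pdfboxRefs(camelTikaDeps));
    }
}
```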

JiriOndrusek commented 1 year ago

I missed that fact, though having pdf as part of the tika test coverage makes sense nevertheless (but with lower priority).

JiriOndrusek commented 1 year ago

@oscerd I tried to create a test which parses pdf (using plain Camel). For that purpose I had to add:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parser-pdf-module</artifactId>
    <version>${tika-version}</version> <!-- 2.8.0 -->
</dependency>

to get PdfParser for Tika. This dependency brings PdfBox 2.x (see https://mvnrepository.com/artifact/org.apache.tika/tika-parser-pdf-module/2.9.0)

Therefore, a user who wants to use the pdf parser for Tika might hit a dependency conflict if, e.g., camel-pdf is also part of the project. The failure is caused by the incompatibility between pdfbox 2.x and 3.x.

java.lang.NoSuchMethodError: 'org.apache.pdfbox.pdmodel.PDDocument org.apache.pdfbox.pdmodel.PDDocument.load(java.io.InputStream, java.lang.String, org.apache.pdfbox.io.MemoryUsageSetting)'
    at org.apache.tika.parser.pdf.PDFParser.getPDDocument(PDFParser.java:421)
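The `NoSuchMethodError` is consistent with pdfbox 3.x having dropped the static `PDDocument.load(...)` overloads that the Tika 2.x `PDFParser` was compiled against (loading moved to `org.apache.pdfbox.Loader` in 3.x). A minimal sketch of a classpath probe follows, using reflection only so that it compiles without pdfbox; the class name `PdfBoxApiProbe` and the labels it returns are hypothetical, and treating "no legacy `load` method" as "3.x" is an assumption of the sketch.

```java
import java.io.InputStream;

class PdfBoxApiProbe {

    private static final String PD_DOCUMENT = "org.apache.pdfbox.pdmodel.PDDocument";

    // Returns true if the named class can be found on the current classpath.
    static boolean classExists(String name) {
        try {
            Class.forName(name, false, PdfBoxApiProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns true if PDDocument still has the 2.x-style static load(InputStream) method.
    static boolean hasLegacyLoad() {
        try {
            Class<?> doc = Class.forName(PD_DOCUMENT, false, PdfBoxApiProbe.class.getClassLoader());
            doc.getMethod("load", InputStream.class);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
            return false;
        }
    }

    // Hypothetical labels: "absent", "2.x" or "3.x".
    static String detect() {
        if (!classExists(PD_DOCUMENT)) {
            return "absent";
        }
        // Assumption: a PDDocument without the legacy load method is treated as 3.x.
        return hasLegacyLoad() ? "2.x" : "3.x";
    }

    public static void main(String[] args) {
        System.out.println("pdfbox API on classpath: " + detect());
    }
}
```

Run in an application that contains both camel-pdf and tika-parser-pdf-module, a probe like this would show which pdfbox line won dependency resolution, and therefore whether the Tika pdf parser can be expected to work.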

I can imagine that this use case does not make sense: parsing pdf with camel-tika while also depending on camel-pdf.

If using camel-tika (for pdf parsing) together with camel-pdf is not supported, enforcing a similar restriction for camel-quarkus might solve the problem. (Currently we test the tika extension for parsing pdf files.)

--- edited

The tika version is 2.8.0, not 2.9.0 (the behavior is the same).

oscerd commented 1 year ago

To me this is really a corner case. I understand the point, but it's a tika problem.

JiriOndrusek commented 1 year ago

I agree, so the right approach for camel-quarkus is to stop testing pdf parsing with Tika (as this is a corner case in plain Camel, which might not work in some cases). Camel-pdf should be used instead.

oscerd commented 1 year ago

Yes, once they align we could re-use tika for parsing PDF