codelibs / fess

Fess is a very powerful and easily deployable Enterprise Search Server.
https://fess.codelibs.org
Apache License 2.0

PDF's attachments not indexed #1994

freestyle68 closed 5 years ago

freestyle68 commented 5 years ago

Hi,

Currently, attachments embedded in a PDF are not indexed. Is there a workaround?

Thank you

marevol commented 5 years ago

It depends on Tika and PDFBox.

freestyle68 commented 5 years ago

But the standalone Tika app does extract the content of attachments.

Download the Tika app from https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.20.jar

Create two folders, in and out, and run:

java -jar tika-app-1.20.jar -T -i /"path for in" -o /"path for out"

In the out folder I can see all the attachments extracted: PDF, Excel, etc.

For the attached sample.pdf I get the output sample.pdf.txt. I attach both documents.

sample.pdf sample.pdf.txt

Why is this not possible with Fess?

marevol commented 5 years ago

I'll fix it in a future release...

freestyle68 commented 5 years ago

With this commit:

https://github.com/codelibs/fess-crawler/commit/9c41f5e8d6adc2b3a31b37cdc6d0be8d6b31d1a1

I get a java.lang.ClassCastException with filesystem crawling:

Path: file:/pdfs/

Log:

org.codelibs.fess.crawler.exception.CrawlingAccessException: Could not serialize object
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:84)
    at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:77)
    at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:330)
    at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:240)
    at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:176)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: java.lang.ClassCastException: java.base/java.lang.String cannot be cast to java.base/[Ljava.lang.Object;
    at org.codelibs.fess.crawler.transformer.FessTransformer.putResultDataBody(FessTransformer.java:117)
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.generateData(AbstractFessFileTransformer.java:244)
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:82)
    ...
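For readers unfamiliar with this kind of failure: the "Caused by" line says that a value expected to be an Object[] was actually a String. The sketch below reproduces that class of error in isolation and shows one defensive way to read such a value. It is not Fess code. The class BodyValueSketch, its methods, and the "content" key are hypothetical names chosen for illustration, and the actual fix in Fess may look quite different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not taken from Fess: a map value that may be
// either a single String or an Object[] of values.
class BodyValueSketch {

    // Unsafe: assumes the stored value is always an array.
    // Throws ClassCastException when the value is a plain String.
    static Object[] unsafeRead(Map<String, Object> data, String key) {
        return (Object[]) data.get(key);
    }

    // Defensive: accepts an array, a single value, or a missing key.
    static List<Object> safeRead(Map<String, Object> data, String key) {
        Object value = data.get(key);
        List<Object> result = new ArrayList<>();
        if (value instanceof Object[]) {
            for (Object item : (Object[]) value) {
                result.add(item);
            }
        } else if (value != null) {
            result.add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> data = new HashMap<>();
        data.put("content", "extracted text"); // a String, not an array
        try {
            unsafeRead(data, "content");
        } catch (ClassCastException e) {
            System.out.println("unsafe read failed: " + e.getMessage());
        }
        System.out.println("safe read: " + safeRead(data, "content"));
    }
}
```

Checking the runtime type before casting avoids the exception whichever shape the value takes.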

marevol commented 5 years ago

I think you used wrong versions.

freestyle68 commented 5 years ago

It happens with several document types: pdf, doc, pptx, etc.

I attach sample documents that trigger this error. They are from https://openpreservation.org/technology/corpora/govdocs/

My Java version:

openjdk 10.0.2 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4, mixed mode)

This problem was introduced by commit https://github.com/codelibs/fess/tree/f341a4e2b29d7130bab5b058d1d989c6e3f1634f; everything worked before it.

docs.zip

marevol commented 5 years ago

How did you build Fess?

freestyle68 commented 5 years ago

mvn antrun:run
mvn package -DskipTests

Then I used fess-13.0.0-SNAPSHOT.zip.

marevol commented 5 years ago

Thanks, I found it. Fixed in #2009.

freestyle68 commented 5 years ago

Hi,

Regarding the original question, Fess now also indexes the content of attachments. Tested with pdf, msg, eml. So thanks for your commit.

But the attachment filename is still missing from the Fess index, while Tika can extract this too. For example, with

java -jar tika-app-1.20.jar -x sample.pdf

I get

<div source="attachment" class="embedded" id="attachment.pdf"/>
<div class="acroform"><ol/>
</div>

and with a msg file I get similar output:

<div class="attachment-entry"><h1>attachment.pdf</h1>
<div class="package-entry"><h1>attachment.pdf</h1>
<div class="page"><p/>
</div>
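To make the request concrete, here is a minimal sketch of pulling the attachment filenames out of XHTML fragments like the two above. It assumes the filename appears either in the id attribute of a div with class="embedded" or in the h1 directly inside a div with class="attachment-entry", exactly as in these samples. The class AttachmentNameSketch and its method are hypothetical names, this is not how Fess or Tika does it internally, and it uses regular expressions only because the fragments shown are not complete documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects attachment names from Tika XHTML output,
// based only on the two fragment shapes shown in this thread.
class AttachmentNameSketch {

    // Matches an opening div tag that carries class="embedded".
    private static final Pattern EMBEDDED_DIV =
            Pattern.compile("<div\\b[^>]*\\bclass=\"embedded\"[^>]*>");

    // Captures the value of the id attribute inside a single tag.
    private static final Pattern ID_ATTR =
            Pattern.compile("\\bid=\"([^\"]*)\"");

    // Captures the h1 text that directly follows an attachment-entry div.
    private static final Pattern ATTACHMENT_ENTRY =
            Pattern.compile("<div class=\"attachment-entry\">\\s*<h1>([^<]*)</h1>");

    static List<String> attachmentNames(String xhtml) {
        List<String> names = new ArrayList<>();
        Matcher div = EMBEDDED_DIV.matcher(xhtml);
        while (div.find()) {
            Matcher id = ID_ATTR.matcher(div.group());
            if (id.find()) {
                names.add(id.group(1));
            }
        }
        Matcher entry = ATTACHMENT_ENTRY.matcher(xhtml);
        while (entry.find()) {
            names.add(entry.group(1));
        }
        return names;
    }
}
```

For a complete XHTML document an XML parser would be more robust than regular expressions, and the sketch does not decode escaped entities in the names.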

Please do not forget to add this feature in the future.

freestyle68 commented 5 years ago

Perhaps it was my fault, or perhaps it was your commit, but with the current version (12.6, latest commit) I can also search by attachment filename.

So this problem is fixed.

Thanks