Closed freestyle68 closed 5 years ago
It depends on Tika and PDFBox.
But the standalone Tika app does extract the content of attachments.
With the Tika app (https://www.apache.org/dyn/closer.cgi/tika/tika-app-1.20.jar),
create two folders, in and out, and launch:
java -jar tika-app-1.20.jar -T -i /"path for in" -o /"path for out"
In the out folder I can see all attachments extracted: PDF, Excel, etc.
For the attached sample.pdf I get the output sample.pdf.txt. I attach the documents.
Why is this not possible with Fess?
I'll fix it in a future release...
With this commit:
https://github.com/codelibs/fess-crawler/commit/9c41f5e8d6adc2b3a31b37cdc6d0be8d6b31d1a1
I get a java.lang.ClassCastException with filesystem crawling:
Path: file:/pdfs/
Log: org.codelibs.fess.crawler.exception.CrawlingAccessException: Could not serialize object
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:84)
    at org.codelibs.fess.crawler.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:77)
    at org.codelibs.fess.crawler.CrawlerThread.processResponse(CrawlerThread.java:330)
    at org.codelibs.fess.crawler.FessCrawlerThread.processResponse(FessCrawlerThread.java:240)
    at org.codelibs.fess.crawler.CrawlerThread.run(CrawlerThread.java:176)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: java.lang.ClassCastException: java.base/java.lang.String cannot be cast to java.base/[Ljava.lang.Object;
    at org.codelibs.fess.crawler.transformer.FessTransformer.putResultDataBody(FessTransformer.java:117)
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.generateData(AbstractFessFileTransformer.java:244)
    at org.codelibs.fess.crawler.transformer.AbstractFessFileTransformer.transform(AbstractFessFileTransformer.java:82)
    ...
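For readers unfamiliar with the "Caused by" line: it says a value that was a String at runtime was cast to Object[]. The sketch below reproduces that kind of failure in isolation and shows one defensive way to accept either shape. It is only an illustration; BodyValueDemo and toBody are hypothetical names, and this is not the Fess code nor the fix that was applied.

```java
public class BodyValueDemo {

    // Hypothetical helper: accepts a value that may be a single String
    // or an Object[] and returns one space-joined string.
    static String toBody(Object value) {
        if (value instanceof Object[]) {
            Object[] parts = (Object[]) value;
            StringBuilder sb = new StringBuilder();
            for (Object part : parts) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(part);
            }
            return sb.toString();
        }
        // Anything else (including a plain String) is converted directly.
        return String.valueOf(value);
    }

    public static void main(String[] args) {
        Object value = "extracted text";
        try {
            // Unchecked assumption that the value is an array:
            // this throws ClassCastException because value is a String.
            Object[] parts = (Object[]) value;
            System.out.println(parts.length);
        } catch (ClassCastException e) {
            System.out.println("cast failed: " + e.getMessage());
        }
        // The defensive helper handles both shapes.
        System.out.println(toBody(value));
        System.out.println(toBody(new Object[] {"page one", "page two"}));
    }
}
```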
I think you used the wrong versions.
It happens with several docs: pdf, doc, pptx, etc.
I attach a sample of docs that trigger this error. They are from https://openpreservation.org/technology/corpora/govdocs/
My Java version:
openjdk 10.0.2 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4, mixed mode)
This problem was introduced by commit https://github.com/codelibs/fess/tree/f341a4e2b29d7130bab5b058d1d989c6e3f1634f ; before that commit everything worked.
How did you create Fess?
mvn antrun:run
mvn package -DskipTests
Then I used fess-13.0.0-SNAPSHOT.zip
Thanks, I found it. Fixed in #2009.
Hi,
Regarding the original question: Fess now also indexes attachment content. Tested with pdf, msg, eml. So thanks for your commit.
But the attachment filename is still missing from the Fess index, although Tika can extract it as well. For example, with
java -jar tika-app-1.20.jar -x sample.pdf
I get
<div source="attachment" class="embedded" id="attachment.pdf"/>
<div class="acroform"><ol/>
</div>
and with a msg file I get a similar output:
<div class="attachment-entry"><h1>attachment.pdf</h1>
<div class="package-entry"><h1>attachment.pdf</h1>
<div class="page"><p/>
</div>
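As an illustration of where the filename sits in that output, here is a minimal sketch that pulls attachment names out of Tika XHTML fragments shaped like the two shown above. The class TikaAttachmentNames and its method are hypothetical, it uses plain regular expressions from the Java standard library instead of the Tika API, and it assumes the attribute order and class names seen in these samples (class before id, and the attachment-entry / package-entry headings).

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TikaAttachmentNames {

    // Matches <div ... class="embedded" ... id="NAME"...> as in the PDF sample.
    private static final Pattern EMBEDDED_ID = Pattern.compile(
            "<div\\b[^>]*\\bclass=\"embedded\"[^>]*\\bid=\"([^\"]+)\"");

    // Matches <div class="attachment-entry"><h1>NAME</h1> (or package-entry)
    // as in the msg sample.
    private static final Pattern ENTRY_TITLE = Pattern.compile(
            "<div\\b[^>]*\\bclass=\"(?:attachment|package)-entry\"[^>]*>\\s*<h1>([^<]+)</h1>");

    // Returns the distinct attachment names in order of first appearance.
    static Set<String> extractNames(String xhtml) {
        Set<String> names = new LinkedHashSet<>();
        for (Pattern pattern : new Pattern[] {EMBEDDED_ID, ENTRY_TITLE}) {
            Matcher matcher = pattern.matcher(xhtml);
            while (matcher.find()) {
                names.add(matcher.group(1).trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String pdfOutput =
                "<div source=\"attachment\" class=\"embedded\" id=\"attachment.pdf\"/>\n"
                + "<div class=\"acroform\"><ol/>\n</div>";
        String msgOutput =
                "<div class=\"attachment-entry\"><h1>attachment.pdf</h1>\n"
                + "<div class=\"package-entry\"><h1>attachment.pdf</h1>";
        System.out.println(extractNames(pdfOutput));
        System.out.println(extractNames(msgOutput));
    }
}
```

A real integration would more likely read the names from Tika's parser metadata than from its XHTML text; the regex approach here is only meant to show that the name is present in the output.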
Please do not forget to add this feature in the future.
Perhaps it was my mistake, or perhaps it was your commit, but with the current version (12.6, latest commit) I can also search by attachment filename.
So this problem is fixed.
Thanks
Hi,
Currently, an attachment embedded in a PDF is not indexed. Is there a workaround?
Thank you