Closed sronsiek closed 5 years ago
Pinging @elastic/es-core-features
IMO it's a bad idea to support it anyway. It would flatten all the content, so you would never know exactly where the text came from. It also means you might end up sending a lot of data to Elasticsearch over the wire. Another problem is that Tika is not very efficient with compressed files and, AFAIK, needs to write to a temporary directory. All that said, we decided in the past to reduce what ingest-attachment can actually extract, and we kept only common file types like PDF, OpenOffice, ...
The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.
TBH I wouldn't try to support compressed files, as it would consume a lot of memory. Instead, I'd uncompress the files locally and send each of them to ingest, one by one.
My 2 cents.
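For illustration, the "uncompress locally, send one by one" approach could be sketched roughly as below. This is only a minimal sketch, not something the plugin provides: the function name `iter_ingest_documents` and the `filename` field are made up for the example, and the `data` field is an assumption that has to match whatever `field` the attachment processor in your pipeline is configured to read. The actual HTTP call is left out on purpose; each yielded document would be indexed separately through the pipeline (for example with the `pipeline` query parameter on the index request).

```python
import base64
import io
import tarfile


def iter_ingest_documents(archive_bytes):
    """Yield one document per regular file in a .tar or .tar.gz archive.

    Hypothetical helper: "filename" and "data" are assumed field names.
    "data" must match the source field of your attachment processor.
    """
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as archive:
        for member in archive:
            # Skip directories, symlinks and other non-regular entries.
            if not member.isfile():
                continue
            content = archive.extractfile(member).read()
            yield {
                # Keeping the member name preserves where the text came from.
                "filename": member.name,
                # The attachment processor expects base64-encoded input.
                "data": base64.b64encode(content).decode("ascii"),
            }
```

Because every archive member becomes its own document, a search hit can be traced back to the file it came from, which is the part that gets lost when the whole archive is flattened into one attachment.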
Closing this issue. The extraction logic is far from ideal, since there is no way to know which file inside the archive the text content originated from. There is also a runtime risk: uncompressing is expensive, and a seemingly small archive may contain many files, which would produce a very large document that then needs to be indexed.
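To get a feel for how much a seemingly small archive expands, something like the following could be run on the client side before deciding what to send. It is a rough sketch under assumptions: `archive_expansion` is a hypothetical helper name, it only handles tar-based archives, and it relies on the sizes recorded in the tar headers.

```python
import io
import tarfile


def archive_expansion(archive_bytes):
    """Return (file_count, uncompressed_bytes, compressed_bytes).

    Hypothetical helper for tar / tar.gz archives. Sizes are taken from
    the tar headers, so file contents are never held in memory at once.
    """
    file_count = 0
    uncompressed = 0
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as archive:
        for member in archive:
            if member.isfile():
                file_count += 1
                uncompressed += member.size
    return file_count, uncompressed, len(archive_bytes)
```

A client could compare the uncompressed total against a limit of its own choosing and skip or split archives that would otherwise turn into one very large document.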
Elasticsearch version 6.6.0
Plugins installed: [ingest attachment]
JVM version (java -version): openjdk version "11.0.1" 2018-10-16 OpenJDK Runtime Environment 18.9 (build 11.0.1+13) OpenJDK 64-Bit Server VM 18.9 (build 11.0.1+13, mixed mode)
OS version (uname -a if on a Unix-like system): opensuse 42.3 Linux elastic 4.4.76-1-default #1 SMP Fri Jul 14 08:48:13 UTC 2017 (9a2885c) x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
I'm upgrading an existing app from elasticsearch v2.1.2 (+ attachment mapper plugin) to v6.6.0 (+ ingest attachment plugin). In the 2.1.2 version, string searches return hits from within compressed attachment files (e.g. .tar, .tar.gz), as well as the usual .pdf, .doc, .xls etc.
Using the new ingest-attachment plugin, I see that compressed files do not appear to be processed: content-type is correctly deduced as "application/gzip", content-length is zero and no other fields are present in the attachment structure returned by elastic. For uncompressed files elastic also returns date, author, language and content fields!
I saw no compression related options in the docs at https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html
I do not know how to get the plugin versions - they're not in the log. elastic is running in a docker container produced with this Dockerfile content:
Steps to reproduce:
Index template:
Ingest-pipeline template:
Provide logs (if relevant): I don't see anything relevant - but here it is for completeness: