elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Ingest-attachment does not handle compressed files #40602

Closed sronsiek closed 5 years ago

sronsiek commented 5 years ago

Elasticsearch version: 6.6.0

Plugins installed: [ingest-attachment]

JVM version (java -version):
openjdk version "11.0.1" 2018-10-16
OpenJDK Runtime Environment 18.9 (build 11.0.1+13)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.1+13, mixed mode)

OS version (uname -a if on a Unix-like system): opensuse 42.3 Linux elastic 4.4.76-1-default #1 SMP Fri Jul 14 08:48:13 UTC 2017 (9a2885c) x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

I'm upgrading an existing app from Elasticsearch v2.1.2 (+ attachment mapper plugin) to v6.6.0 (+ ingest-attachment plugin). In the 2.1.2 version, string searches return hits from inside compressed attachment files (e.g. .tar, .tar.gz), as well as from the usual .pdf, .doc, .xls, etc.

Using the new ingest-attachment plugin, compressed files do not appear to be processed: the content type is correctly detected as "application/gzip", the content length is zero, and no other fields are present in the attachment structure returned by Elasticsearch. For uncompressed files Elasticsearch also returns date, author, language and content fields!

I saw no compression-related options in the docs at https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

I do not know how to get the plugin versions; they're not in the log. Elasticsearch is running in a Docker container built from this Dockerfile:

FROM docker.elastic.co/elasticsearch/elasticsearch:6.6.0

RUN bin/elasticsearch-plugin install --batch ingest-attachment

Steps to reproduce:

Index template:

{
  "index_patterns": "ars*",
  "mappings": {
    "ar": {
      "properties": {
        "actionee": {
          "properties": {
            "fullname": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "name": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "type": "text"
            }
          }
        },
        "attachments": {
          "properties": {
            "description": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "filename": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "filesize": {
              "type": "long"
            },
            "id": {
              "index": false,
              "type": "long"
            },
            "updated_at": {
              "type": "date"
            }
          }
        }
      }
    }
  },
  "order": 0,
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "whitespace"
        }
      }
    },
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "refresh_interval": "1s"
  }
}
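
For completeness, a minimal sketch of installing this template, assuming a local node on localhost:9200, Python's requests library, and that the JSON above is saved as template.json (the filename is an assumption; the template name template-ars matches the one that appears in the logs below):

import json
import requests

ES = "http://localhost:9200"   # assumed local single-node cluster

# Register the index template shown above under the name "template-ars"
# (the name that shows up in the MetaDataIndexTemplateService log line below).
with open("template.json") as f:   # assumed filename for the JSON above
    template = json.load(f)

resp = requests.put(ES + "/_template/template-ars", json=template)
resp.raise_for_status()
print(resp.json())                 # expect {"acknowledged": true}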

Ingest pipeline definition (pipeline.json):

{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.data",
            "indexed_chars": -1
          }
        }
      }
    }
  ]
}
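
And a sketch of how the pipeline and indexing step can be exercised (the pipeline id "attachments", the filenames and the sample text are illustrative), which is how the compressed vs. uncompressed difference shows up:

import base64
import gzip
import json
import requests

ES = "http://localhost:9200"   # assumed local cluster

# Register the pipeline shown above; the id "attachments" is illustrative.
with open("pipeline.json") as f:
    pipeline = json.load(f)
requests.put(ES + "/_ingest/pipeline/attachments", json=pipeline).raise_for_status()

# One plain attachment and one gzip-compressed attachment with the same text.
text = b"searchable text inside the attachment"
doc = {
    "attachments": [
        {"filename": "note.txt",    "data": base64.b64encode(text).decode()},
        {"filename": "note.txt.gz", "data": base64.b64encode(gzip.compress(text)).decode()},
    ]
}

# Index through the pipeline. Per the behaviour described above, the .gz entry
# only comes back with content_type "application/gzip" and a zero content length,
# while the plain entry also gets content, language, date, etc.
r = requests.put(ES + "/ars/ar/1?pipeline=attachments", json=doc)
r.raise_for_status()
print(json.dumps(r.json(), indent=2))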

Provide logs (if relevant): I don't see anything relevant, but here they are for completeness:

Mar 28, 2019 5:25:21 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
[2019-03-28T17:25:21,630][INFO ][o.e.c.m.MetaDataMappingService] [oBnurzh] [ars/sEx9Jo9VQPGK-XOFs6r8Fg] update_mapping [ar]
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
OpenJDK 64-Bit Server VM warning: UseAVX=2 is not supported on this CPU, setting it to UseAVX=1
[2019-03-28T17:32:28,969][INFO ][o.e.e.NodeEnvironment    ] [864g5mA] using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/vdb1)]], net usable_space [46.7gb], net total_space [59.9gb], types [xfs]
[2019-03-28T17:32:28,973][INFO ][o.e.e.NodeEnvironment    ] [864g5mA] heap size [3.9gb], compressed ordinary object pointers [true]
[2019-03-28T17:32:28,977][INFO ][o.e.n.Node               ] [864g5mA] node name derived from node ID [864g5mAqS_WIS-BMu8DH-Q]; set [node.name] to override
[2019-03-28T17:32:28,977][INFO ][o.e.n.Node               ] [864g5mA] version[6.6.0], pid[1], build[default/tar/a9861f4/2019-01-24T11:27:09.439740Z], OS[Linux/4.4.76-1-default/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/11.0.1/11.0.1+13]
[2019-03-28T17:32:28,978][INFO ][o.e.n.Node               ] [864g5mA] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch-12949752479696623950, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Djava.locale.providers=COMPAT, -XX:UseAVX=2, -Des.cgroups.hierarchy.override=/, -Xms4g, -Xmx4g, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=tar]
[2019-03-28T17:32:31,824][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [aggs-matrix-stats]
[2019-03-28T17:32:31,824][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [analysis-common]
[2019-03-28T17:32:31,825][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [ingest-common]
[2019-03-28T17:32:31,825][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [lang-expression]
[2019-03-28T17:32:31,825][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [lang-mustache]
[2019-03-28T17:32:31,826][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [lang-painless]
[2019-03-28T17:32:31,826][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [mapper-extras]
[2019-03-28T17:32:31,826][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [parent-join]
[2019-03-28T17:32:31,827][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [percolator]
[2019-03-28T17:32:31,827][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [rank-eval]
[2019-03-28T17:32:31,827][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [reindex]
[2019-03-28T17:32:31,827][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [repository-url]
[2019-03-28T17:32:31,828][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [transport-netty4]
[2019-03-28T17:32:31,828][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [tribe]
[2019-03-28T17:32:31,828][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-ccr]
[2019-03-28T17:32:31,829][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-core]
[2019-03-28T17:32:31,829][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-deprecation]
[2019-03-28T17:32:31,829][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-graph]
[2019-03-28T17:32:31,830][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-ilm]
[2019-03-28T17:32:31,830][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-logstash]
[2019-03-28T17:32:31,830][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-ml]
[2019-03-28T17:32:31,831][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-monitoring]
[2019-03-28T17:32:31,831][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-rollup]
[2019-03-28T17:32:31,831][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-security]
[2019-03-28T17:32:31,831][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-sql]
[2019-03-28T17:32:31,832][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-upgrade]
[2019-03-28T17:32:31,832][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded module [x-pack-watcher]
[2019-03-28T17:32:31,833][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded plugin [ingest-attachment]
[2019-03-28T17:32:31,833][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded plugin [ingest-geoip]
[2019-03-28T17:32:31,834][INFO ][o.e.p.PluginsService     ] [864g5mA] loaded plugin [ingest-user-agent]
[2019-03-28T17:32:38,546][INFO ][o.e.x.s.a.s.FileRolesStore] [864g5mA] parsed [0] roles from file [/usr/share/elasticsearch/config/roles.yml]
[2019-03-28T17:32:39,429][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [864g5mA] [controller/89] [Main.cc@109] controller (64 bit): Version 6.6.0 (Build bbb4919f4d17a5) Copyright (c) 2019 Elasticsearch BV
[2019-03-28T17:32:40,702][INFO ][o.e.d.DiscoveryModule    ] [864g5mA] using discovery type [zen] and host providers [settings]
[2019-03-28T17:32:42,024][INFO ][o.e.n.Node               ] [864g5mA] initialized
[2019-03-28T17:32:42,025][INFO ][o.e.n.Node               ] [864g5mA] starting ...
[2019-03-28T17:32:42,256][INFO ][o.e.t.TransportService   ] [864g5mA] publish_address {172.17.0.2:9300}, bound_addresses {0.0.0.0:9300}
[2019-03-28T17:32:42,278][INFO ][o.e.b.BootstrapChecks    ] [864g5mA] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-03-28T17:32:45,360][INFO ][o.e.c.s.MasterService    ] [864g5mA] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {864g5mA}{864g5mAqS_WIS-BMu8DH-Q}{cLx7yl7cQFSK4voEG18jqg}{172.17.0.2}{172.17.0.2:9300}{ml.machine_memory=12598550528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
[2019-03-28T17:32:45,368][INFO ][o.e.c.s.ClusterApplierService] [864g5mA] new_master {864g5mA}{864g5mAqS_WIS-BMu8DH-Q}{cLx7yl7cQFSK4voEG18jqg}{172.17.0.2}{172.17.0.2:9300}{ml.machine_memory=12598550528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, reason: apply cluster state (from master [master {864g5mA}{864g5mAqS_WIS-BMu8DH-Q}{cLx7yl7cQFSK4voEG18jqg}{172.17.0.2}{172.17.0.2:9300}{ml.machine_memory=12598550528, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2019-03-28T17:32:45,451][INFO ][o.e.h.n.Netty4HttpServerTransport] [864g5mA] publish_address {172.17.0.2:9200}, bound_addresses {0.0.0.0:9200}
[2019-03-28T17:32:45,452][INFO ][o.e.n.Node               ] [864g5mA] started
[2019-03-28T17:32:45,472][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [864g5mA] Failed to clear cache for realms [[]]
[2019-03-28T17:32:45,543][INFO ][o.e.g.GatewayService     ] [864g5mA] recovered [0] indices into cluster_state
[2019-03-28T17:32:45,828][INFO ][o.e.c.m.MetaDataIndexTemplateService] [864g5mA] adding template [.watch-history-9] for index patterns [.watcher-history-9*]
[2019-03-28T17:32:45,879][INFO ][o.e.c.m.MetaDataIndexTemplateService] [864g5mA] adding template [.triggered_watches] for index patterns [.triggered_watches*]
[2019-03-28T17:32:45,922][INFO ][o.e.c.m.MetaDataIndexTemplateService] [864g5mA] adding template [.watches] for index patterns [.watches*]
[2019-03-28T17:32:45,966][INFO ][o.e.c.m.MetaDataIndexTemplateService] [864g5mA] adding template [.monitoring-logstash] for index patterns [.monitoring-logstash-6-*]
[2019-03-28T17:32:46,035][INFO ][o.e.c.m.MetaDataIndexTemplateService] [864g5mA] adding template [.monitoring-es] for index patterns [.monitoring-es-6-*]
[2019-03-28T17:32:46,077][INFO ][o.e.c.m.MetaDataIndexTemplateService] [864g5mA] adding template [.monitoring-alerts] for index patterns [.monitoring-alerts-6]
[2019-03-28T17:32:46,131][INFO ][o.e.c.m.MetaDataIndexTemplateService] [864g5mA] adding template [.monitoring-beats] for index patterns [.monitoring-beats-6-*]
[2019-03-28T17:32:46,184][INFO ][o.e.c.m.MetaDataIndexTemplateService] [864g5mA] adding template [.monitoring-kibana] for index patterns [.monitoring-kibana-6-*]
[2019-03-28T17:32:46,347][INFO ][o.e.l.LicenseService     ] [864g5mA] license [58339812-447d-4735-b976-96d42134833b] mode [basic] - valid
[2019-03-28T17:34:58,789][WARN ][o.e.d.c.m.MetaDataCreateIndexService] [864g5mA] the default number of shards will change from [5] to [1] in 7.0.0; if you wish to continue using the default of [5] shards, you must manage this on the create index request or with an index template
[2019-03-28T17:34:58,806][INFO ][o.e.c.m.MetaDataCreateIndexService] [864g5mA] [ars] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []
[2019-03-28T17:34:59,406][INFO ][o.e.c.m.MetaDataMappingService] [864g5mA] [ars/ZLpXMDNaRFaSczn8QX_58A] create_mapping [ar]
[2019-03-28T17:34:59,637][INFO ][o.e.c.m.MetaDataDeleteIndexService] [864g5mA] [ars/ZLpXMDNaRFaSczn8QX_58A] deleting index
[2019-03-28T17:34:59,798][INFO ][o.e.c.m.MetaDataIndexTemplateService] [864g5mA] adding template [template-ars] for index patterns [ars*]
[2019-03-28T17:35:00,255][INFO ][o.e.c.m.MetaDataCreateIndexService] [864g5mA] [ars] creating index, cause [auto(bulk api)], templates [template-ars], shards [1]/[0], mappings [ar]
[2019-03-28T17:35:00,322][INFO ][o.e.c.r.a.AllocationService] [864g5mA] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[ars][0]] ...]).
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Mar 28, 2019 5:35:00 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Mar 28, 2019 5:35:01 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
[2019-03-28T17:35:01,599][INFO ][o.e.c.m.MetaDataMappingService] [864g5mA] [ars/3_efZOn0RwGA7_NkEEplQQ] update_mapping [ar]
elasticmachine commented 5 years ago

Pinging @elastic/es-core-features

dadoonet commented 5 years ago

IMO it's a bad idea to support this anyway. It would flatten all the content, so you would never know exactly where the text is coming from. It also means you might end up sending a lot of data to Elasticsearch over the wire. Another problem is that Tika is not super efficient with compressed files and, AFAIK, it needs to write to a temporary dir. All that said, we decided in the past to reduce what ingest-attachment can actually extract, and we kept only common formats like PDF, Open Office, ...

The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.

I'd not try to support compressed files TBH, as it would consume a lot of memory. I'd instead uncompress the files locally and send each of them to ingest, one by one.
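
For illustration, a rough sketch of that approach (Python with requests; the index, type and pipeline id follow the ones used earlier in this issue, and the archive path and document id are just examples):

import base64
import tarfile
import requests

ES = "http://localhost:9200"   # assumed local cluster

def index_archive(path, doc_id):
    """Uncompress a .tar.gz locally and index every member as its own
    attachment entry, so Tika never has to deal with the archive itself."""
    attachments = []
    with tarfile.open(path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            attachments.append({
                "filename": member.name,
                "filesize": member.size,
                "data": base64.b64encode(data).decode(),
            })
    # Same foreach/attachment pipeline as above (the id "attachments" is illustrative).
    r = requests.put(ES + "/ars/ar/" + str(doc_id) + "?pipeline=attachments",
                     json={"attachments": attachments})
    r.raise_for_status()

index_archive("docs.tar.gz", 42)   # path and id are just examples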

My 2 cents.

martijnvg commented 5 years ago

Closing this issue. The extraction logic is far from ideal, since it is unknown which file inside the archive the text content originated from. There is also a runtime risk: decompression is heavy, and a seemingly small archive may contain many files, which would produce a very large document that then needs to be indexed.