elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Support setting parameters for Tika used in Ingest Attachment Processor Plugin #71164

Open PengYi-Elastic opened 3 years ago

PengYi-Elastic commented 3 years ago

Since the ingest attachment plugin uses Apache Tika, it would be helpful to be able to set Tika parameters directly.

Here is the config file used (see https://tika.apache.org/1.25/configuring.html):

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="concatenatePhoneticRuns" type="bool">false</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.microsoft.OfficeParser">
      <params>
        <param name="concatenatePhoneticRuns" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>

When running ES with the config file placed in java.io.tmpdir, the security manager raises an access-denied exception:

$ TIKA_CONFIG=/tmp/elasticsearch-1234/config.xml ES_TMPDIR=/tmp/elasticsearch-1234/ ./bin/elasticsearch
$ sudo jinfo 31562 | grep java.io.tmpdir
java.io.tmpdir=/tmp/elasticsearch-1234/

Stack Trace

Caused by: java.security.AccessControlException: access denied ("java.lang.reflect.ReflectPermission" "suppressAccessChecks")
    at java.base/java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
    at java.base/java.security.AccessController.checkPermission(AccessController.java:1036)
    at java.base/java.lang.SecurityManager.checkPermission(SecurityManager.java:408)
    at java.base/java.lang.reflect.AccessibleObject.checkPermission(AccessibleObject.java:87)
    at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:180)
    at org.apache.tika.config.Param.getTypedValue(Param.java:209)
    at org.apache.tika.config.Param.getValue(Param.java:133)
    at org.apache.tika.utils.AnnotationUtils.assignFieldParams(AnnotationUtils.java:115)
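
To illustrate what the trace seems to point at, here is a minimal sketch of a reflective string-to-typed-value conversion. It is not Tika's source: the class name `TypedValueSketch` and the method `typedValue` are hypothetical, and the assumption that the `bool` param is converted through a single-`String` constructor is inferred only from the `Constructor.setAccessible` frame above. The one documented fact it relies on is that `setAccessible` asks an installed security manager for `ReflectPermission("suppressAccessChecks")`.

```java
import java.lang.reflect.Constructor;

// Hypothetical sketch, NOT Tika's actual code: mimics a reflective
// String -> typed value conversion like the one the stack trace suggests.
final class TypedValueSketch {

    // Convert the raw text of a <param> element to the requested type by
    // calling that type's single-String constructor reflectively.
    static <T> T typedValue(Class<T> type, String raw) {
        try {
            Constructor<T> ctor = type.getConstructor(String.class);
            // If a SecurityManager is installed, this call is checked against
            // ReflectPermission("suppressAccessChecks"). Without that grant it
            // throws AccessControlException, matching the trace above.
            ctor.setAccessible(true);
            return ctor.newInstance(raw);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot convert '" + raw + "' to " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        // Same value as the concatenatePhoneticRuns param in the config.
        Boolean flag = typedValue(Boolean.class, "false");
        System.out.println("concatenatePhoneticRuns = " + flag);
    }
}
```

Run without a security manager, this sketch completes normally. Under Elasticsearch's security manager, the `setAccessible` call would be the step that needs the extra permission, even though the constructor being looked up here is already public.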

Attached are the Excel file and the request used to reproduce the issue; the config is the one shown above. The Kibana request uses the binary content of tokyo.xlsx. Attachments: tokyo.xlsx, kibana_request.txt

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)