elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework
https://www.elastic.co/guide/en/enterprise-search/master/index.html

Plug the ingest attachment #200

Closed tarekziade closed 1 year ago

tarekziade commented 1 year ago

Problem Description

Right now we only extract content from a limited list of file types. Let's use the ingest pipeline until Tika is available on edge (https://github.com/elastic/connectors-python/issues/167).

Proposed Solution

Use the ingest attachment processor (https://www.elastic.co/guide/en/elasticsearch/reference/8.5/attachment.html) by adding an attachment field with the content of the file.

We also need to put an upper limit on the file size: 10M for now (configurable).

Note that all of this is automated in our pipeline: you only need to add a field named _attachment. If the extraction succeeds, that field will be replaced in Elastic by a body field with the extracted content.
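A minimal sketch of what a connector could do on its side, assuming the file content is available as bytes. The helper name `with_attachment` and the constant `DEFAULT_MAX_ATTACHMENT_SIZE` are hypothetical, and "10M" is read here as 10 MiB, which is an assumption. The content is base64-encoded because the ingest attachment processor expects a base64-encoded binary field.

```python
import base64

# Assumption: "10M" in this issue is interpreted as 10 MiB.
DEFAULT_MAX_ATTACHMENT_SIZE = 10 * 1024 * 1024


def with_attachment(doc, content, max_size=DEFAULT_MAX_ATTACHMENT_SIZE):
    """Return a copy of doc with an _attachment field holding the file content.

    Hypothetical helper: if the content is larger than max_size, the copy is
    returned without an _attachment field, so nothing is sent for extraction.
    """
    new_doc = dict(doc)
    if len(content) > max_size:
        return new_doc
    # The attachment processor reads base64-encoded binary data.
    new_doc["_attachment"] = base64.b64encode(content).decode("ascii")
    return new_doc


doc = with_attachment({"_id": "1", "name": "notes.txt"}, b"hello world")
print(doc["_attachment"])  # aGVsbG8gd29ybGQ=
```

Oversized files are skipped rather than truncated in this sketch; whether to discard, truncate, or log them is a design choice this issue leaves open.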

Todo:

Alternatives

Additional Context

serenachou commented 1 year ago

@tarekziade Is the limit for discarding files configurable? And is this using the default pipeline defined for the connectors, or something different?