elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework
https://www.elastic.co/guide/en/enterprise-search/master/index.html

Include Tika in AWS connector so the SUPPORTED_FILETYPE can include csv, json and xml files #167

Closed matt-isett closed 1 year ago

matt-isett commented 1 year ago

Problem Description

The new AWS connector connects to S3, where people place standard data file types, e.g., log.json, table.csv, and old.xml files. Our currently supported types are programming-language source files. S3 isn't the normal place to keep your Python, Ruby, and shell scripts.

Proposed Solution

Since Tika is used throughout Enterprise Search to handle multiple file types, we should use it within the AWS connector so we can expose those file types here as well.
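To make the idea concrete, here is a rough sketch of how the connector could decide per S3 object whether to read it as plain text or hand it to Tika. The set names and the `extraction_route` function are hypothetical and made up for illustration; the connector's actual SUPPORTED_FILETYPE value is not shown in this issue.

```python
from pathlib import PurePosixPath

# Hypothetical sets, for illustration only. The connector's real
# SUPPORTED_FILETYPE value is not reproduced in this issue.
PLAIN_TEXT_FILETYPES = {".py", ".rb", ".sh"}
TIKA_FILETYPES = {".csv", ".json", ".xml"}


def extraction_route(s3_key: str) -> str:
    """Return 'text', 'tika', or 'skip' for an S3 object key."""
    # S3 keys use forward slashes, so PurePosixPath parses them on any OS.
    suffix = PurePosixPath(s3_key).suffix.lower()
    if suffix in PLAIN_TEXT_FILETYPES:
        return "text"  # read and index the content directly
    if suffix in TIKA_FILETYPES:
        return "tika"  # hand the bytes to Tika for extraction
    return "skip"  # unsupported type, e.g. Avro
```

With these assumed sets, `extraction_route("logs/log.json")` routes to Tika, while an `.avro` key is skipped.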

Alternatives

The alternative is to pull in the binary representation and use an ingest pipeline (using Tika) to perform the extraction.
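A minimal sketch of that alternative, assuming the Elasticsearch ingest attachment processor (which is built on Tika) does the extraction. The helper names, the `data` field name, and the document shape are assumptions for illustration, not something the connector defines. The attachment processor expects the source field to hold base64-encoded binary.

```python
import base64


def build_attachment_pipeline(source_field: str = "data") -> dict:
    """Build an ingest pipeline body that extracts text from a binary field."""
    return {
        "description": "Extract text from base64-encoded binary content",
        "processors": [
            # The attachment processor reads base64 content from `field`.
            {"attachment": {"field": source_field}},
            # Drop the raw binary once the text has been extracted.
            {"remove": {"field": source_field}},
        ],
    }


def build_document(s3_key: str, content: bytes, source_field: str = "data") -> dict:
    """Wrap raw S3 object bytes in a document the pipeline can process."""
    encoded = base64.b64encode(content).decode("ascii")
    return {"key": s3_key, source_field: encoded}
```

The pipeline body would be registered once under a name of your choosing, and each document indexed with that pipeline applied. The trade-off is that the full binary travels to Elasticsearch instead of being reduced to text inside the connector.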

Additional Context

This is an awesome connector + the directory one ++

Someone asked about Avro files stored in S3, so I assume the list of file types people want supported is endless. For something like Avro, I would assume we should build an ingest pipeline to handle these non-Tika types.

tarekziade commented 1 year ago

Fixed in https://github.com/elastic/connectors-python/pull/214