Problem Description

Right now we only extract content from a limited list of file types. Let's use the ingest pipeline until Tika is available on edge (https://github.com/elastic/connectors-python/issues/167).

Proposed Solution

Use the attachment processor (https://www.elastic.co/guide/en/elasticsearch/reference/8.5/attachment.html) by adding an attachment field with the base64-encoded content of the file. We also need to put an upper limit on file size: 10 MiB for now (configurable). Notice that all of this is automated in our pipeline; you need to add a field named `_attachment`; if extraction succeeds, it will be replaced in Elastic by a `body` field with the extracted content.
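For context, a minimal sketch of what such a pipeline could look like, built with the Python client. The pipeline id `extract-attachment` and the exact processor options are illustrative assumptions, not necessarily what our pipeline ships:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical pipeline id and options, for illustration only.
es.ingest.put_pipeline(
    id="extract-attachment",
    description="Replace `_attachment` with a `body` field of extracted text",
    processors=[
        # Tika-based extraction of the base64 payload into the
        # intermediate `attachment` object.
        {"attachment": {"field": "_attachment", "ignore_missing": True}},
        # Surface the extracted text as `body`.
        {
            "set": {
                "field": "body",
                "value": "{{attachment.content}}",
                "ignore_empty_value": True,
            }
        },
        # Drop the binary payload and the intermediate object.
        {"remove": {"field": ["_attachment", "attachment"], "ignore_missing": True}},
    ],
)
```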
Todo:

[ ] add a function in `utils.py` that converts a file passed as an iterator into a base64-encoded value (see the first sketch after this list)
[ ] use the function in NetworkDrive and S3 to create an `_attachment` field
[ ] modify the functional test so we have a few PDFs to test with
[ ] change `verify.py` to make sure that `_attachment` is gone (see the second sketch after this list)
[ ] redo a perf test to verify it does not blow up in memory
[ ] add a 10 MiB limit on the size of the file; if bigger, discard it with a warning
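A possible shape for the `utils.py` helper, covering the first and last items; the function name, signature, and logger usage are assumptions. Because the cap is enforced while consuming the iterator, memory stays bounded at roughly the cap:

```python
import base64
import logging

logger = logging.getLogger(__name__)

# 10 MiB default; the issue wants this configurable.
DEFAULT_MAX_FILE_SIZE = 10 * 1024 * 1024


def file_to_b64(chunks, max_size=DEFAULT_MAX_FILE_SIZE):
    """Consume an iterator of byte chunks and return the base64-encoded value.

    Returns None with a warning when the file exceeds max_size, so the
    caller can skip the document's `_attachment` field entirely.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_size:
            logger.warning("File is bigger than %d bytes, discarding", max_size)
            return None
    return base64.b64encode(bytes(buffer)).decode("ascii")
```

A source like NetworkDrive or S3 would then set `doc["_attachment"]` to the return value whenever it is not None.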
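And a sketch of the `verify.py` check, assuming verification can iterate over the indexed documents (the helper name and document shape are assumptions):

```python
def check_attachment_replaced(docs):
    """Assert the pipeline stripped `_attachment` and produced `body`."""
    for doc in docs:
        source = doc["_source"]
        assert "_attachment" not in source, f"_attachment leaked in {doc['_id']}"
        assert "body" in source, f"no extracted content in {doc['_id']}"
```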
Alternatives
Additional Context