Problem Description

Right now we only extract content from a limited list of file types. Let's use the ingest pipeline until Tika is available on edge (https://github.com/elastic/connectors-python/issues/167).

Proposed Solution

Use the attachment processor (https://www.elastic.co/guide/en/elasticsearch/reference/8.5/attachment.html) by adding an attachment field with the base64-encoded content of the file. We also need to put an upper limit on file size: 10 MiB for now (configurable). Notice that all of this is automated in our pipeline; you need to add a field named `_attachment`; if extraction succeeds, it will be replaced in Elastic by a `body` field with the extracted content.
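For context, a minimal sketch of what such a pipeline could look like, built with the Python client. The pipeline id `extract-attachment` and the exact processor options are illustrative assumptions, not necessarily what our pipeline ships:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Hypothetical pipeline id and options, for illustration only.
es.ingest.put_pipeline(
    id="extract-attachment",
    description="Replace `_attachment` with a `body` field of extracted text",
    processors=[
        # Tika-based extraction of the base64 payload into the
        # intermediate `attachment` object.
        {"attachment": {"field": "_attachment", "ignore_missing": True}},
        # Surface the extracted text as `body`.
        {
            "set": {
                "field": "body",
                "value": "{{attachment.content}}",
                "ignore_empty_value": True,
            }
        },
        # Drop the binary payload and the intermediate object.
        {"remove": {"field": ["_attachment", "attachment"], "ignore_missing": True}},
    ],
)
```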
Todo:

[ ] add a function in `utils.py` that converts a file passed as an iterator into a base64-encoded value (see the first sketch after this list)
[ ] use the function in NetworkDrive and S3 to create an `_attachment` field
[ ] modify the functional test so we have a few PDFs to test with
[ ] change `verify.py` to make sure that `_attachment` is gone (see the second sketch after this list)
[ ] redo a perf test to verify it does not blow up in memory
[ ] add a 10 MiB limit on the size of the file; if bigger, discard it with a warning
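A possible shape for the `utils.py` helper, covering the first and last items; the function name, signature, and logger usage are assumptions. Because the cap is enforced while consuming the iterator, memory stays bounded at roughly the cap:

```python
import base64
import logging

logger = logging.getLogger(__name__)

# 10 MiB default; the issue wants this configurable.
DEFAULT_MAX_FILE_SIZE = 10 * 1024 * 1024


def file_to_b64(chunks, max_size=DEFAULT_MAX_FILE_SIZE):
    """Consume an iterator of byte chunks and return the base64-encoded value.

    Returns None with a warning when the file exceeds max_size, so the
    caller can skip the document's `_attachment` field entirely.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        if len(buffer) > max_size:
            logger.warning("File is bigger than %d bytes, discarding", max_size)
            return None
    return base64.b64encode(bytes(buffer)).decode("ascii")
```

A source like NetworkDrive or S3 would then set `doc["_attachment"]` to the return value whenever it is not None.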
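And a sketch of the `verify.py` check, assuming verification can iterate over the indexed documents (the helper name and document shape are assumptions):

```python
def check_attachment_replaced(docs):
    """Assert the pipeline stripped `_attachment` and produced `body`."""
    for doc in docs:
        source = doc["_source"]
        assert "_attachment" not in source, f"_attachment leaked in {doc['_id']}"
        assert "body" in source, f"no extracted content in {doc['_id']}"
```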
Alternatives
Additional Context