Open llermaly opened 1 year ago
The Attachment Processor should extract any of the mail formats supported by Tika. Have you adjusted the supported file types to test extraction via pipelines?
Looking at the code, I think this is actually already the output of the Attachment Processor: https://github.com/elastic/connectors-python/blob/35bccad7fa1406c413fe3c870c25ce2db0faf64c/connectors/sources/gmail.py#L82-L100
CC @timgrein
Problem Description
Currently the connector extracts the entire raw email, which in most cases is insufficient for correct usage, especially when adding ML pipelines.
After some clarification from @seanstory and @danajuratoni, I now understand that parsing emails is not trivial, but the expectation is to see just text, as with a web crawler.
Proposed Solution
That said, we can only offer a best-effort approach to parsing the text, because there are countless ways to build an email.
@seanstory proposed using the attachment processor: send the body as an .eml file and let Tika do the extraction. I like this because it aligns with the rest of the stack (compare the crawler's "ingest full HTML" option).
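To illustrate what "best effort" means here, below is a minimal sketch using Python's standard-library `email` package. It is not what the connector does today and not the approach proposed in this issue; the function name `best_effort_text` and the sample message are made up for illustration. It prefers a `text/plain` part and falls back to `text/html`, in which case the returned string is still raw HTML markup.

```python
from email import message_from_bytes, policy


def best_effort_text(raw_email: bytes) -> str:
    """Return the most text-like body of a raw RFC 822 message (hypothetical helper)."""
    # policy.default gives us the modern EmailMessage API (get_body, get_content).
    msg = message_from_bytes(raw_email, policy=policy.default)
    # Prefer plain text; fall back to HTML if no plain-text part exists.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        # e.g. a message that consists only of attachments
        return ""
    return part.get_content().strip()


# Made-up sample message, for illustration only.
raw = (
    b"From: a@example.com\r\n"
    b"To: b@example.com\r\n"
    b"Subject: Hello\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Just the text, please.\r\n"
)
extracted = best_effort_text(raw)
```

Even this simple version has to make choices (which part wins, what to do with HTML-only mail), which is why any parser here can only be best effort.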
We could make this the default behavior with the option to disable it to support edge cases, and add documentation about how to handle those edge cases.
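A rough sketch of how the proposal could look, under some assumptions: the pipeline ID `gmail-eml-extraction`, the field names `raw_eml` and `email`, and the `extract_text` toggle are all hypothetical and not existing connector settings. The sketch only builds the request bodies, so it runs without a cluster. The `attachment` processor expects the source field to be base64-encoded, and the `remove` processor drops the encoded copy after extraction.

```python
import base64

PIPELINE_ID = "gmail-eml-extraction"  # hypothetical pipeline name

# Body for an ingest pipeline that runs Tika (via the attachment
# processor) over the raw email and then removes the base64 source field.
pipeline_body = {
    "description": "Extract text from raw .eml content",
    "processors": [
        {"attachment": {"field": "raw_eml", "target_field": "email"}},
        {"remove": {"field": "raw_eml"}},
    ],
}


def index_request(raw_email: bytes, extract_text: bool = True) -> dict:
    """Build index-request arguments; `extract_text` is a hypothetical toggle."""
    if extract_text:
        # Default behavior: ship the raw email base64-encoded through the pipeline.
        return {
            "document": {"raw_eml": base64.b64encode(raw_email).decode("ascii")},
            "pipeline": PIPELINE_ID,
        }
    # Opt-out for edge cases: index the raw email as-is, no pipeline.
    return {"document": {"body": raw_email.decode("utf-8", errors="replace")}}


raw = b"Subject: Hello\r\n\r\nJust the text, please.\r\n"
with_extraction = index_request(raw)
without_extraction = index_request(raw, extract_text=False)
```

The pipeline body would be registered once with `PUT _ingest/pipeline/gmail-eml-extraction`, and the returned dict is shaped so it could be passed as keyword arguments to an index call alongside the target index name. With extraction enabled, the extracted text would be expected under `email.content`.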