elastic / connectors

Source code for all Elastic connectors, developed by the Search team at Elastic, and home of our Python connector development framework
https://www.elastic.co/guide/en/enterprise-search/master/index.html

[Gmail Connector] Email cleaning #1369

Open llermaly opened 1 year ago

llermaly commented 1 year ago

Problem Description

Currently the connector extracts the entire raw email, which in most cases is not suitable for direct use, especially when adding ML pipelines:

image

After some clarification from @seanstory and @danajuratoni, I understand that parsing emails is not trivial, but the expectation is to see just text, as with a web crawler.

Proposed Solution

That said, we can only offer a best-effort approach to parsing the text, because there are countless ways to build an email.
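To illustrate what "best-effort" could mean, here is a minimal sketch using only the Python standard library. It is not the connector's code: the function name `extract_text` is hypothetical, and the tag stripping is deliberately naive and assumes simple HTML bodies.

```python
import html
import re
from email import message_from_bytes, policy


def extract_text(raw_email: bytes) -> str:
    """Best-effort plain-text extraction from a raw RFC 822 message.

    Hypothetical helper for illustration only.
    """
    msg = message_from_bytes(raw_email, policy=policy.default)
    # Prefer a text/plain body; fall back to text/html if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    content = body.get_content()
    if body.get_content_type() == "text/html":
        # Naive tag removal; real-world HTML emails need a proper parser.
        content = re.sub(r"<[^>]+>", " ", content)
        content = html.unescape(content)
    # Collapse runs of whitespace into single spaces.
    return " ".join(content.split())
```

Anything beyond simple bodies (nested multiparts, inline images, quoted replies, signatures) would still come through imperfectly, which is why this can only be best-effort.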

@seanstory proposed using the attachment processor: send the body as an .eml file and let Tika do the extraction. I like this because it aligns with the stack (similar to the crawler's "ingest full HTML" option).

We could make this the default behavior with the option to disable it to support edge cases, and add documentation about how to handle those edge cases.
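A rough sketch of what the attachment-processor route could look like. The helper names and the `_eml` field name are hypothetical, and this is not how the connector is implemented. It assumes the message was fetched from the Gmail API in `raw` format, which is base64url-encoded, while the attachment processor expects standard base64.

```python
import base64


def build_pipeline_body(source_field: str = "_eml", target_field: str = "attachment") -> dict:
    """Ingest pipeline body with a single attachment processor.

    The field names are assumptions for illustration.
    """
    return {
        "description": "Extract text from raw .eml content via Tika",
        "processors": [
            {
                "attachment": {
                    "field": source_field,
                    "target_field": target_field,
                    # Drop the base64 source field once extraction is done.
                    "remove_binary": True,
                }
            }
        ],
    }


def to_attachment_payload(gmail_raw: str) -> str:
    """Convert a base64url string (Gmail `raw` format) to standard base64."""
    # Restore any stripped padding before decoding.
    padded = gmail_raw + "=" * (-len(gmail_raw) % 4)
    return base64.b64encode(base64.urlsafe_b64decode(padded)).decode("ascii")
```

The pipeline body would be registered under a name of your choice, and each document would carry the output of `to_attachment_payload` in the source field. A setting that skips the pipeline would cover the "disable it" option for edge cases.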

danajuratoni commented 1 year ago

The Attachment Processor should extract any of the mail formats supported by Tika. Have you adjusted the supported file types to test extraction via pipelines?

seanstory commented 1 year ago

Looking at the code, I think this is actually already the output of the Attachment processor: https://github.com/elastic/connectors-python/blob/35bccad7fa1406c413fe3c870c25ce2db0faf64c/connectors/sources/gmail.py#L82-L100

CC @timgrein