MAIF / melusine

📧 Melusine: Use python to automatize your email processing workflow
https://maif.github.io/melusine

body extraction #39

Closed caronpe closed 4 years ago

caronpe commented 4 years ago

Hello! Your project looks awesome and I really want to try it. I see that your examples use very clean text for the email body. In "real" email life, the body content is very messy (HTML, encodings, formatting, multipart, different languages...). How do you handle that? (Or do you perhaps work only with your internal company emails?)

Faithfully

ghost commented 4 years ago

Melusine has the core code to handle emails in any language, as long as the user (you or me) provides the patterns (in English, in Spanish, or for a new mailbox we haven't seen yet).

Melusine was designed for company emails in French. Those emails already come in many different shapes because they arrive from several mailboxes.

The Melusine configuration provides many patterns for French emails. You can extend or replace them with English patterns. See the Melusine customization tutorial: https://github.com/MAIF/melusine/blob/master/tutorial/tutorial10_conf_file.ipynb

If you are familiar with regular expressions, you can copy the conf.json file to a custom_conf.json and adapt Melusine to your needs.

You can replace the regular expressions that work for French with ones that work for English.

An example in Python:

Add some custom patterns (regular expressions) to the "build_historic" part, which splits a thread into its individual messages:

import os
import json
from melusine.config.config import ConfigJsonReader
conf_melusine = ConfigJsonReader()
conf_melusine.reset_config_path()
conf_dict = conf_melusine.get_config_file()

add_to_build_historic = [
    r">?\s*The[^;\n]{0,30}[;|\n]{0,1}[^;\n]{0,30}at[^;\n]{0,30};{0,1}[^;\n]{0,30}written\s*:?.{,100}?(?:\n[A-Z][A-Za-z]{,2}:|>{3}).*?\n",
    r"^(?:From|at|The|Cci?|Object|Date|Subject): .*?\n\s*",
    r">+.{,70}Real address .*?\n\s*",
    r"^\*{3,}\s+",
    r"TR\s?:.*?\n",
    r"Fwd\s?:.*?\n",
    r"(?:Hello.{,10}\s*)You have contacted the .{5,80} Our response :",
]

conf_dict["regex"]["build_historic"]["transition_list"] = (
    add_to_build_historic + conf_dict["regex"]["build_historic"]["transition_list"]
)
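
To get a feel for what a transition pattern does, here is a minimal sketch that applies one of the patterns above to a made-up thread using only the standard `re` module. The sample text is hypothetical, and the use of `re.split` with `re.MULTILINE` is an assumption for illustration; how Melusine applies the patterns internally may differ.

```python
import re

# One of the transition patterns from the list above (header lines of a quoted message).
transition = r"^(?:From|at|The|Cci?|Object|Date|Subject): .*?\n\s*"

# Hypothetical sample thread, invented for this illustration.
thread = (
    "Thanks, I will send the form today.\n"
    "From: advisor@example.com\n"
    "Date: Monday 3 June\n"
    "Subject: Your claim\n"
    "Please send us the signed form.\n"
)

# Split wherever a header line occurs; re.MULTILINE lets ^ match at each line start.
# Empty fragments (between consecutive header lines) are dropped.
parts = [
    p.strip()
    for p in re.split(transition, thread, flags=re.MULTILINE)
    if p.strip()
]
print(parts)
# ['Thanks, I will send the form today.', 'Please send us the signed form.']
```

Trying a candidate pattern this way on a handful of real threads is a quick check before adding it to the configuration.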

To add a flagging pattern (postal addresses here):

conf_dict["regex"]["cleaning"]["flags_dict"][
    r"\s*[0-9]{1,4}\s*(?:street|avenue|boulevard|road)(?:\s|of){,5}(?:(?:\s|\,)?\b\w+\b(?:\s|\'|\,)?){,6}(?:(?:\s|\'|\,|\-)?(?:\b[A-Z]+\w+\b|flag_cp_)(?:\s|\'|\,|\-)?){,3}"
] = " flag_adress_ "
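
The mapping above pairs a regular expression with a replacement token. As a rough sketch, assuming each `flags_dict` entry behaves like a plain regex substitution (Melusine's own handling, for example of letter case, may differ), the effect can be previewed with `re.sub`. The sample sentence is hypothetical.

```python
import re

# Same pattern and replacement token as in the flags_dict entry above.
address_pattern = r"\s*[0-9]{1,4}\s*(?:street|avenue|boulevard|road)(?:\s|of){,5}(?:(?:\s|\,)?\b\w+\b(?:\s|\'|\,)?){,6}(?:(?:\s|\'|\,|\-)?(?:\b[A-Z]+\w+\b|flag_cp_)(?:\s|\'|\,|\-)?){,3}"
replacement = " flag_adress_ "

# Hypothetical sentence, invented for this illustration.
sentence = "I live at 12 avenue of the Republic."

# Replace every match of the pattern with the flag token.
flagged = re.sub(address_pattern, replacement, sentence)
print(flagged)
```

Note that this pattern expects the street number directly before the street type ("12 avenue ..."), so addresses written as "12 Baker Street" would need a different expression.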

To add footer patterns:

add_to_footer = [
    r"powered by .*?(?:(?:https?:\/\/)?(?:www\.)?[-a-zA-Z0-9:%._\\+~#=]{2,256}\.[a-zA-Z]{2,4}(?:[-a-zA-Z0-9:%_\\+.~#?&\/=]*)|\s|\b\w+\b|\(|\)){,12}",
    r"Ce message a [ée]t[ée] g[ée]n[ée]r[ée] automatiquement par [a-z-A-Z-0-9() .]{,50}",
    r"This e-mail and any attachments.*system.",
    r"This message may contains.{,80}electronic communication see",
    r"You also can consult.{5,250} you asked",
    r"This message was automatically generated by.*?\n"
]
conf_dict["regex"]["mail_segmenting"]["segmenting_dict"]["FOOTER"] = (
    add_to_footer + conf_dict["regex"]["mail_segmenting"]["segmenting_dict"]["FOOTER"]
)

To add a "HELLO" pattern:

add_to_hello = [r".{,40}happy?\s*new.{,30}", r"(?:hello|hi).{,20}"]
conf_dict["regex"]["mail_segmenting"]["segmenting_dict"]["HELLO"] = (
    add_to_hello + conf_dict["regex"]["mail_segmenting"]["segmenting_dict"]["HELLO"]
)

Finally, replace the default Melusine conf file with your custom conf file:

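# The CONF environment variable must point to a writable directory;
# os.environ["CONF"] raises KeyError if it is not set.
path_to_custom_melusine = os.path.join(os.environ["CONF"], "custom_melusine_conf.json")
# Write the modified configuration, keeping accented characters readable.
with open(path_to_custom_melusine, "w", encoding="utf-8") as jsonFile:
    json.dump(conf_dict, jsonFile, indent=4, ensure_ascii=False)
# Point Melusine at the custom file.
conf_melusine.set_config_path(file_path=path_to_custom_melusine)
print("Melusine custom file edited : ", path_to_custom_melusine)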

After that, most of the work lies in finding the right regular expressions.
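
Since finding the expressions is the bulk of the effort, a small harness for trying candidates against sample lines can help. The sketch below uses only the standard library; `check_patterns` is a hypothetical helper, not part of Melusine, and the sample lines are invented.

```python
import re


def check_patterns(patterns, samples):
    """Map each pattern to the sample lines it matches.

    re.compile raises re.error for an invalid pattern, so a typo is caught
    here rather than after the configuration file has been written.
    """
    report = {}
    for pattern in patterns:
        compiled = re.compile(pattern)
        report[pattern] = [s for s in samples if compiled.search(s)]
    return report


# Candidate "HELLO" patterns and hypothetical lower-cased sample lines.
candidate_hello = [r".{,40}happy?\s*new.{,30}", r"(?:hello|hi).{,20}"]
samples = [
    "hello mr smith,",
    "happy new year to all",
    "please find attached my invoice",
]

report = check_patterns(candidate_hello, samples)
for pattern, hits in report.items():
    print(pattern, "->", hits)
```

A pattern that matches lines it should not (or misses lines it should catch) shows up immediately in the report.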

TFA-MAIF commented 4 years ago

Hello @caronpe! Thank you!

To complete the answer: a lot of cleaning and formatting is already done by Melusine (text_to_lowercase, remove_accents, remove_line_break, remove_superior_symbol, remove_apostrophe, remove_multiple_spaces_and_strip_text, etc., plus build_historic to detect and extract the individual messages of a conversation with multiple replies and forwards).

The project is already used every day on the mail received by a French insurance company (15K/day), not only on internal company emails.
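
To give an idea of what such cleaning steps amount to, here is a rough standard-library sketch. It is not Melusine's implementation: `clean_text` is a hypothetical function, and the behaviour of each step (for instance, replacing apostrophes with a space) is an assumption inferred from the function names above.

```python
import re
import unicodedata


def clean_text(text):
    """Apply a few typical cleaning steps to an email body (illustrative only)."""
    # Lower-case the text.
    text = text.lower()
    # Remove accents: decompose characters, then drop the combining marks.
    text = "".join(
        c for c in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(c)
    )
    # Replace line breaks, quote markers (">") and apostrophes with spaces.
    for char in ("\r", "\n", ">", "'"):
        text = text.replace(char, " ")
    # Collapse runs of whitespace and strip both ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("> Bonjour,\n  j'ai été   remboursé ?")
print(cleaned)
# bonjour, j ai ete rembourse ?
```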

caronpe commented 4 years ago

Thank you for your long and very precise answer! Thanks to it, I have discovered more about your project and the features you provide.

I'm currently more at the data parsing step, which is what @TFA-MAIF's answer addresses: I'm using Python's email library but, as you say, "it's never perfect". Extracting clean text from emails can be a real nightmare...

I will follow your project and I hope to contribute!
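
For the parsing step mentioned here, a minimal sketch with the standard `email` package might look like the following. `extract_body` is a hypothetical helper and the sample message is invented; real-world mail (broken encodings, HTML-only bodies, nested multipart) usually needs more handling than this.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_body(raw_bytes):
    """Return the decoded text body of a raw email, preferring text/plain."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # get_body walks the MIME tree and picks the best candidate part.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return ""
    # get_content decodes the transfer encoding and charset to a str.
    return part.get_content()


# Build a hypothetical multipart/alternative message to exercise the helper.
sample = EmailMessage()
sample["Subject"] = "Claim"
sample.set_content("Café receipt attached.")
sample.add_alternative("<p>Café receipt <b>attached</b>.</p>", subtype="html")

body = extract_body(sample.as_bytes())
print(body.strip())
```

When only an HTML part is available, the returned string still contains markup, so a separate HTML-to-text step is needed before feeding it to a cleaning pipeline.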