attilabalazsy closed this 3 months ago
I've been playing around with adding new patterns. You can do it by importing the constants and adding your own language:
from mailparser_reply import EmailReplyParser
from mailparser_reply.constants import MAIL_LANGUAGES
# new dict with 'wrote_header', 'from_header', 'disclaimers', 'signatures', 'sent_from' keys that have patterns
MAIL_LANGUAGES["my_lang"] = {"from_header": "AAAAAAAAAAAA"}
mail_message = EmailReplyParser(default_language="my_lang")
This approach feels wrong, but it's currently the only option. The API doesn't allow custom regexes and uses hardcoded constants.
To add a new language permanently, you have to open a pull request that updates the library (constants.py) to support it.
See previous pull requests for updating the language: https://github.com/alfonsrv/mail-parser-reply/pull/5, https://github.com/alfonsrv/mail-parser-reply/pull/6, https://github.com/alfonsrv/mail-parser-reply/pull/7, https://github.com/alfonsrv/mail-parser-reply/pull/8
Alternatively, you can use a technique called monkey patching to add your language temporarily at runtime.
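A minimal sketch of the monkey-patching route, using `unittest.mock.patch.dict` from the standard library so that the extra entry is removed again afterwards. The `MAIL_LANGUAGES` dict below is a small stand-in so the snippet runs on its own (in real use you would import it from `mailparser_reply.constants`), and `my_lang` with its pattern is just a placeholder.

```python
from unittest.mock import patch

# Stand-in for mailparser_reply.constants.MAIL_LANGUAGES (assumption: a plain
# dict keyed by language code). Import the real one in actual use.
MAIL_LANGUAGES = {"en": {"sent_from": "Sent from"}}

# Placeholder language definition; the key mirrors the one used above.
MY_LANG = {"from_header": "AAAAAAAAAAAA"}


def languages_during_patch():
    """Return the registered language codes while the patch is active."""
    with patch.dict(MAIL_LANGUAGES, {"my_lang": MY_LANG}):
        # Inside the block, any code that looks up MAIL_LANGUAGES
        # sees the additional language.
        return sorted(MAIL_LANGUAGES)


inside = languages_during_patch()
after = sorted(MAIL_LANGUAGES)  # the patch has been undone at this point
```

The parser would have to be created and used inside the `with` block. If the language should stay registered for the whole process, a plain assignment to the dict as shown earlier does the same without the cleanup.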
Thanks for the hints. I've prepared the regex filters needed for Portuguese and will test them.
BTW: How is the library supposed to work? Does it try all possible languages (or only the selected ones)? Or does it try only the default language?
It applies the regexes of every language passed to the parser upon initialisation. So technically, the more languages you initialise it with, the higher the probability of false positives.
However, if you just use it to separate replies from each other, using all languages should be quite reliable.
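To illustrate the false-positive point, here is a rough sketch. It is not the library's implementation: the two header patterns are heavily simplified, hypothetical stand-ins, and `latest_reply` is a made-up helper that cuts the text at the earliest header of any selected language.

```python
import re

# Hypothetical, heavily simplified "wrote" headers; NOT the library's patterns.
SIMPLE_HEADERS = {
    "en": r"^On\s.+?\swrote:$",
    "de": r"^Am\s.+?\sschrieb.*:$",
}


def latest_reply(text, languages):
    """Cut the text at the earliest header of any selected language."""
    cut = len(text)
    for lang in languages:
        match = re.search(SIMPLE_HEADERS[lang], text, flags=re.MULTILINE)
        if match:
            cut = min(cut, match.start())
    return text[:cut].strip()


mail = (
    "Sounds good.\n"
    "Am Montag schrieb ich das Protokoll:\n"  # the author's own sentence
    "- item one\n"
    "On Tuesday Bob wrote:\n"
    "> ok"
)

english_only = latest_reply(mail, ["en"])
with_german = latest_reply(mail, ["en", "de"])
```

With only `en` selected, the reply keeps the sentence starting with "Am Montag". Once `de` is added, that sentence is mistaken for a quote header and the reply is cut short after "Sounds good."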
Thanks for your help. I only need to parse out the latest reply, so that is fine. Here is my PT localization. I do not speak Portuguese, just used some sample emails.
MAIL_LANGUAGES["pt"] = { 'wrote_header': r'^(?!Em.Em\s.+?escreveu.:)', # this may not be correct 'from_header': r"((?:(?:^|\n|\n(?:> ?))[ ](?:De|Enviado\sel|Para|Asunto|CC):(?:\s{0,2}).){2,}(?:\n.*){,1})", 'disclaimers': [ 'AVISO', ], 'signatures': [ r'Melhores cumprimentos', r'Com os melhores cumprimentos,', r'Atenciosamente',#Sincerely r'Atentamente', #Sincerely r'Cordialmente', #Cordially r'Grat(o|a)', #Grateful r'Obrigad(o|a)', #Thank you r'Cumprimentos', #Regards ], 'sent_from': 'Enviado de' }
Is it possible to define a new language at runtime? If so, how?