SpamScope / mail-parser

Tokenizer for raw mails
https://pypi.python.org/pypi/mail-parser
Apache License 2.0

Handle multipart/alternative text emails? #111

CaptainDriftwood closed 1 week ago

CaptainDriftwood commented 2 years ago

Is your feature request related to a problem? Please describe.
I'm having trouble parsing email bodies that are multipart.

Describe the solution you'd like
To correctly parse multipart email bodies.

Describe alternatives you've considered
Simply falling back to the standard library email parsing functionality.
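
For reference, the standard-library fallback mentioned above could look roughly like the sketch below. It uses only Python's `email` package, not mail-parser; the sample message `RAW` and the helper name `best_text` are made up for illustration.

```python
from email import message_from_string, policy

# Hypothetical sample: one body offered as both plain text and HTML.
RAW = """\
From: a@example.com
To: b@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="BOUND"

--BOUND
Content-Type: text/plain; charset="utf-8"

Hello in plain text.
--BOUND
Content-Type: text/html; charset="utf-8"

<p>Hello in <b>HTML</b>.</p>
--BOUND--
"""


def best_text(raw):
    """Return (content_type, text) for the preferred body part, or None."""
    # policy.default makes the parser return EmailMessage objects,
    # which provide get_body() and get_content().
    msg = message_from_string(raw, policy=policy.default)
    # Prefer the plain-text alternative, fall back to HTML.
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return None
    return part.get_content_type(), part.get_content()


if __name__ == "__main__":
    print(best_text(RAW))
```

Note that `get_body()` returns a single best candidate, so it does not by itself say which plain part belongs to which HTML part.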

lebigot commented 1 year ago

I have the same issue. It would be very useful to know when a .text_plain item is simply the plain-text version of a .text_html item. Without that, it is hard to extract, say, a text version of a message: often there are both a plain and an HTML version; sometimes there is only the HTML version; and in principle there could be some plain text followed by some unrelated HTML, if I'm not mistaken.

Maybe a solution would be to add a .text attribute with the following structure?

[
  (text_plain0, alternative_text_html0),
  (None, unique_text_html1),  # No plain alternative
  (unique_text_plain2, None),  # No HTML alternative
  …
]
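
A rough sketch of how such a list of pairs might be built, assuming only the standard-library `email` package. Nothing here is part of mail-parser: `paired_text`, `_first_text` and the sample message are hypothetical names used for illustration, and the pairing rule (one pair per multipart/alternative container) is an assumption about what "alternative" should mean.

```python
from email import message_from_string, policy


def _first_text(part, subtype):
    """Content of the first non-attachment text/<subtype> under part, or None."""
    for sub in part.walk():
        if sub.get_content_type() == f"text/{subtype}" and not sub.is_attachment():
            return sub.get_content()
    return None


def paired_text(part):
    """Return a list of (plain, html) tuples in document order."""
    ctype = part.get_content_type()
    if ctype == "multipart/alternative":
        # Assumption: everything inside one alternative container
        # is a rendering of the same content.
        return [(_first_text(part, "plain"), _first_text(part, "html"))]
    if part.is_multipart():
        pairs = []
        for child in part.iter_parts():
            pairs.extend(paired_text(child))
        return pairs
    if part.is_attachment():
        return []
    if ctype == "text/plain":
        return [(part.get_content(), None)]  # no HTML alternative
    if ctype == "text/html":
        return [(None, part.get_content())]  # no plain alternative
    return []


# Hypothetical sample: an alternative pair followed by unrelated HTML.
SAMPLE = """\
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="OUTER"

--OUTER
Content-Type: multipart/alternative; boundary="INNER"

--INNER
Content-Type: text/plain; charset="utf-8"

plain body
--INNER
Content-Type: text/html; charset="utf-8"

<p>html body</p>
--INNER--

--OUTER
Content-Type: text/html; charset="utf-8"

<p>footer</p>
--OUTER--
"""

if __name__ == "__main__":
    message = message_from_string(SAMPLE, policy=policy.default)
    for plain, html in paired_text(message):
        print(repr(plain), repr(html))
```

With the sample above, the first tuple carries both versions of the body and the second carries only the unrelated HTML, which matches the shape proposed in the list.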

fedelemantuano commented 1 week ago

To review in the new version. I will open a new issue if needed.