Better linebreak parsing with msg.get_body_text()

johanovic commented 4 years ago

I regularly extract the text of HTML messages. The current parsing method (below) fails to insert line breaks where one would expect them. Could this be improved? I could do this directly in lxml (with itertext), but it might be a good enhancement for the library as a whole.

def get_body_text(self):
    """ Parse the body html and returns the body text using bs4

    :return: body as text
    :rtype: str
    """
    if self.body_type.upper() != 'HTML':
        return self.body

    try:
        soup = bs(self.body, 'html.parser')
    except RuntimeError:
        return self.body
    else:
        return soup.body.text

alejcas commented 4 years ago

The text extraction is done by the beautifulsoup4 library. I don't want to add lxml or any other dependency, so...

Do you have any proposal on how to achieve this?

tylerlittlefield commented 5 months ago

This would be a really nice enhancement; I am experiencing the same thing. For those coming to this issue, you can try the following workaround:

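# Assumes `inbox` is an already-authenticated O365 mailbox folder.
message = inbox.get_message("<SOME EMAIL ID>")
soup = message.get_body_soup()
delimiter = "\n\n"
# Replace every <br> tag with the delimiter so the break survives get_text().
for line_break in soup.find_all('br'):
    line_break.replace_with(delimiter)
text = soup.get_text()
print(text)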

Source: https://stackoverflow.com/a/61423104/7362046
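
The workaround above only touches <br> tags, so text in neighbouring block-level elements such as <p> or <div> can still run together. Below is a minimal sketch of one way to get a line break per block without adding any dependency, using only the standard library's html.parser. The helper name html_to_text and the BLOCK_TAGS set are hypothetical, chosen for illustration; this is not how python-o365 or beautifulsoup4 implement text extraction.

```python
from html.parser import HTMLParser

# Tags treated as line-breaking. This set is an illustrative assumption,
# not something taken from python-o365 or beautifulsoup4.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextWithBreaks(HTMLParser):
    """Collect text nodes, emitting a newline around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Hypothetical helper: turn an HTML string into text, one line per block."""
    parser = _TextWithBreaks()
    parser.feed(html)
    parser.close()
    # Strip each line and drop empty ones so adjacent block tags
    # do not produce runs of blank lines.
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

It could be called as html_to_text(message.body) on an HTML message. Note that this sketch does not skip the contents of <script> or <style> elements, so those would need extra handling for real-world mail.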

O365 / python-o365

Better linebreak parsing with msg.get_body_text() #482