coleifer / micawber

a small library for extracting rich content from urls
http://micawber.readthedocs.org/
MIT License
632 stars, 91 forks

To parse HTML, install BeautifulSoup #94

Closed: loleg closed this issue 3 years ago

loleg commented 3 years ago

We get this error for a reason we have not yet pinned down. Is there a hidden dependency on the BeautifulSoup package?

File "/app/.heroku/python/lib/python3.9/site-packages/micawber/contrib/mcflask.py", line 21, in _oembed
2020-10-25T09:55:22.161763+00:00 app[web.1]:     return oembed(s, providers, urlize_all, html, **params)
2020-10-25T09:55:22.161763+00:00 app[web.1]:   File "/app/.heroku/python/lib/python3.9/site-packages/micawber/contrib/mcflask.py", line 10, in oembed
2020-10-25T09:55:22.161763+00:00 app[web.1]:     return Markup(fn(s, providers, urlize_all, **params))
2020-10-25T09:55:22.161764+00:00 app[web.1]:   File "/app/.heroku/python/lib/python3.9/site-packages/micawber/parsers.py", line 137, in parse_html
2020-10-25T09:55:22.161764+00:00 app[web.1]:     raise Exception('Unable to parse HTML, please install BeautifulSoup '
2020-10-25T09:55:22.161764+00:00 app[web.1]: Exception: Unable to parse HTML, please install BeautifulSoup or beautifulsoup4, or use the text parser

loleg commented 3 years ago

I see this is mentioned deep in the docs: https://micawber.readthedocs.io/en/latest/api.html?highlight=Beautifulsoup#micawber.providers.ProviderRegistry.parse_html

Perhaps it should not be optional, since in our case we can't foresee which provider will be used.

coleifer commented 3 years ago

You only need an HTML-parsing library if you are using the parse_html methods (or, if using the template filter, you are explicitly setting html=True). Some people pass links directly to micawber, while others are only parsing plaintext, so it makes sense to make HTML parsing optional. Furthermore, the error message makes it extremely clear how to resolve the issue.
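
For anyone hitting the same traceback, here is a minimal sketch of one way to fall back to the text parser when no HTML-parsing library is installed. The helper names (html_parser_available, pick_parser_name, render) are hypothetical and not part of micawber. The module names checked, bs4 and BeautifulSoup, are assumed from the packages named in the error message, and this is not necessarily how micawber itself detects them.

```python
import importlib.util


def html_parser_available(candidates=("bs4", "BeautifulSoup")):
    """Return True if any candidate module can be found without importing it.

    Assumption: "bs4" is the import name of beautifulsoup4 and
    "BeautifulSoup" is the import name of the legacy BeautifulSoup 3 package.
    """
    return any(importlib.util.find_spec(name) is not None for name in candidates)


def pick_parser_name(want_html, available=None):
    """Choose 'parse_html' only when HTML was requested and a parser exists."""
    if available is None:
        available = html_parser_available()
    return "parse_html" if (want_html and available) else "parse_text"


def render(content, providers, want_html=False):
    """Hypothetical wrapper: run the chosen micawber parser over content."""
    import micawber  # imported lazily so the helpers above work without it

    parser = getattr(micawber, pick_parser_name(want_html))
    return parser(content, providers)
```

With something like this, render(text, micawber.bootstrap_basic(), want_html=True) would quietly use parse_text in an environment that lacks BeautifulSoup instead of raising. Whether a silent fallback is preferable to the explicit exception is a judgment call. Adding beautifulsoup4 to the app's dependencies is the simpler fix when HTML input is expected.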

loleg commented 3 years ago

I see, I thought this was provider dependent, because I could not reproduce it at first. But our issue also had to do with differences in the environments. Thanks for the insight!