Alir3z4 / html2text

Convert HTML to Markdown-formatted text.
alir3z4.github.io/html2text/
GNU General Public License v3.0
1.76k stars, 270 forks

AssertionError for img src attribute #334

Open arvindpdmn opened 3 years ago

arvindpdmn commented 3 years ago

Some web pages contain malformed HTML. Rather than simply raising an exception, it would be better to ignore benign errors and convert as much of the page as possible.

import html2text
import requests

rsp = requests.get('https://blog.logrocket.com/from-rest-to-graphql/')
h2t = html2text.HTML2Text()
h2t.ignore_links = True
h2t.bypass_tables = False
text = h2t.handle(rsp.text)


- Log:

  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 142, in handle
    self.feed(data)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 139, in feed
    super().feed(data)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 171, in goahead
    k = self.parse_starttag(i)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 345, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 191, in handle_starttag
    self.handle_tag(tag, dict(attrs), start=True)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 502, in handle_tag
    assert attrs["src"] is not None
AssertionError

arvindpdmn commented 3 years ago

BTW, I'm running the code in a Jupyter Notebook, so I'm not sure how to disable asserts.
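
On disabling asserts from a notebook: Python strips `assert` statements only when the interpreter starts with `-O` (or with `PYTHONOPTIMIZE` set before the kernel launches), so it cannot be toggled inside an already running kernel. One possible route is to run the conversion in a child interpreter started with `-O`. The sketch below uses a stand-in one-line script rather than html2text itself; `SNIPPET` and `run_snippet` are hypothetical names for illustration. Whether html2text then copes with a `None` value for `src` further down in `handle_tag` has not been checked here.

```python
import subprocess
import sys

# Stand-in script: the assert always fails unless asserts are stripped.
SNIPPET = "assert False, 'boom'\nprint('asserts skipped')"


def run_snippet(optimize):
    """Run SNIPPET in a fresh interpreter, optionally with -O."""
    cmd = [sys.executable]
    if optimize:
        cmd.append("-O")  # -O removes assert statements at compile time
    cmd += ["-c", SNIPPET]
    return subprocess.run(cmd, capture_output=True, text=True)


normal = run_snippet(optimize=False)     # assert fires, exit status is non-zero
optimized = run_snippet(optimize=True)   # assert is skipped, print runs

print(normal.returncode != 0)
print(optimized.stdout.strip())
```

In a notebook, `sys.executable` points at the kernel's own Python, so the child process sees the same installed packages.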

rhlr commented 3 years ago

Did you find a solution?

arvindpdmn commented 3 years ago

Nope

jeremydouglass commented 3 years ago

As a workaround, maybe preprocess your bad html with pytidylib?

wbolster commented 3 years ago

here's a minimal reproducer:

>>> import html2text
>>> html2text.html2text('hi <img src> there')
Traceback (most recent call last):
...
  File "/.../python-3.9.4+/lib/python3.9/site-packages/html2text/__init__.py", line 502, in handle_tag
    assert attrs["src"] is not None
AssertionError
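
For context on why the assertion fires: the standard-library `html.parser.HTMLParser` reports an attribute written without a value, such as the bare `src` in the reproducer, with the value `None`. The sketch below uses only the standard library (no html2text) to show what the parser passes to `handle_starttag`; `AttrCollector` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record every start tag together with its attributes as a dict."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute has no "=value" part in the markup.
        self.seen.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed('hi <img src> there')
collector.close()

print(collector.seen)  # [('img', {'src': None})]
```

So `attrs["src"]` is present but `None`, which is exactly the case that `assert attrs["src"] is not None` rejects.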

jeremydouglass commented 3 years ago

@wbolster and here is the corresponding workaround I mentioned earlier:

>>> import html2text
>>> from tidylib import tidy_document
>>> 
>>> document, errors = tidy_document('hi <img src> there')
>>> html2text.html2text(document)
'hi ![]() there\n\n'
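
If installing pytidylib is not an option, a narrower preprocessing step might be enough: rewrite a valueless `src` on `<img>` tags to `src=""` before handing the text to html2text. The sketch below is standard library only and regex-based, so it is an approximation rather than a real HTML parser; `fill_empty_img_src`, `IMG_TAG`, and `ATTR` are hypothetical names, and the result has not been run through html2text here. An empty string should satisfy the `is not None` check shown in the traceback.

```python
import re

# One <img ...> tag; quoted attribute values may themselves contain '>'.
IMG_TAG = re.compile(r'''<img\b(?:"[^"]*"|'[^']*'|[^>"'])*>''', re.IGNORECASE)
# One attribute: a name, optionally followed by "=" and a (possibly empty) value.
ATTR = re.compile(r'''([^\s"'=<>/]+)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]*))?''')


def fill_empty_img_src(html):
    """Give every valueless src attribute on an <img> tag an empty value."""

    def fix_attr(match):
        name, value = match.group(1), match.group(2)
        if name.lower() == "src" and value is None:
            return name + '=""'
        return match.group(0)  # leave every other attribute untouched

    def fix_tag(match):
        tag = match.group(0)
        # tag[:4] is "<img" and tag[-1] is ">"; only the middle is rewritten.
        return tag[:4] + ATTR.sub(fix_attr, tag[4:-1]) + ">"

    return IMG_TAG.sub(fix_tag, html)


print(fill_empty_img_src('hi <img src> there'))  # hi <img src=""> there
```

The cleaned string would then be passed to `html2text.html2text(...)` in place of the raw page text.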