Open: arvindpdmn opened this issue 4 years ago
BTW, I'm running code in a Jupyter Notebook. So, not sure how to disable asserts.
Did you find a solution?
Nope
As a workaround, maybe preprocess your bad HTML with pytidylib?
Here's a minimal reproducer:
>>> import html2text
>>> html2text.html2text('hi <img src> there')
Traceback (most recent call last):
...
File "/.../python-3.9.4+/lib/python3.9/site-packages/html2text/__init__.py", line 502, in handle_tag
assert attrs["src"] is not None
AssertionError
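To see why the assertion trips, here is a small sketch using only the standard library's html.parser, which the traceback shows html2text building on. The names AttrRecorder and parse_attrs are made up for this illustration and are not part of html2text. An attribute written without a value, like the bare src here, is reported with the value None, and that is exactly what the assert rejects.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when the
        # attribute appears without "=value", as in <img src>.
        self.seen.append((tag, dict(attrs)))


def parse_attrs(html_text):
    """Return a list of (tag, attrs_dict) for every start tag in html_text."""
    recorder = AttrRecorder()
    recorder.feed(html_text)
    recorder.close()
    return recorder.seen


print(parse_attrs('hi <img src> there'))
# prints [('img', {'src': None})]
```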
@wbolster and here is the corresponding workaround I mentioned earlier
>>> import html2text
>>> from tidylib import tidy_document
>>>
>>> document, errors = tidy_document('hi <img src> there')
>>> html2text.html2text(document)
'hi ![]() there\n\n'
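If installing tidylib is not an option, a rough alternative is to re-serialize the page with the standard library and give valueless attributes an empty string. This is only a sketch under assumptions: ValuelessAttrFixer and fix_valueless_attrs are hypothetical names, the output is not byte-identical to the input (attribute quoting is normalized), and I have only reasoned that src="" gets past the `is not None` check, not verified it against every html2text version.

```python
from html import escape
from html.parser import HTMLParser


class ValuelessAttrFixer(HTMLParser):
    """Hypothetical helper: rebuilds the HTML, turning <img src> into <img src="">."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as they were
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _render(self, tag, attrs, closing=">"):
        # The parser unescapes attribute values, so escape them again here.
        # A None value (attribute without "=value") becomes an empty string.
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
        )
        return "<{}{}{}".format(tag, rendered, closing)

    def handle_starttag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self._render(tag, attrs, " />"))

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&{};".format(name))

    def handle_charref(self, name):
        self.parts.append("&#{};".format(name))

    def handle_comment(self, data):
        self.parts.append("<!--{}-->".format(data))

    def handle_decl(self, decl):
        self.parts.append("<!{}>".format(decl))

    def handle_pi(self, data):
        self.parts.append("<?{}>".format(data))


def fix_valueless_attrs(html_text):
    """Return html_text re-serialized so that no attribute value is None."""
    fixer = ValuelessAttrFixer()
    fixer.feed(html_text)
    fixer.close()
    return "".join(fixer.parts)


print(fix_valueless_attrs('hi <img src> there'))
# prints hi <img src=""> there
```

The cleaned string can then be passed to html2text.html2text() in place of the raw page. Unlike tidy, this does not repair any other kind of broken markup.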
Some web pages have errors. Rather than simply throwing an exception, it would be better to ignore benign errors and convert as much of the page as possible.
html2text version (from html2text --version): 2020.1.16
Python version (from python --version): Python 3.7.7
Test script:
import requests
import html2text

rsp = requests.get('https://blog.logrocket.com/from-rest-to-graphql/')
h2t = html2text.HTML2Text()
h2t.ignore_links = True
h2t.bypass_tables = False
text = h2t.handle(rsp.text)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 142, in handle
    self.feed(data)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 139, in feed
    super().feed(data)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 171, in goahead
    k = self.parse_starttag(i)
  File "c:\users\arvindpdmn\miniconda3\lib\html\parser.py", line 345, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 191, in handle_starttag
    self.handle_tag(tag, dict(attrs), start=True)
  File "c:\users\arvindpdmn\miniconda3\lib\site-packages\html2text\__init__.py", line 502, in handle_tag
    assert attrs["src"] is not None
AssertionError