Closed LennyLip closed 9 years ago
That's odd, on my machine I get:
In [5]: parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Out[5]: u'<p><iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" frameborder="0" allowfullscreen="allowfullscreen"></iframe></p>'
This is with BS 3.2.1, so I'm guessing BS4 is the problem.
I installed bs4 and micawber in a virtualenv and even with bs4 I still get the correct HTML:
In [6]: parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Out[6]: u'<p><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></p>'
In [7]: from micawber.parsers import BeautifulSoup
In [8]: BeautifulSoup.__module__
Out[8]: 'bs4'
# python
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import micawber
>>> providers = micawber.bootstrap_basic()
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<html><body><p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p></body></html>'
>>> from micawber.parsers import BeautifulSoup
>>> BeautifulSoup.__module__
'bs4'
>>> soup = BeautifulSoup('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>')
>>> str(soup)
'<html><body><p>http://www.youtube.com/watch?v=54XHDUOHuzU</p></body></html>'
>>> unicode(soup)
u'<html><body><p>http://www.youtube.com/watch?v=54XHDUOHuzU</p></body></html>'
I think the problem is similar to these:
http://stackoverflow.com/questions/14822188/dont-put-html-head-and-body-tags-automatically-beautifulsoup
http://stackoverflow.com/questions/15980757/how-to-prevent-beautifulsoup4-from-adding-extra-htmlbody-tags-to-the-soup
OK, what about "parse_text"?
>>> soup = BeautifulSoup('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', "html.parser")
>>> str(soup)
'<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>'
>>> soup = BeautifulSoup('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>')
>>> str(soup)
'<html><body><p>http://www.youtube.com/watch?v=54XHDUOHuzU</p></body></html>'
>>> str(soup.body.next)
'<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>'
BeautifulSoup Version: 4.3.2
Nice find, thanks! Added explicit html.parser argument for bs4.
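A minimal sketch of what an explicit parser argument can look like, assuming bs4 is installed. `make_soup` is a hypothetical helper for illustration, and `bs_kwargs` is borrowed from the `parse_html` snippet quoted in this thread; this is not necessarily how micawber itself wires it up.

```python
from bs4 import BeautifulSoup

# Keyword arguments forwarded to every BeautifulSoup() call. Naming the
# parser explicitly keeps bs4 from picking one that wraps fragments in
# <html><body>.
bs_kwargs = {'features': 'html.parser'}


def make_soup(markup):
    """Hypothetical helper: parse an HTML fragment with the explicit parser."""
    return BeautifulSoup(markup, **bs_kwargs)


soup = make_soup('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>')
print(str(soup))
```

With `html.parser` the fragment should come back without the extra wrapper tags, matching the session above.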
It did not help for me :(
I've added two asserts to the function:
def parse_html(html, providers, urlize_all=True, handler=full_handler, block_handler=inline_handler, **params):
    if not BeautifulSoup:
        raise Exception('Unable to parse HTML, please install BeautifulSoup or use the text parser')
    soup = BeautifulSoup(html, **bs_kwargs)
    assert False, str(soup) # 1st assert
    for url in soup.findAll(text=re.compile(url_re)):
        if not _inside_skip(url):
            if _is_standalone(url):
                url_handler = handler
            else:
                url_handler = block_handler
            url_unescaped = url.string
            replacement = parse_text_full(url_unescaped, providers, urlize_all, url_handler, **params)
            url.replaceWith(BeautifulSoup(replacement))
    assert False, str(soup) # 2st assert
1st assert:
>>> import micawber
>>> providers = micawber.bootstrap_basic()
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/micawber/parsers.py", line 130, in parse_html
    assert False, str(soup) # 1st asset
AssertionError: <p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>
2nd assert:
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/micawber/parsers.py", line 143, in parse_html
    assert False, str(soup) # 2st assert
AssertionError: <p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p>
Then I added "html.parser" to the second BeautifulSoup call ( https://github.com/coleifer/micawber/blob/59ba3c03b0df42f010cb47ed61f0b92db6dbde20/micawber/parsers.py#L138 ):
url.replaceWith(BeautifulSoup(replacement, "html.parser"))
New second assertion:
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/micawber/parsers.py", line 143, in parse_html
    assert False, str(soup) # 2st assert
AssertionError: <p><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></p>
Then it started working fine:
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<p><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></p>'
Ohh, nice catch, I guess we need those kwargs in the second part too. Thanks.
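The point about the second part can be sketched roughly like this: the same parser kwargs have to go to both the outer soup and the replacement fragment, otherwise the fragment gets its own html/body wrapper. This is a simplified stand-in and not micawber's actual code: `BS_KWARGS`, `replace_urls`, `fake_render`, and the URL regex are assumptions made for illustration, and `find_all(string=...)` is the newer bs4 spelling of `findAll(text=...)`. It assumes bs4 is installed.

```python
import re

from bs4 import BeautifulSoup

# Assumed name: one set of parser kwargs shared by every BeautifulSoup() call.
BS_KWARGS = {'features': 'html.parser'}

# Deliberately simple URL pattern, for illustration only.
URL_RE = re.compile(r'https?://\S+')


def fake_render(url):
    """Hypothetical stand-in for the oEmbed lookup: URL in, embed HTML out."""
    return '<iframe src="%s"></iframe>' % url


def replace_urls(html, render):
    """Swap each text node containing a URL for the rendered embed markup."""
    # First parse: the document fragment itself.
    soup = BeautifulSoup(html, **BS_KWARGS)
    for node in soup.find_all(string=URL_RE):
        replacement = render(str(node))
        # Second parse: the replacement fragment needs the same kwargs,
        # or it may come back wrapped in its own <html><body> tags.
        node.replace_with(BeautifulSoup(replacement, **BS_KWARGS))
    return str(soup)


print(replace_urls('<p>http://example.com/video</p>', fake_render))
```

Keeping the kwargs in one shared dict means a later parser change only has to be made in one place.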
Why are there html and body tags? I do not need them.
I don't want a link; I want an iframe, etc., as in the docs, even when I have other tags in the text.
I use bs4, but why is it not listed in the docs as a dependency?
P.S. Python 2.7.3 (default, Mar 13 2014, 11:03:55)