coleifer / micawber

a small library for extracting rich content from urls
http://micawber.readthedocs.org/
MIT License

parse_html overhead #40

Closed LennyLip closed 9 years ago

LennyLip commented 9 years ago
>>> import micawber
>>> providers = micawber.bootstrap_basic()
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<html><body><p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p></body></html>'

Why are the html and body tags there? I do not need them.

>>> micawber.parse_text('http://www.youtube.com/watch?v=54XHDUOHuzU', providers)
u'<iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" frameborder="0" allowfullscreen></iframe>'
>>> micawber.parse_text('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<p><a href="http://www.youtube.com/watch?v=54XHDUOHuzU" title="Future Crew - Second Reality demo - HD">Future Crew - Second Reality demo - HD</a></p>'

I don't want a link; I want the iframe, as in the docs, even when the text contains other tags.

I use bs4, but why isn't it listed in the docs as a dependency?

P.S. Python 2.7.3 (default, Mar 13 2014, 11:03:55)

coleifer commented 9 years ago

That's odd, on my machine I get:

In [5]: parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Out[5]: u'<p><iframe width="459" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" frameborder="0" allowfullscreen="allowfullscreen"></iframe></p>'

This is with BS 3.2.1, so I'm guessing BS4 is the problem.

coleifer commented 9 years ago

I installed bs4 and micawber in a virtualenv and even with bs4 I still get the correct HTML:

In [6]: parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Out[6]: u'<p><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></p>'

In [7]: from micawber.parsers import BeautifulSoup

In [8]: BeautifulSoup.__module__
Out[8]: 'bs4'

LennyLip commented 9 years ago
# python
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import micawber
>>> providers = micawber.bootstrap_basic()
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<html><body><p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p></body></html>'

>>> from micawber.parsers import BeautifulSoup
>>> BeautifulSoup.__module__
'bs4'
>>> soup = BeautifulSoup('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>')
>>> str(soup)
'<html><body><p>http://www.youtube.com/watch?v=54XHDUOHuzU</p></body></html>'
>>> unicode(soup)
u'<html><body><p>http://www.youtube.com/watch?v=54XHDUOHuzU</p></body></html>'

I think the problem is similar to these:

http://stackoverflow.com/questions/14822188/dont-put-html-head-and-body-tags-automatically-beautifulsoup
http://stackoverflow.com/questions/15980757/how-to-prevent-beautifulsoup4-from-adding-extra-htmlbody-tags-to-the-soup

OK, but what about "parse_text"?

LennyLip commented 9 years ago
>>> soup = BeautifulSoup('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', "html.parser")
>>> str(soup)
'<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>'
>>> soup = BeautifulSoup('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>')
>>> str(soup)
'<html><body><p>http://www.youtube.com/watch?v=54XHDUOHuzU</p></body></html>'
>>> str(soup.body.next)
'<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>'

BeautifulSoup Version: 4.3.2

coleifer commented 9 years ago

Nice find, thanks! Added explicit html.parser argument for bs4.

LennyLip commented 9 years ago

It did not help for me :(

I added two asserts to the function:

def parse_html(html, providers, urlize_all=True, handler=full_handler, block_handler=inline_handler, **params):
    if not BeautifulSoup:
        raise Exception('Unable to parse HTML, please install BeautifulSoup or use the text parser')

    soup = BeautifulSoup(html, **bs_kwargs)

    assert False, str(soup) # 1st assert

    for url in soup.findAll(text=re.compile(url_re)):
        if not _inside_skip(url):
            if _is_standalone(url):
                url_handler = handler
            else:
                url_handler = block_handler

            url_unescaped = url.string
            replacement = parse_text_full(url_unescaped, providers, urlize_all, url_handler, **params)
            url.replaceWith(BeautifulSoup(replacement))

    assert False, str(soup) # 2st assert

1st assert:

>>> import micawber
>>> providers = micawber.bootstrap_basic()
>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/micawber/parsers.py", line 130, in parse_html
    assert False, str(soup) # 1st asset
AssertionError: <p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>

2nd assert:

>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/micawber/parsers.py", line 143, in parse_html
    assert False, str(soup) # 2st assert
AssertionError: <p><html><body><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></body></html></p>

Then I added "html.parser" to the second BeautifulSoup call ( https://github.com/coleifer/micawber/blob/59ba3c03b0df42f010cb47ed61f0b92db6dbde20/micawber/parsers.py#L138 ):

url.replaceWith(BeautifulSoup(replacement, "html.parser" ))

The 2nd assert now gives:

>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/micawber/parsers.py", line 143, in parse_html
    assert False, str(soup) # 2st assert
AssertionError: <p><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></p>

With both asserts removed, it works fine:

>>> micawber.parse_html('<p>http://www.youtube.com/watch?v=54XHDUOHuzU</p>', providers)
u'<p><iframe allowfullscreen="" frameborder="0" height="344" src="http://www.youtube.com/embed/54XHDUOHuzU?feature=oembed" width="459"></iframe></p>'

coleifer commented 9 years ago

Ohh, nice catch, I guess we need those kwargs in the second part too. Thanks.
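
For anyone landing here later, the fix discussed above can be sketched roughly as follows. This is not micawber's actual code, only a minimal illustration of the pattern: the same parser arguments are passed to both BeautifulSoup calls, the one that parses the document and the one that parses the replacement fragment. The names BS_KWARGS, URL_RE, replace_urls and render are hypothetical, and the URL pattern is deliberately simplified.

```python
import re

try:
    from bs4 import BeautifulSoup
except ImportError:  # bs4 is a third-party package and may not be installed
    BeautifulSoup = None

# Assumption: one shared set of constructor arguments, reused for every soup.
BS_KWARGS = {"features": "html.parser"}

# Simplified, hypothetical URL pattern (not the one micawber uses).
URL_RE = re.compile(r"https?://\S+")


def replace_urls(html, render):
    """Replace each URL text node in `html` with the markup from render(url)."""
    if BeautifulSoup is None:
        raise RuntimeError("bs4 is required for this sketch")

    # First soup: the document itself.
    soup = BeautifulSoup(html, **BS_KWARGS)

    # find_all returns a list, so the tree can be modified while looping.
    for node in soup.find_all(string=URL_RE):
        # Second soup: the replacement fragment. Parsing it with the same
        # kwargs avoids a parser that wraps fragments in <html><body>.
        fragment = BeautifulSoup(render(str(node)), **BS_KWARGS)
        node.replace_with(fragment)

    return str(soup)


if __name__ == "__main__" and BeautifulSoup is not None:
    # `render` stands in for the oEmbed lookup; here it returns fixed markup.
    print(replace_urls("<p>http://example.com/v</p>",
                       lambda url: "<b>embedded</b>"))
```

With "html.parser" in both places, the result should contain no html or body wrapper, neither around the document nor around the replaced fragment. Dropping the kwargs from the second call is what produced the nested wrapper in the 2nd assert above.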