coleifer / micawber

a small library for extracting rich content from urls
http://micawber.readthedocs.org/
MIT License
632 stars 91 forks source link

HTML parser doesn't deal with &amp #14

Closed pennersr closed 11 years ago

pennersr commented 11 years ago

Suppose you've got the following content:

Testing

http://picasaweb.google.com/lh/sredir?uname=test&target=ALBUM&id=123&authkey=abc

(Note: the link itself is not valid due to mangled IDs (it was a private album))

Rendering this content as follows will not work:

{{post.body|linebreaksbr|oembed_html}}

The reason is that the "&" has been escaped and turned into "&amp". The HTML parser over at https://github.com/coleifer/micawber/blob/master/micawber/parsers.py#L144 does recognize & extract the URL, but it does not unescape &amp. Hence, &amp is fed to embed.ly... resulting in a 404 over there.

coleifer commented 11 years ago

Hmm...

For starters, check the docs on the linebreaks filter (or the source). It appears that it accepts a parameter that controls what is escaped and will autoescape the text for you:

So you can simply not escape the stuff you pass into the filter.

But yeah, it probably needs to be able to unescape things. I will look into a fix.

pennersr commented 11 years ago

Thanks. The example I posted still does not seem to work, as in the end the example URL posted to (in my case) embed.ly is still as follows:

http://api.embed.ly/1/oembed?format=json&key=8af3ff804a6csecret&url=http%3A%2F%2Fpicasaweb.google.com%2Flh%2Fsredir%3Funame%3Dtest%26amp%3Btarget%3DALBUM%26amp%3Bid%3D123%26amp%3Bauthkey%3Dabc

Note that it contains amp. I didn't dive into it yet, though I can confirm that the amp is now correctly unescaped when extracted in parse_html ...

coleifer commented 11 years ago

Can you submit a testcase? Also note the way the testcases are stuctured so as not to make any network calls.

coleifer commented 11 years ago

You might also try putting "|oembed|linebreaksbr" instead.

pennersr commented 11 years ago

I'll dive into this, hopefully later today ...

(I do understand that I can workaround this by ordering filters differently, or by disabling auto escaping -- but the oembed_html tag is supposed to handle HTML so it should handle this case just fine)

coleifer commented 11 years ago

I hear ya, yeah -- if you can, could you also maybe get me:

  1. the initial text
  2. the text after piping through django
  3. the url that is sent off to embed.ly or processed by micawber

Thanks mane!

pennersr commented 11 years ago

Found the problem. As said, your change correctly turns & into &. However, the str(url) at this line ...

https://github.com/coleifer/micawber/blob/master/micawber/parsers.py#L130

... turns the ampersands back into & again.

The fix: change str(url) into url.string and all works fine.

coleifer commented 11 years ago

Cool, found it earlier too, was working on a testcase but got lazy... fix is merged in though.