Open jheasly opened 11 years ago
htmlmin uses BeautifulSoup, and BeautifulSoup does this conversion automatically.
Actually, it is html5lib that does this automatic conversion.
We can add a parameter to set which parser is used for minification.
Or fix this behavior.
Hrm. Yeah, even telling BeautifulSoup to use the 'lxml' parser here (and here; not sure why soup gets set twice) instead of 'html5lib' doesn't help much. The output still gets wrapped:
u'<html><body><p>boo</p></body></html>'
But at least the lxml parser leaves out the <head> tags!
Being able to set a parameter would be really nice, but the parsers seem to have other ideas! In any event, thanks for your reply.
"html.parser" doesn't do that, but it breaks a lot of tests. Since the tags are at the beginning and at the end of the code, I think removing them wouldn't damage the performance.
What do you guys think should be done?
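For anyone who wants to check the "html.parser" claim in isolation, here is a minimal sketch using only Python's standard-library parser, without BeautifulSoup or htmlmin. `TagRecorder` and `record_events` are illustrative names made up for this example, and it only shows what the bare parser reports, not how BeautifulSoup builds its tree on top of it.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Record, in order, the events the standard-library parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def record_events(markup):
    """Feed markup to a fresh TagRecorder and return the recorded events."""
    recorder = TagRecorder()
    recorder.feed(markup)
    recorder.close()  # flush any text still buffered by the parser
    return recorder.events


# Only the <p> element from the input shows up; nothing is synthesized.
print(record_events("<p>boo</p>"))
```

The recorded events contain only the tags present in the input, which matches the observation that this parser does not add `<html>`, `<head>`, or `<body>` on its own.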
One way is to create our own HTML5 parser.
That's a great idea :+1:
wat. No it isn’t. Writing an HTML parser is HARD. Do you seriously think creating, let alone maintaining, another one is realistic and smart?
Your problem is that html5lib (and BeautifulSoup) were made for the use case of documents. Fragments are not valid HTML documents, so they trigger automatic element insertion. On serialization, you get a complete (valid) HTML5 document. The question is why you are sending HTML fragments as text/html when your response is not an HTML document. Just send a different Content-Type and you’ll be fine.
In my use case, other processing happens outside html_minify before the document is assembled/output. I use BeautifulSoup all the time in less-than-document situations and it doesn't insist the output be a document.
If html5lib is strictly for docs, it's odd that its GitHub page says "Standards-compliant library for parsing and serializing HTML documents and fragments in Python" (emphasis added).
I experienced that too. For fragments, I've used the htmlmin package instead:
>>> html = "boo"
>>> import htmlmin
>>> print(htmlmin.minify(html))
boo
Prerequisite: pip install htmlmin
This bug and e6594376c13 bit us recently. When we return errors, we just want them to be plain text. Now django-htmlmin is wrapping that text with HTML tags. I realize our approach isn't the best one, but I'd still like to ask about progress on this issue. Is there any option we could use in our Django project so that django-htmlmin would stop wrapping text with HTML tags?
This makes it very difficult to use django-htmlmin - would love to see a fix!
@moeffju htmlmin explicitly requires Content-Type to be text/html before minifying, so that is not a solution.
@osamak I might have been unclear in my comment. htmlmin is not ready-made for the use case of serializing to fragments. It can parse fragments just fine, and in fact a lot of tags are optional for proper parsing (see https://html.spec.whatwg.org/multipage/syntax.html#syntax-tag-omission). It’s just that BeautifulSoup and html_minify do not walk the tree themselves for serialization, so they get the default behavior. html5lib can parse and deparse fragments fine, as evidenced by https://html5lib.readthedocs.io/en/latest/movingparts.html#htmlserializer
Without running the code, I believe the problem stems from https://github.com/cobrateam/django-htmlmin/blob/master/htmlmin/minify.py#L59: if you serialize the html5lib object tree manually instead of casting to string, it should be just fine. But http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1119 does not seem to expose that option, so you might need to do it manually.
I cannot validate this hypothesis because I don’t have a working Python environment handy.
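The idea of serializing the tree manually can be sketched with html5lib on its own (a third-party package, `pip install html5lib`). This bypasses BeautifulSoup and htmlmin entirely, so it only illustrates that html5lib can round-trip a fragment without a wrapper. Whether it can be wired into minify.py is an open assumption, and `roundtrip_fragment` is a hypothetical name used for this example only.

```python
import html5lib  # third-party: pip install html5lib


def roundtrip_fragment(markup):
    """Parse markup as a fragment and serialize it without a document wrapper."""
    # parseFragment builds a fragment tree, so no <html>, <head>, or <body>
    # elements are synthesized around the input.
    fragment = html5lib.parseFragment(markup)
    # omit_optional_tags=False keeps closing tags such as </p> in the output.
    return html5lib.serialize(fragment, omit_optional_tags=False)


print(roundtrip_fragment("<p>boo</p>"))
```

The sketch uses `parseFragment` rather than `parse` because `parse` always produces a full document tree, which is where the wrapper elements come from.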
@osamak Also “requires the Content-Type text/html” is exactly my point: If you are delivering content with Content-Type text/html then your content should be a complete HTML document. A fragment isn’t, so you should not send it with Content-Type text/html.
Is there any solution for this? I am getting the same issue, so it's clear that it's not fixed in the currently available version.
I'm minifying a fragment of an HTML doc and html_minify prepends with <html><head></head><body> and appends a </body></html>, neither of which I want. For example:
Is there a way to turn this behavior off?
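Until the behavior is configurable, one possible workaround is to strip the wrapper after minification. The sketch below is plain string handling, not part of django-htmlmin. `strip_document_wrapper` is a hypothetical helper, and it assumes the wrapper looks exactly like the two shapes reported in this thread (with `<head></head>` from html5lib, without it from lxml).

```python
import re

# Wrapper shapes reported in this thread (assumption: no attributes on the tags).
_WRAPPER_RE = re.compile(
    r"\A\s*<html>(?:<head></head>)?<body>(?P<inner>.*)</body></html>\s*\Z",
    re.DOTALL | re.IGNORECASE,
)

# Tags that indicate the caller passed a real document, not a fragment.
_DOCUMENT_TAG_RE = re.compile(r"<(?:html|head|body)\b", re.IGNORECASE)


def strip_document_wrapper(original, minified):
    """Hypothetical helper: undo the auto-inserted wrapper for fragment input."""
    # If the input already was a document, the wrapper is legitimate; keep it.
    if _DOCUMENT_TAG_RE.search(original):
        return minified
    match = _WRAPPER_RE.match(minified)
    # Return the inner markup when a wrapper was found, else the text unchanged.
    return match.group("inner") if match else minified


wrapped = "<html><head></head><body><p>boo</p></body></html>"
print(strip_document_wrapper("<p>boo</p>", wrapped))
```

In a Django project this would be called with the original fragment and the output of html_minify. It treats the symptom only, so output that deviates from the two wrapper shapes above passes through unchanged.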