html_minify wraps result with unwanted '<html><head></head><body> ... </body></html>'

cobrateam / django-htmlmin

HTML minifier for Python frameworks (not only Django, despite the name).

http://pypi.python.org/pypi/django-htmlmin

BSD 2-Clause "Simplified" License


html_minify wraps result with unwanted '<html><head></head><body> ... </body></html>' #41

Open jheasly opened 11 years ago

jheasly commented 11 years ago

I'm minifying a fragment of an HTML doc and html_minify prepends with <html><head></head><body> and appends a </body></html>, neither of which I want.

For example:

>>> from htmlmin.minify import html_minify
>>> html = 'boo'
>>> html_minify(html)
u'<html><head></head><body>boo</body></html>'

Is there a way to turn this behavior off?

andrewsmedina commented 11 years ago

htmlmin uses beautifulsoup and beautifulsoup does this conversion automatically.

andrewsmedina commented 11 years ago

Actually, it is html5lib that does this automatic conversion.

We can add a parameter to be able to set parser which will be used for minification.

Or fix this behavior.

jheasly commented 11 years ago

Hrm. Yeah, even telling BeautifulSoup to use the 'lxml' parser instead of 'html5lib', in both places where soup gets set (not sure why it gets set twice), doesn't help much. The output still gets wrapped:

u'<html><body><p>boo</p></body></html>'

But at least the lxml parser leaves out the <head> tags!

Being able to set a parameter would be really nice, but the parsers seem to have other ideas! In any event, thanks for your reply.

bernardobarreto commented 11 years ago

"html.parser" doesn't do that, but it breaks a lot of tests. Since the tags are at the beginning and at the end of the output, I think removing them wouldn't damage performance.

bernardobarreto commented 11 years ago

What do you guys think should be done?
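
Removing the wrapper after the fact, as suggested above, could look roughly like the sketch below. `unwrap_fragment` is a hypothetical helper, not part of django-htmlmin, and it assumes the inserted tags carry no attributes, as in the outputs quoted earlier in this thread. It only strips the wrapper when the original input had no `<html>` tag of its own.

```python
import re

# Wrapper that the parser adds around a fragment. Assumption: the inserted
# tags have no attributes and the <head>, if present, is empty.
_PREFIX = re.compile(r"^\s*<html>\s*(?:<head>\s*</head>)?\s*<body>", re.IGNORECASE)
_SUFFIX = re.compile(r"</body>\s*</html>\s*$", re.IGNORECASE)


def unwrap_fragment(original, minified):
    """Strip an auto-inserted document wrapper from minified output.

    Hypothetical helper for illustration; not a django-htmlmin API.
    """
    if re.search(r"<html[\s>]", original, re.IGNORECASE):
        return minified  # the input was a real document: leave it alone
    prefix = _PREFIX.match(minified)
    suffix = _SUFFIX.search(minified)
    if prefix is None or suffix is None:
        return minified  # not wrapped the way we expect: do nothing
    return minified[prefix.end():suffix.start()]


print(unwrap_fragment("boo", "<html><head></head><body>boo</body></html>"))
```

This only undoes the outer wrapper. Anything else the parser adds inside the body (such as the `<p>` that lxml inserted above) is left in place.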

andrewsmedina commented 11 years ago

One way is to create our own HTML5 parser.

bernardobarreto commented 11 years ago

That's a great idea :+1:

moeffju commented 10 years ago

wat. No it isn’t. Writing an HTML parser is HARD. Do you seriously think creating, let alone maintaining, another one is realistic and smart?

Your problem is that html5lib (and BeautifulSoup) were made for the use case of documents. Fragments are not valid HTML documents, so they trigger automatic element insertion. On serialization, you get a complete (valid) HTML5 document. The question is why you are sending HTML fragments as text/html when your response is not a HTML document – just send a different Content-Type and you’ll be fine.

jheasly commented 10 years ago

In my use case, other processing happens outside html_minify before the document is assembled/output. I use BeautifulSoup all the time in less-than-document situations and it doesn't insist the output be a document.

If html5lib is strictly for docs, it's odd that its github page says "Standards-compliant library for parsing and serializing HTML documents and fragments in Python" (emphasis added).

ET-CS commented 10 years ago

I experienced that too. For fragments, I've used the htmlmin package instead:

>>> html = "boo"
>>> import htmlmin
>>> print htmlmin.minify(html)
boo

prerequisite: pip install htmlmin

slafs commented 8 years ago

This bug and e6594376c13 bit us recently. When we return errors we just want them to be simple texts. Now django-htmlmin is wrapping that text with HTML tags. I realize our approach isn't the best one but still I'd like to ask about progress of this issue. Is there any option we could use in our Django project so that django-htmlmin would stop wrapping text with HTML tags?

adamtay82 commented 8 years ago

This makes it very difficult to use django-htmlmin - would love to see a fix!

osamak commented 7 years ago

@moeffju htmlmin explicitly requires Content-Type to be text/html before minifying, so that is not a solution.
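
For projects that do return fragments or plain error text as text/html, one possible guard is to check the body as well as the Content-Type before minifying. The sketch below is only an illustration: `should_minify` is a hypothetical helper, not an existing django-htmlmin option, and the "starts with a doctype or `<html>` tag" test is an assumed heuristic for what counts as a complete document.

```python
import re

# Assumed heuristic: a full document starts with a doctype or an <html> tag.
_DOCUMENT_START = re.compile(r"^\s*(?:<!doctype\s+html|<html[\s>])", re.IGNORECASE)


def should_minify(content_type, body):
    """Return True only for text/html bodies that start like a full document.

    Hypothetical helper for illustration; not a django-htmlmin API.
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html" and bool(_DOCUMENT_START.match(body))


print(should_minify("text/html; charset=utf-8", "<!DOCTYPE html><html></html>"))
print(should_minify("text/html", "Something went wrong"))
```

A caller would run the minifier only when this returns True and pass other bodies through unchanged.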

moeffju commented 7 years ago

@osamak I might have been unclear in my comment. htmlmin is not ready-made for the use case of serializing to fragments. It can parse fragments just fine, and in fact a lot of tags are optional for proper parsing (see https://html.spec.whatwg.org/multipage/syntax.html#syntax-tag-omission). It’s just that BeautifulSoup and html_minify do not walk the tree themselves for serialization, so they get the default behavior. html5lib can parse and deparse fragments fine, as evidenced by https://html5lib.readthedocs.io/en/latest/movingparts.html#htmlserializer

moeffju commented 7 years ago

Without running the code, I believe the problem stems from https://github.com/cobrateam/django-htmlmin/blob/master/htmlmin/minify.py#L59 – if you serialize the html5lib object tree manually instead of casting to string, it should be just fine. But http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L1119 does not seem to expose that option, so you might need to do it manually.

Cannot validate this hypothesis because I don’t have a working python env handy.

moeffju commented 7 years ago

@osamak Also “requires the Content-Type text/html” is exactly my point: If you are delivering content with Content-Type text/html then your content should be a complete HTML document. A fragment isn’t, so you should not send it with Content-Type text/html.

dev-codiyapa commented 6 years ago

Is there any solution for this? I am getting the same issue, so it's clear that it's not fixed in the currently available version.