cobrateam / django-htmlmin

HTML minifier for Python frameworks (not only Django, despite the name).
http://pypi.python.org/pypi/django-htmlmin
BSD 2-Clause "Simplified" License
542 stars 73 forks source link

Excessive whitespace removal #21

Open bradbeattie opened 13 years ago

bradbeattie commented 13 years ago

The extra whitespace in <a href="asdf">asdf</a>......<a href="asdf">asdf</a> Should probably collapse down to <a href="asdf">asdf</a>.<a href="asdf">asdf</a> But your code turns it into <a href="asdf">asdf</a><a href="asdf">asdf</a>

(spaces represented as periods here as the markdown is collapsing my whitespace)

jcampanello commented 13 years ago

Hi,

the issue happens also inside p tags that span multiple lines. For example:

<p>this paragraph spans
  multiple lines<p> 

shows on the browser as:

this paragraph spansmultiple lines

Note that not only the multiple leading spaces on the second line are removed, the new line is not converted to a white space, so the words spans and multiple get together.

Great tool BTW.

flavianmissi commented 13 years ago

Hi!

I'm working on this, but I had a problem reproducing the space problem between two tags... I wrote a test to test the exactly same link case that you showed up, but I couldn't reproduce it, the test is passing... could you tell me what version you're using...?

The problem reported by @jcampanello I was able to catch on the tests, but I'll not finish the code right now. I just wanted to give an update for you guys that we are working on that. :)

jcampanello commented 13 years ago

Hi Flavia,

the package got installed as "django_htmlmin-0.5-py2.6.egg" (i.e: html-min v 0.5 on python 2.6).

Do you have an idea of when do you plan to release the fix?

Thanks for the effort!

José Luis

flavianmissi commented 13 years ago

Howdy!

I'll probably solve this bug until the end of this week.

;)

jcampanello commented 13 years ago

Super! Thanks again!

flavianmissi commented 13 years ago

Howdy!

Please, update to 0.5.1 :)

jcampanello commented 13 years ago

Flavia,

did the upgrade. Everything seems ok.

Thanks again!

flavianmissi commented 13 years ago

According to @bradbeattie, there are still issues with white spaces removal between two tags. e.g. <button>First</button> <button>Second</button>

I'm gonna take a look at this, and see if makes sense to fix, as soon as I can :)

bradbeattie commented 13 years ago

Note that buttons are just one example. Inline divs will also be an issue, but you can't tell whether a div will be online or not without deep inspection of the CSS and JavaScript. And what about elements with white-space: pre?

fsouza commented 13 years ago

div's are not supposed to be inline, ever. If you need an inline element, don't use a div.

Anyway, this issue is a little tricky, I'm thinking about it...

bradbeattie commented 13 years ago

http://www.brunildo.org/test/inline-block.html

fsouza commented 13 years ago

Right, thanks for sharing :)

ghost commented 12 years ago

i'm still having the issue. Any news fsouza?

fsouza commented 12 years ago

not yet, sorry for the delay.

I'll take a look in the next weekend.

neoascetic commented 12 years ago

Any news on this?

atodorov commented 10 years ago

ping. any news on this ? It is messing up my pages.

hrbonz commented 10 years ago

Looking at the code and the tests, I can see why the whitespaces are dumped and why the test passes (it is made so it passes but does not cover the real problem).

In htmlmin/tests/resources/line_break.html the line starts with a whitespace which makes the test pass and gives the illusion that line break and whitespace is properly covered.

I think the htmlmin.minify.html_minify() logic is broken by design, as soon as the source is broken on line breaks it makes the proper handling of line breaks to spaces, multiple spaces, etc very difficult. I see several edge cases like \n where this won't be rendered properly. And of course the simple <p>my text\nbroken in two lines</p> is not rendered properly.

I think that a better way (probably slower though) would be to use beautifulsoup to dive in the dom tree and work on a element base. I'll try to implement a solution along those lines.

hrbonz commented 10 years ago

I'm coming back with some good news, I started by cleaning up a lot of the tests that were incorrect (trailing spaces just before </body> in htmlmin/tests/test_middleware.py for example). Then, I implemented what I think is a better and stronger way to parse the html source. A big problem I saw with the previous code was in the multiline handling and some edge cases. I think the code now is more robust. It's also faster! I couldn't believe it in the beginning but you can check by yourself, it's most of the time ~40% faster:

running 10000 rounds
with_html_content_in_pre.html
    old style: 38.43s
    recurse: 21.97s (57.16%)
with_textarea.html
    old style: 44.63s
    recurse: 25.22s (56.50%)
without_doctype.html
    old style: 34.30s
    recurse: 18.61s (54.27%)
with_entities_in_textarea.html
    old style: 49.32s
    recurse: 29.76s (60.35%)
html5.html
    old style: 58.72s
    recurse: 34.28s (58.39%)
blogpost.html
    old style: 231.60s
    recurse: 143.30s (61.87%)
with_pre.html
    old style: 97.18s
    recurse: 73.11s (75.23%)
with_html_content_in_javascript.html
    old style: 55.58s
    recurse: 32.67s (58.79%)
with_javascript.html
    old style: 54.14s
    recurse: 31.36s (57.92%)
with_html_content_in_textarea.html
    old style: 46.59s
    recurse: 27.07s (58.11%)
with_textarea_with_blank_lines.html
    old style: 44.96s
    recurse: 25.55s (56.82%)
non_ascii_in_excluded_element.html
    old style: 40.09s
    recurse: 22.13s (55.20%)
with_multiple_comments.html
    old style: 41.99s
    recurse: 23.58s (56.16%)
multiple_spaces.html
    old style: 32.43s
    recurse: 16.68s (51.43%)
with_comments_to_exclude.html
    old style: 40.55s
    recurse: 22.58s (55.68%)
with_comments.html
    old style: 40.51s
    recurse: 22.56s (55.68%)
with_menu.html
    old style: 45.32s
    recurse: 26.19s (57.78%)
non_ascii.html
    old style: 36.60s
    recurse: 20.31s (55.49%)
line_break.html
    old style: 28.06s
    recurse: 14.07s (50.15%)
with_blank_lines.html
    old style: 39.96s
    recurse: 22.55s (56.43%)
with_conditional_comments.html
    old style: 42.32s
    recurse: 23.58s (55.73%)
with_multiple_line_comments.html
    old style: 40.68s
    recurse: 22.56s (55.47%)
with_old_doctype.html
    old style: 39.15s
    recurse: 21.03s (53.70%)

Let me know what you guys think about the code in #69

o3bvv commented 9 years ago

By the way, non-breakable whitespaces (&nbsp;) are removed from output too.