bloudraak / htmlcompressor

Automatically exported from code.google.com/p/htmlcompressor
Apache License 2.0

Whitespace between inline tags is not preserved #18

GoogleCodeExporter closed this issue 8 years ago

GoogleCodeExporter commented 8 years ago
According to the HTML 4.01 specification, whitespace immediately after a start 
tag and immediately before an end tag can be removed, and runs of whitespace 
characters can be collapsed into a single character.

However, htmlcompressor doesn't handle whitespace correctly:

$ cat t.html
<html>
  <head>
    <title>foobar</title>
  </head>
  <body>
    <p>
      <span>foo</span>
      <span>bar</span>
    </p>
  </body>
</html>
$ java -jar htmlcompressor-0.9.3.jar t.html
<html> <head> <title>foobar</title> </head> <body> <p> <span>foo</span> <span>bar</span> </p> </body>
$ java -jar htmlcompressor-0.9.3.jar --remove-intertag-spaces
<html><head><title>foobar</title></head><body><p><span>foo</span><span>bar</span></p></body></html>

Both outputs are incorrect. The correct version would be:

<html><head><title>foobar</title></head><body><p><span>foo</span> <span>bar</span></p></body>

Original issue reported on code.google.com by o...@mirix.org on 24 Sep 2010 at 11:50

GoogleCodeExporter commented 8 years ago
So can you come up with a rule for when spaces should be removed and when they 
shouldn't? I can't. 

Should spaces be removed here?
<p>
    <div style="display:inline">foo</div> 
    <div style="display:inline">bar</div>
</p>

What about here:
<p>
    <span style="display:block">foo</span>
    <span style="display:block">bar</span> 
</p>

It is impossible to guess your intentions. With CSS you can take pretty much 
any HTML element and turn it into something completely different.

Original comment by serg472@gmail.com on 25 Sep 2010 at 3:25

GoogleCodeExporter commented 8 years ago
See http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1:

  For all HTML elements except PRE, sequences of white space separate "words" (we use the term "word" here to mean "sequences of non-white space characters"). When formatting text, user agents should identify these words and lay them out according to the conventions of the particular written language (script) and target medium.

  This layout may involve putting space between words (called inter-word space), but conventions for inter-word space vary from script to script. For example, in Latin scripts, inter-word space is typically rendered as an ASCII space ( ), while in Thai it is a zero-width word separator (​). In Japanese and Chinese, inter-word space is not typically rendered at all.

  Note that a sequence of white spaces between words in the source document may result in an entirely different rendered inter-word spacing (except in the case of the PRE element). In particular, user agents should collapse input white space sequences when producing output inter-word space. This can and should be done even in the absence of language information (from the lang attribute, the HTTP "Content-Language" header field (see [RFC2616], section 14.12), user agent settings, etc.).

  The PRE element is used for preformatted text, where white space is significant.

  In order to avoid problems with SGML line break rules and inconsistencies among extant implementations, authors should not rely on user agents to render white space immediately after a start tag or immediately before an end tag. Thus, authors, and in particular authoring tools, should write:

    <P>We offer free <A>technical support</A> for subscribers.</P>

  and not:

    <P>We offer free<A> technical support </A>for subscribers.</P>

As I said, whitespace immediately after a start tag and immediately before an 
end tag can be removed, and runs of whitespace characters can be collapsed into 
a single character.

Original comment by o...@mirix.org on 25 Sep 2010 at 10:22

GoogleCodeExporter commented 8 years ago
Your examples should be compressed as follows:

<p><div style="display:inline">foo</div> <div style="display:inline">bar</div></p>

<p><span style="display:block">foo</span> <span style="display:block">bar</span></p>

There could be problems with the CSS 3 white-space-collapse property, though 
one could argue that it is not the task of an HTML compressor to interpret CSS. 
A workaround could be to specify the tags, ids or classes that htmlcompressor 
should not collapse.

Original comment by o...@mirix.org on 25 Sep 2010 at 10:27
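
The workaround suggested in this comment could look roughly like the following Python sketch. It is only an illustration: the function name `collapse_whitespace` and its `preserve_tags` parameter are hypothetical and are not htmlcompressor options, it matches on tag names only (not ids or classes), and it assumes preserved elements are not nested inside elements of the same name.

```python
import re

def collapse_whitespace(html, preserve_tags=("pre",)):
    """Collapse whitespace runs to one space, except inside preserve_tags.

    Hypothetical helper for illustration; not part of htmlcompressor.
    """
    if not preserve_tags:
        return re.sub(r"\s+", " ", html)
    names = "|".join(re.escape(tag) for tag in preserve_tags)
    # Alternative 1: a whole preserved element (group 1 holds its tag name).
    # Alternative 2: a run of whitespace anywhere else.
    pattern = re.compile(
        rf"<({names})\b[^>]*>.*?</\1\s*>|\s+",
        re.IGNORECASE | re.DOTALL,
    )
    # Preserved elements are emitted unchanged; whitespace runs become " ".
    return pattern.sub(lambda m: m.group(0) if m.group(1) else " ", html)

sample = "<p>a   b</p>\n\n<pre>  x\n  y</pre>\n<p>c</p>"
print(collapse_whitespace(sample))
```

Passing something like `preserve_tags=("pre", "code")` would let an author opt further elements out of collapsing without the compressor having to interpret CSS.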

GoogleCodeExporter commented 8 years ago
What about this:

<div>
    <div>1</div>
    <div>2</div>
</div>    
<div>
    <div><img/></div>
    <div><img/></div>
</div>    

Sorry I still don't see a pattern.

You said the previous example should be compressed like this:
<p><span style="display:block">foo</span> <span style="display:block">bar</span></p>

But who said I don't want a space between <p> and <span>? Maybe I do, maybe I 
don't. What about the space after </p>? Maybe I need it there as well. 

Original comment by serg472@gmail.com on 25 Sep 2010 at 4:33

GoogleCodeExporter commented 8 years ago
The HTML 4.01 specification says it all. Your example would be:

<div><div>1</div> <div>2</div></div><div><div><img></div> <img></div></div>

HTML5 is more specific about this, since it contains default CSS rules and 
falls back to CSS defaults: 
http://lists.whatwg.org/pipermail/help-whatwg.org/2010-September/000665.html

Original comment by o...@mirix.org on 25 Sep 2010 at 11:13

GoogleCodeExporter commented 8 years ago
I made a typo; the example should read:

<div><div>1</div> <div>2</div></div><div><div><img></div> <div><img></div></div>

Original comment by o...@mirix.org on 25 Sep 2010 at 11:14

GoogleCodeExporter commented 8 years ago
I don't see anything in the specs that says which spaces should be removed. Can 
you please show me where it says that?

Why is there no space here:

    <div>2</div>
</div>    
<<<<<<<<<<<<<< here
<div>
    <div><img/></div>

in the previous example?

So:
<div>1</div>   
<div>2</div>

becomes:
<div>1</div> <div>2</div>

But:
<div> <div>1</div> </div> 
<div> <div>2</div> </div>

becomes:
<div><div>1</div></div><div><div>2</div></div>

?

If I have two divs:
<div></div> <div></div>

Should there be space between or not?

Original comment by serg472@gmail.com on 26 Sep 2010 at 6:02

GoogleCodeExporter commented 8 years ago
http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 describes the collapsing 
and removal of whitespace. I did quote the relevant paragraphs in Comment #2.

Original comment by o...@mirix.org on 26 Sep 2010 at 11:55

GoogleCodeExporter commented 8 years ago
In addition, you could remove the whitespace between block tags entirely 
instead of collapsing it.

Original comment by o...@mirix.org on 26 Sep 2010 at 12:49
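
As a rough illustration of the rule proposed in this comment, the Python sketch below removes whitespace between two tags unless both neighbouring tags are inline, in which case it keeps a single space. The names `INLINE_TAGS` and `compress_intertag` are hypothetical, the inline list is deliberately short, and the sketch decides by tag name only, so it ignores CSS, comments, PRE, and attribute values containing `>`.

```python
import re

# Assumption: a deliberately short, illustrative list of inline tag names.
INLINE_TAGS = {"a", "b", "em", "i", "img", "span", "strong"}

# A tag, the whitespace after it, and (as a lookahead) the name of the next tag.
INTERTAG = re.compile(
    r"(</?([A-Za-z][A-Za-z0-9]*)[^>]*>)\s+(?=</?([A-Za-z][A-Za-z0-9]*))"
)

def compress_intertag(html):
    """Drop inter-tag whitespace; keep one space only between two inline tags."""
    def replace(match):
        prev_tag = match.group(1)
        both_inline = (match.group(2).lower() in INLINE_TAGS
                       and match.group(3).lower() in INLINE_TAGS)
        return prev_tag + (" " if both_inline else "")
    return INTERTAG.sub(replace, html).strip()

source = """<html>
  <head>
    <title>foobar</title>
  </head>
  <body>
    <p>
      <span>foo</span>
      <span>bar</span>
    </p>
  </body>
</html>
"""
print(compress_intertag(source))
```

On the `t.html` input from the original report this keeps the space between the two spans and drops the rest. Because it only looks at tag names, it would also drop the space between the two `display:inline` divs from the earlier comment, which is exactly the objection raised in this thread.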

GoogleCodeExporter commented 8 years ago
A single space always matters. That spec is talking about how spaces are 
collapsed when a page is rendered; it doesn't say anything about removing 
spaces _before_ rendering. Replacing each run of spaces with a single space is 
the only safe thing to do before rendering.

You can't remove a single space at the beginning or end of any tag without 
potentially changing how the page renders.

All these:
<span> <span>1</span> </span><span> <span>2</span> </span>
<span> 1 </span><span> 2 </span>
<span>1</span> <span>2</span>

will be rendered as "1 2". Removing any spaces would break it.

Original comment by serg472@gmail.com on 26 Sep 2010 at 4:09
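
The claim in this comment can be checked with a crude stand-in for inline text rendering: drop the tags, then collapse whitespace the way a user agent would. The helper `rendered_text` below is hypothetical and only a sketch; it ignores CSS, block-level boundaries and character entities.

```python
import re

def rendered_text(html):
    """Rough approximation of rendered inline text (illustration only)."""
    text = re.sub(r"<[^>]+>", "", html)        # drop tags, keep text and spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse runs, trim the ends

variants = [
    "<span> <span>1</span> </span><span> <span>2</span> </span>",
    "<span> 1 </span><span> 2 </span>",
    "<span>1</span> <span>2</span>",
]
for variant in variants:
    print(rendered_text(variant))              # each line prints: 1 2

# With every space removed, the two words run together.
print(rendered_text("<span>1</span><span>2</span>"))   # prints: 12
```

Under this approximation all three variants yield the same text, while stripping the whitespace after start tags and before end tags from the second variant would turn "1 2" into "12".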

GoogleCodeExporter commented 8 years ago
Issue 41 has been merged into this issue.

Original comment by serg472@gmail.com on 3 May 2011 at 3:27