Cherokee Web Server
GNU General Public License v2.0

Feature Request: HTML type code cleanup prior to Gzip/Deflate #450

Open danielniccoli opened 11 years ago

danielniccoli commented 11 years ago

Original author: pubcrawl...@gmail.com (April 15, 2009 17:32:41)

Requesting a new feature for Cherokee, ideally configurable in Admin, that will enable Cherokee to do the following:

Remove HTML white space
Remove HTML comments
Remove CSS white space
Remove CSS comments

This would happen prior to any Gzip/Deflate compression. It should trim down page sizes, make some things faster (especially at scale), and prevent unneeded disclosure of non-public info, like developer comments in code.
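A minimal sketch of the four requested operations, written in Python for illustration only. The function names `strip_html` and `strip_css` are hypothetical and are not part of Cherokee. The regular expressions are best effort: they would also collapse whitespace inside `<pre>` blocks and inline scripts, and would remove conditional comments, so this shows the idea rather than a safe implementation.

```python
import re

# Best-effort patterns; real markup would need a tolerant parser instead.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
CSS_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
WHITESPACE_RUN = re.compile(r"\s+")


def strip_html(markup: str) -> str:
    """Remove HTML comments, then collapse each whitespace run to one space."""
    without_comments = HTML_COMMENT.sub("", markup)
    return WHITESPACE_RUN.sub(" ", without_comments).strip()


def strip_css(stylesheet: str) -> str:
    """Remove CSS comments, then collapse each whitespace run to one space."""
    without_comments = CSS_COMMENT.sub("", stylesheet)
    return WHITESPACE_RUN.sub(" ", without_comments).strip()


if __name__ == "__main__":
    print(strip_html("<p>\n  Hello <!-- TODO: remove debug login -->  world\n</p>"))
    print(strip_css("body {\n  color: red; /* brand */\n}"))
```

Comments are removed before whitespace is collapsed so that the gap a comment leaves behind is folded into a single space.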

Original issue: http://code.google.com/p/cherokee/issues/detail?id=441

danielniccoli commented 11 years ago

From skar...@gmail.com on April 17, 2009 11:42:07 I think this is a task for a script before you deploy your web documents to the production server.

It would make things even faster than a Cherokee module.

danielniccoli commented 11 years ago

From pubcrawl...@gmail.com on April 17, 2009 15:45:17 I agree that cleaning code would be best handled at development level.

However, if you go poke around in everyone's final served HTML, there is much to be learned from what is being left in there.

In hosted environments, and wherever contractor or other third-party code is running (be it just HTML or dynamic applications), much of what is output is often unnecessary to the end user and a potential security issue.

danielniccoli commented 11 years ago

From ste...@konink.de on May 13, 2010 03:54:32 If someone can show me:

1) A program that does the above (either regexp or just plain C)
2) The impact on the gzip/deflate output and/or the browser rendering performance

I'll create the module. For now I'm skeptical. I understand that comments etc. could be stripped, but unless the input is validated HTML/XML/CSS, the only thing that can be done is best-effort stream parsing. I don't want to see sites break completely because a developer forgot to close a comment.

Suggestions welcome...
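One way to approach the second question (the impact on gzip output) is to compare byte counts before and after stripping. The following is a sketch in Python under stated assumptions: `minify`, `gzip_size`, and `report` are hypothetical helper names, and the regex-based stripping is best effort only. It also shows one conservative answer to the unclosed-comment concern: the non-greedy pattern only matches a comment that has a closing `-->`, so an unterminated `<!--` is left in place rather than swallowing the rest of the page.

```python
import gzip
import re

# Matches only comments that are actually closed with "-->".
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
WHITESPACE_RUN = re.compile(r"\s+")


def minify(markup: str) -> str:
    """Best-effort: drop closed HTML comments and collapse whitespace runs."""
    return WHITESPACE_RUN.sub(" ", HTML_COMMENT.sub("", markup)).strip()


def gzip_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def report(markup: str) -> dict:
    """Byte counts for the raw and minified page, with and without gzip."""
    minified = minify(markup)
    return {
        "raw": len(markup.encode("utf-8")),
        "raw_gzip": gzip_size(markup),
        "minified": len(minified.encode("utf-8")),
        "minified_gzip": gzip_size(minified),
    }


if __name__ == "__main__":
    page = "<ul>\n" + "    <li>item</li>  <!-- internal note -->\n" * 200 + "</ul>\n"
    print(report(page))
    # An unterminated comment is left untouched instead of eating the page.
    print(minify("<p>a</p><!-- oops <p>b</p>"))
```

The numbers depend entirely on the input page, so any real decision would need measurements on representative content rather than on a synthetic sample like the one above.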

danielniccoli commented 11 years ago

From alobbs on May 13, 2010 05:09:14 http://tidy.sourceforge.net/ ?

Although I don't think the web server is the right place for this. Actually, it'd repeat a million times a task that could be performed just once.

danielniccoli commented 11 years ago

From ste...@konink.de on May 13, 2010 05:16:58 It will be performed just once if the iocache takes the encoder output into account as well.

danielniccoli commented 11 years ago

From alobbs on May 13, 2010 05:33:16 The I/O cache "only" caches static files, so it is not the right tool for that. Actually, the cache evicts the less "interesting" elements from the cache pool whenever it fills up, so you could not even be sure you'd do it just once unless you write the output into a (temporary?) file.

danielniccoli commented 11 years ago

From plundis@areaindex.com on May 13, 2010 14:34:21 When I submitted this we were migrating from Windows platform servers. There is a company called Port80 Software that has this functionality built into their HTTPZIP software, which we then used.

http://www.port80software.com/products/httpzip/faq.asp#CodeOptBreak

The functionality is optional when running HTTPZIP - so you can enable or disable as you see fit.

We front-end content with Cherokee and in some instances we have no control over what the developers do. So comments can trickle out in the source like:

I've seen other stuff that you won't benefit distributing to the world, that eat up bandwidth and are security issues.

This functionality to clean up whitespace and remove comments is fine as an option, and only once the IOCache extends to reverse proxy items.

danielniccoli commented 11 years ago

From Kissa...@gmail.com on October 27, 2010 07:43:12 I see problems here, partly mentioned in comments before:

* You don't want to parse files on each request, as that would unnecessarily decrease performance when one-time minifying would have been enough. Thus, caching is needed.
* If you cache it, when should the cache be invalid and the file reparsed? With dynamic files / responses they could or do change every time.
* You need valid HTML, XML, CSS, or even more performance-intensive operations (see comment above)

In the end, the application / files and the developer / webmaster themselves should know when they can and should minify. With dynamic content, in my opinion, it should be the application that caches and minifies, if you want to do it dynamically. Then you can add cache control, like many frameworks do with dynamic content and templates.
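For static files, one possible answer to the invalidation question is to key the cached result on the file's modification time and size, and to re-minify only when either changes. The sketch below is in Python and rests on assumptions: `MinifyCache` is a hypothetical class, unrelated to Cherokee's actual iocache, and it does nothing for dynamic responses, which have no file to stat.

```python
import os
from typing import Callable, Dict, Tuple


class MinifyCache:
    """Hypothetical cache: re-minify a file only when mtime or size changes."""

    def __init__(self, minify: Callable[[str], str]) -> None:
        self._minify = minify
        # path -> ((mtime_ns, size), minified text)
        self._entries: Dict[str, Tuple[Tuple[int, int], str]] = {}
        self.misses = 0  # number of times a file was actually (re)minified

    def get(self, path: str) -> str:
        stat = os.stat(path)
        stamp = (stat.st_mtime_ns, stat.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == stamp:
            return cached[1]  # file unchanged, reuse the earlier result
        with open(path, encoding="utf-8") as handle:
            result = self._minify(handle.read())
        self._entries[path] = (stamp, result)
        self.misses += 1
        return result
```

This sketch never evicts entries. A bounded cache would drop entries under memory pressure and could then repeat the work for an unchanged file.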

danielniccoli commented 11 years ago

From skar...@gmail.com on November 04, 2010 07:59:50 Maybe a handler like mod_pagespeed[1] would be interesting.

Could we port it to Cherokee?

[1] http://code.google.com/p/modpagespeed/

danielniccoli commented 11 years ago

From ste...@konink.de on October 14, 2011 07:52:18 Now that we do have the iocache, it might be interesting to actually implement this, since for static content we would tidy only once.