cloudhead / node-static

RFC 2616-compliant HTTP static-file server module, with built-in caching.
MIT License
2.17k stars · 245 forks

Explain why a builtin cache is a good idea #18

Closed otterley closed 10 years ago

otterley commented 13 years ago

Please explain why buffering static file data in memory is a good idea, and provide comparative benchmarks proving it. Most modern kernels already do this for you (see the 'cache' value in free(1)); caching in user space in addition is almost always a waste of memory.

indexzero commented 13 years ago

No offense, but you made the conjecture. I believe the onus is on you to prove why it's a bad idea before claiming it is.

otterley commented 13 years ago

I'm not offended. :)

I believe my report was adequately descriptive. As I said, the kernel already provides a very good block I/O caching layer. Have you actually benchmarked it with the cache disabled vs. enabled?

indexzero commented 13 years ago

I've been having a hard time getting good benchmarks because I think my own network I/O is the throughput ceiling of the benchmark. I'm using spark / spark2 + node-static.

Maybe we could coordinate a siege -c200 -t5M http://somenodejsbenchmarksite.com sometime

Marak commented 13 years ago

I can't comment on how node deals with caching, but I can tell you from experience that using a basic in memory cache for reading and serving small files in node will yield a vast improvement in performance.

otterley commented 13 years ago

Please show us the data, Marak.

Marak commented 13 years ago

Hrmm? If you disagree, start a server up with cache and without cache and compare your average response times. I'm not going to take my time to make a benchmark for you.

I'd be interested to see comprehensive data to make a more informed decision, but from my personal node experience the in-memory cache is faster, so we'll continue to use it for our production-facing node-static sites.

Furthermore, I've spoken with our resident kernel hacker and he's telling me your understanding of caching is somewhat incorrect. I wish I could comment more, but I'm no systems expert. I would suggest posting your issue to the node.js mailing list as a more general node.js development question since this issue isn't just specific to node-static. You'll get a much better answer I'm sure.

otterley commented 13 years ago

Your "resident kernel hacker" is correct to the extent that application-based caching can save some open() and read() syscalls. This can improve performance for very small working sets, at the very high cost of precious memory that other applications might benefit from. In any event, these syscalls are generally fast enough that performance without the additional cache will almost always be more than adequate.

For, say, edge caching and other enterprise workloads (i.e. where the working set is not unrealistically small), the incremental improvement in performance gained by avoiding these syscalls is small compared to the inefficiencies incurred in caching in user-space. Using this module incurs at least one additional copy of the data in RAM: one that's mapped to the block I/O cache (since the NodeJS fs module doesn't use direct I/O) and another copy for each NodeJS process. Consider that very large machines often have 8 cores; unless you're running only one NodeJS process on that host, you're going to have N copies around. This means you'll be able to effectively cache, at most, RAM / (n+1), where n is the number of NodeJS processes.
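To make the RAM / (n+1) arithmetic above concrete, here is a quick sketch. The numbers (8 GiB of RAM, 8 Node.js processes) are illustrative assumptions, not figures from the thread.

```javascript
// Illustrative arithmetic for the RAM / (n + 1) bound described above.
// Assumptions: 8 GiB of RAM, 8 Node.js processes each holding its own copy,
// plus one shared copy in the kernel block I/O cache.
const ramGiB = 8;
const nodeProcesses = 8;
const effectiveCacheGiB = ramGiB / (nodeProcesses + 1);
console.log(effectiveCacheGiB.toFixed(2)); // prints "0.89"
```

In other words, duplicating the cache per process shrinks the effectively cacheable working set to a fraction of physical RAM.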

I encourage you to run a benchmark, ensuring that your working set is not superficially small. Most benchmarks that appear to demonstrate the effectiveness of a user-space RAM cache exercise retrieving a single file. Such a benchmark may be accurate as far as it goes, but it is misleading with respect to real-world performance. To be practically relevant, the working set in such benchmarks must be larger than RAM / (n+1), where n is the number of CPUs concurrently running NodeJS; it almost never is.

And beware: when the cache working set grows larger than RAM (which is quite possible because there's no ceiling on the module's cache size and no LRU or LFU eviction policy), the kernel will swap Node.js pages to disk. This is perhaps the worst case imaginable.
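The missing eviction policy mentioned above could be sketched as a small LRU. This is a hypothetical illustration, not node-static's implementation (which, per this thread, had no eviction at all); it relies on `Map` preserving insertion order, so re-inserting on every access keeps the least recently used entry first.

```javascript
// Minimal LRU cache sketch with a bounded entry count (hypothetical;
// node-static at the time had no ceiling or eviction policy).
class LruCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.map = new Map(); // Map iterates keys in insertion order
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // move to the most-recently-used position
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // evict the least recently used (first) entry
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

Bounding the entry count (or total byte size) is what prevents the swap-to-disk worst case described above.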

Please have your team contact me directly if they disagree with this analysis.

Marak commented 13 years ago

I don't think anyone really cares that much. You should post this to the node mailing list to get a better response.

cimnine commented 13 years ago

Hey otterley..

This is an open project. Why don't you just fork 'node-static', implement a cache on/off switch, and submit a pull request? Or better, implement the missing cache-size ceiling or an eviction policy. You seem to have the knowledge, and that way we would all benefit.

Because IMHO this is what open source, especially on GitHub, is all about.

Cheers, Chris

dobesv commented 13 years ago

It seems like you can disable the in-memory cache by specifying cache:false in the options.
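For reference, a usage sketch of the option dobesv mentions. This is a configuration fragment, not runnable without the node-static package, and the exact semantics of `cache` varied by version (in some versions it controls the Cache-Control header rather than the in-memory cache), so check your version before relying on it.

```javascript
// Hypothetical usage sketch based on the comment above: pass cache: false
// in the options object to disable the in-memory cache.
var static = require('node-static');
var fileServer = new static.Server('./public', { cache: false });
```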

However, that only disables serving files from the cache. At least in my current version of node-static, it will still keep a copy of every file in memory forever, because it doesn't check whether the cache is enabled before storing into it.

Given the choice I'd rather have no in-memory caching system than a broken / incomplete one.

I would suggest eliminating the in-memory cache until you can prove it makes a difference - in theory it should not and in practice it is broken. Seems like a classic case of premature optimization - that is, adding an optimization without first seeing if there is a problem, and without measuring whether the optimization made any improvement over the unoptimized version.

jvonmitchell commented 10 years ago

Can confirm. In memory cache improves performance. The file system cache improves performance as well.

Some of you are saying you want numbers. I can only go off the top of my head.

For image files, server side: roughly 200 ms for the first serve, 20-40 ms if it is in the file system cache, and 0-1 ms if it is in the in-memory cache.

Of course the response is received more slowly than that, but your app can call res.end() and back out of the function stack; the rest is async. Either way, bandwidth will be your limit, but the point is to free up the application's active time so it can handle more requests with fewer threads.

Addendum: The reason I came into the issues was to see whether there was any discussion of how the cache is implemented. It would be nice if the strategy used got at least a cursory mention in the README. I like how the issues for this repo have been spammed by morons. If you don't have a useful request or a real bug, please don't troll the nice people who made this.

phstc commented 10 years ago

@jvonmitchell Unfortunately, the builtin cache was temporarily removed due to the many open issues related to its high memory usage. For example, for people serving video files etc., node-static was storing all of these files in memory regardless of size. The README is outdated; we need to update it (PR?).

I like the idea of the in-memory cache, it is really fast, but maybe we have to add user-configurable controls so that files larger than SIZE are not stored (where SIZE can be ZERO in order not to store any files).
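The size threshold suggested above could be sketched as follows. The function and option names here are hypothetical, not part of node-static's API.

```javascript
// Sketch of a size-gated cache insert: files larger than maxFileBytes are
// never stored, and a threshold of 0 disables caching entirely.
function makeCachePut(cache, maxFileBytes) {
  return function put(key, data) {
    if (maxFileBytes === 0 || data.length > maxFileBytes) return false;
    cache[key] = data;
    return true;
  };
}
```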

jvonmitchell commented 10 years ago

Yeah, I foresaw that issue and so I rolled my own. I might as well share what I did to avoid that problem. Basically there is a global threshold, and when the total goes over it, the longest-stored items are evicted in a loop until we are back under the threshold. A simple Object.keys(cache) gives you the keys in order of their initial insertion (for non-integer keys such as file paths).
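The strategy described above can be sketched as follows. The names and the 64 MiB threshold are illustrative assumptions, not jvonmitchell's actual code; it relies on plain objects iterating non-integer string keys in insertion order, so the first key is the oldest entry.

```javascript
// Evict oldest-first until the total cached bytes drop below a global threshold.
const MAX_CACHE_BYTES = 64 * 1024 * 1024; // illustrative 64 MiB ceiling
const cache = {};  // key -> { data: Buffer, size: number }
let cacheBytes = 0;

function evict(key) {
  cacheBytes -= cache[key].size;
  delete cache[key];
}

function put(key, data) {
  if (key in cache) evict(key); // replace an existing entry cleanly
  cache[key] = { data, size: data.length };
  cacheBytes += data.length;
  // Non-integer string keys iterate in insertion order, so keys[0] is
  // the longest-stored entry.
  const keys = Object.keys(cache);
  let i = 0;
  while (cacheBytes > MAX_CACHE_BYTES && i < keys.length) {
    evict(keys[i++]);
  }
}
```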