Open quicksketch opened 9 years ago
Yea performance! On Feb 13, 2015 9:42 PM, "Nate Haug" notifications@github.com wrote:
We've already optimized Backdrop to the point that a database connection is not necessary if you're using an alternative cache-backend like Memcache. Even though the database is unnecessary except for the single query to get the cached page, we still need that database connection just for that single query.
I was experimenting with using a flat-file cache as an alternative, a basic port of https://www.drupal.org/project/filecache. The results were fairly promising running ab -c 100 -n 100 http://backdrop.local/:
Database cache:
Requests per second:    569.45 [#/sec] (mean)
Time per request:       175.608 [ms] (mean)
Time per request:       1.756 [ms] (mean, across all concurrent requests)
Transfer rate:          12078.02 [Kbytes/sec] received
File cache:
Requests per second:    741.53 [#/sec] (mean)
Time per request:       134.857 [ms] (mean)
Time per request:       1.349 [ms] (mean, across all concurrent requests)
Transfer rate:          15727.74 [Kbytes/sec] received
So roughly a 30% performance increase in cached page delivery. The downside is that the disk becomes littered with nonsense cache files. But as far as disk space goes, this is no different from the page cache, which takes up just as much room, but it's hidden away in the database.
If we can test this approach on various servers and consistently get a similar increase, I think we might consider bundling a flat-file cache and using it by default for the page cache.
— Reply to this email directly or view it on GitHub https://github.com/backdrop/backdrop-issues/issues/716.
this sounds very cool; I'd be glad to test locally, or on my Linode, or both
I just deleted my entire sandbox for the second time (file_unmanaged_delete_recursive() is a dangerous tool). I'm not up for porting this a 3rd time tonight, so I'm afraid further benchmarks are going to have to wait.
oh no! ...wait, it is:

file_unmanaged_delete_recursive() is a dangerous tool
Oh the memories this brought back!! ...back in the 1990s when we were learning DOS loaded from floppies, our teacher told us "NEVER-EVER use format c: !!!" :smile:
Thanks klonos, that makes me :smile:
It's been a little bit of a rough night in programming land. :stuck_out_tongue_winking_eye:
If we use two servers, do we have to cache two versions? On cache flushing, how do we make sure caches are flushed on all servers? Thanks
@andytruong in the event that you had two servers, you probably already need to share the files directory between the two of them (usually over an NFS mount), so the cache would still be shared between the servers just like any other file in the public files directory. However, in such a situation, you could switch the cache back to the database, or use another central storage like memcache. Lastly, if you had two servers, you'd probably also have Varnish or nginx in front of them distributing traffic between them, in which case the Backdrop page cache becomes inconsequential.
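For the multi-server case, the escape hatch could be as simple as a settings.php override. A sketch, assuming Backdrop keeps Drupal 7's per-bin cache class convention; the exact setting names below are assumptions and should be verified against core's default.settings.php:

```php
<?php
// settings.php sketch. Setting names follow the Drupal 7 convention
// (cache_default_class and per-bin cache_class_CACHEBIN) that Backdrop
// inherited; verify against your Backdrop version before relying on them.

// Single server: keep the page cache in flat files.
$settings['cache_class_cache_page'] = 'BackdropFileCache';

// Multi-server (e.g. files directory on a shared NFS mount): send every
// bin back to the database, or to a central backend like memcache.
$settings['cache_default_class'] = 'BackdropDatabaseCache';
```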
Overall, I think this would result in some situations (single-server) being faster than before while impacting other situations (multi-server). It would make simple sites faster, while sites with more complicated architectures may need to make adjustments to optimize. Multi-server environments already require a lot of additional planning, so this may be worth the tradeoff.
...so this may be worth the tradeoff.
Until we get the metrics gathering implemented we cannot be sure, but Backdrop is aiming for the low-end market, and these use cases usually have a single server. So if the speed gain is considerable, I say we go ahead and implement this as the default. There could be a toggle in the advanced settings during setup to allow switching to the database from day 1, and/or a toggle in /admin/config/development/performance
to accommodate use cases where people add multiple servers later in the site life cycle.
...at the very least, and if setting it as the default is dangerous, we could simply add the toggle in /admin/config/development/performance
and include warning text about this feature not being suitable for multi-server setups.
@quicksketch can you explain what you mean by flat-file?
I probably can jump on this task, if you explain it.
@Gormartsen so my intention here had been to implement the full extent of a cache backend that wrote to files in the public files directory, e.g. files/cache/$bin/$salted_sha1. The contents of this file would have to include not only the cached content, but also the created and expiration times of the cache entry. Things get a little more complicated when dealing with cache clears, however. Due to the requirements of clearing data by cache key prefix (and possibly, in the future, by cache tags), we need to keep a record of cache entries by their non-hashed file names, in addition to the hashed files themselves.
So this would result in slower writes and cache clears, but really fast reads. For the page cache, it would be particularly well suited, because cache hits would no longer require a database connection at all. It would not be well-suited to situations where cache writes are frequent, as it would generally be slower than a database cache because you're writing in two locations.
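The read/write path described above can be sketched in a few dozen lines. This is a simplified illustration, not the proposed implementation: the class name is made up, a real backend would implement BackdropCacheInterface, and locking and error handling are omitted.

```php
<?php
// Minimal sketch of a flat-file cache backend (hypothetical class name;
// a real backend would implement BackdropCacheInterface).
class FileCacheSketch {
  protected $dir;
  protected $salt;

  public function __construct($dir, $salt) {
    $this->dir = $dir;
    $this->salt = $salt;
    if (!is_dir($dir)) {
      mkdir($dir, 0700, TRUE);
    }
  }

  // Salting with something like the site private key keeps file names
  // unguessable if the directory is left unprotected.
  protected function path($cid) {
    return $this->dir . '/' . sha1($this->salt . $cid);
  }

  public function set($cid, $data, $expire = 0) {
    $record = array(
      'cid' => $cid,
      'created' => time(),
      'expire' => $expire,
      'data' => $data,
    );
    // Write to a temp file and rename so readers never see a partial file.
    $tmp = $this->path($cid) . '.tmp';
    file_put_contents($tmp, serialize($record));
    rename($tmp, $this->path($cid));
  }

  public function get($cid) {
    $file = $this->path($cid);
    if (!file_exists($file)) {
      return FALSE;
    }
    $record = unserialize(file_get_contents($file));
    // Honor per-entry expiration stored inside the file.
    if ($record['expire'] && $record['expire'] < time()) {
      unlink($file);
      return FALSE;
    }
    return $record;
  }
}
```

The temp-file-plus-rename in set() is what keeps concurrent readers from ever seeing a half-written cache entry.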
So it's pretty close to what you were proposing in https://github.com/backdrop/backdrop-issues/issues/1413, but it would be a standard cache-backend that could be used for any caching.
@quicksketch Please correct me, if I am missing anything here.
We need a FileCache implementation based on BackdropCacheInterface.
I can use BackdropDatabaseCache as an example implementation.
We need the following features:
BackdropCacheInterface::deletePrefix($prefix)
Please also let me know the reason for this; I have never had a need to clear cache by cache key prefix. If I understand the reason, I can find the best implementation for it.

Note: Keeping cache in files has unique features compared to a database:
Performance: Is there a particular reason to use SHA1 instead of MD5?
MD5 is a little bit faster than SHA1. See the following code and results:
<?php
// Micro-benchmark: hash ~1 MB of random data with sha1() and md5(),
// then rank the two by elapsed time (stored in nanoseconds).
echo 'Building random data ...' . PHP_EOL;
$data = '';
for ($i = 0; $i < 64000; $i++) {
  // 16 raw bytes per iteration, ~1 MB total.
  $data .= hash('md5', rand(), TRUE);
}

$results = array();

$time = microtime(TRUE);
sha1($data);
$time = microtime(TRUE) - $time;
$results[$time * 1000000000][] = 'sha1';

$time = microtime(TRUE);
md5($data);
$time = microtime(TRUE) - $time;
$results[$time * 1000000000][] = 'md5';

ksort($results);

echo PHP_EOL . PHP_EOL . 'Results: ' . PHP_EOL;
$i = 1;
foreach ($results as $k => $v) {
  foreach ($v as $v1) {
    echo ' ' . str_pad($i++ . '.', 4, ' ', STR_PAD_LEFT) . ' '
      . str_pad($v1, 30, ' ') . ($k / 1000) . ' microseconds' . PHP_EOL;
  }
}
Results:
1. md5 2204.895 microseconds
2. sha1 2443.075 microseconds
So it's pretty close to what you were proposing in #1413, but it would be a standard cache-backend that could be used for any caching.
Yes, I decided to write general cache backend interface and make it possible to select. See #1434
We need a FileCache implementation based on BackdropCacheInterface. I can use BackdropDatabaseCache as an example implementation.
Yep! That's the basic idea.
clean cached data by expire time per KEY, or keep it as PERMANENT if required.
This in particular is a required feature and the primary reason why using the database may still be necessary.
I never had a need to clean cache by cache key prefix. If I understand a reason, I can find best implementation to do so.
The reason for this is to be able to clear all caches related to a particular module without using a separate cache bin. For example if a module did several cache sets like this:
cache_set('my_module:foo:1', $data);
cache_set('my_module:foo:2', $data);
cache_set('my_module:foo:3', $data);
cache_set('my_module:bar:1', $data);
Then the cache entries within the "foo" group could be cleared with:
cache_clear_all('my_module:foo:', 'cache', TRUE);
// Or:
cache()->deletePrefix('my_module:foo:');
Deleting by cache prefix is not very common. Usually we set up dedicated cache bins these days. But the ability to have queryable cache entries is still required for setting and finding expiration times. In the future if we implement something similar to D8's cache tags (which I think is a good idea), then that functionality would replace the deletePrefix() approach.
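To make the prefix requirement concrete: because the on-disk names are hashes, a file backend needs a second, queryable record of the original keys. A rough sketch, with a hypothetical helper name and no locking around the index file:

```php
<?php
// Prefix deletion for a flat-file cache bin. Hashed file names destroy
// key ordering, so a serialized index of cid => hashed-name is kept
// alongside the cache files (illustrative layout, not Backdrop core).
function filecache_delete_prefix($dir, $prefix) {
  $index_file = $dir . '/.index';
  $index = file_exists($index_file)
    ? unserialize(file_get_contents($index_file))
    : array();

  $deleted = 0;
  foreach ($index as $cid => $hash) {
    if (strpos($cid, $prefix) === 0) {
      @unlink($dir . '/' . $hash); // Remove the cache file itself.
      unset($index[$cid]);         // And its index entry.
      $deleted++;
    }
  }
  file_put_contents($index_file, serialize($index));
  return $deleted;
}
```

This is exactly the "write in two locations" cost mentioned earlier: every set() must also update the index, which is what makes writes slower than the database backend.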
Is there particular reason to use SHA1 instead of MD5 ?
Mostly because SHA1 has a larger pool and lowers any chance of a key conflict. We'll only be calling SHA1 on cache set and get, which will likely only be a few times per request (if using it for the page cache). Performance-wise, in your test the difference between a single SHA1 and an MD5 is 0.00375ms. That shouldn't be an issue. I also have a small concern that cache entries may contain sensitive data, so that's the reason for salting with something like the site private key: if left unprotected, these file names would not be easily guessable.
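The salting argument in code form; the private key value and cache ID here are made up for illustration, and a real site would pull the key from its configuration:

```php
<?php
// Why the file name is sha1(private_key . cid) rather than sha1(cid).
$private_key = 'z8Pq-example-site-private-key'; // Made-up value.
$cid = 'cache_page:http://backdrop.local/';

$unsalted = sha1($cid);               // Guessable by anyone who knows the URL.
$salted = sha1($private_key . $cid);  // Requires knowing the private key.

// SHA1 also has a larger digest than MD5 (160 bits vs 128 bits,
// i.e. 40 vs 32 hex characters), lowering collision odds.
echo $salted . PHP_EOL;
```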
Note: Keeping cache in files has unique features compare to database:
That may be true, but as with storing in APC, Redis, Memcache, or any other cache backend, they each have unique features. In the case of a generic cache backend, we have to be able to implement the same features across all of them.
I agree. What I am trying to say is that tagging could be done without using the database at all, simply by creating a public://filecache/TAGNAME folder and symlinking keys from public://filecache/BIN into the TAG folder.
Then if we need to clear all keys by TAG, we simply read the TAGNAME folder, remove the files from the BIN folder, and remove all the symlinks from the TAG folder.
The same idea could be used for time expiration. We can create a dir public://filecache/EXPIRE/timestamp, symlink keys from BIN there, and in garbageCollection() get all directories with a timestamp less than REQUEST_TIME and clean up the files via the symlinks.
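For illustration, the garbage-collection half of this symlink proposal might look like the following. The EXPIRE/timestamp layout comes from the comment above; the function name and paths are illustrative, and a file system with POSIX symlink support is assumed:

```php
<?php
// Sketch of the proposed symlink-based expiration sweep. Each directory
// under EXPIRE/ is named for an expiration timestamp and holds symlinks
// to the real cache files in BIN.
function filecache_gc($base) {
  foreach (glob($base . '/EXPIRE/*', GLOB_ONLYDIR) as $dir) {
    // The directory name is the expiration timestamp of every link inside.
    if ((int) basename($dir) >= time()) {
      continue; // Not expired yet.
    }
    foreach (glob($dir . '/*') as $link) {
      @unlink(readlink($link)); // Remove the real cache file in BIN.
      unlink($link);            // Remove the symlink itself.
    }
    rmdir($dir);
  }
}
```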
I'm not keen on using symlinks personally. They can be tricky to manage for both developers and site administrators. In D8-land, cache entries for pages have around 50 tags per page (a unique combination of language, each node, comment, user, term, etc. shown on the entire page). If we took the same approach as D8, we'd end up with a symlink (cache tag) for every entity on the entire site. That'd make for a lot of symlinks.
They're also not universally supported across different file systems (e.g. FAT32). In any case, for the time being we don't have tags; we only have cache prefixes.
I see your point. I will think about it.