@Kirin-kun , thanks for writing all this down.
I think this would be more widely seen in a wiki page. OK with you to create one with this info?
@archiecobbs That's fine by me.
Great, thanks.
I put it here: https://github.com/archiecobbs/s3backer/wiki/Case-Study:-Setting-up-NAS-using-ZFS-over-s3backer
Thanks to both @Kirin-kun and @archiecobbs for this!
At @ahmgithubahm's request, I'm publishing my setup for a NAS on S3 here. It's meant to be used for some backups and for file sharing between developers.
By no means do I think it's the best setup, or even the least costly, but it should reduce the price compared to a full EBS solution (even, in some cases, compared to sc1 EBS, which has weak performance). I haven't run the numbers yet.
We had a SoftNAS, but I discovered that it actually uses s3backer+ZFS under the hood for its "Cloud disks". I got interested and dug further. The version of s3backer they used was not recent and didn't support server-side encryption with a custom key, and the ZFS in there didn't support trimming.
It may be great for people who just want to click and be done with it, but I'm a command-line guy, so I thought "we don't need to pay for that!". So I tossed aside the idea of renewing our subscription (all the more so since they changed their licensing model when they got bought out by another company).
So, finally, my requirements were:
So, ZFS on s3backer.
s3fs, s3ql and others can't hold a candle to it, because it gives you the choice to use the filesystem you want.
A ZFS vdev on s3backer is easily extensible. You just have to restart s3backer with a bigger size and --force, and the pool will expand automatically (or you can expand it manually); see the sketch below. s3backer prints a dire warning, but nothing special actually happens if you're only growing the volume.
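As an illustration only (POOLNAME, the bucket name and the mount point are placeholders, and the sizes are arbitrary), growing the volume might look like this:
# s3backer --size=2T --force [other options] MY-S3BACKER-BUCKET /mnt/s3backer
# zpool online -e POOLNAME /mnt/s3backer/file
Alternatively, with "zpool set autoexpand=on POOLNAME" set on the pool, the expansion may happen without the explicit "zpool online -e" step.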
No redundancy in the pool: redundancy is for the case where a hardware device fails. S3 is not supposed to fail (it's designed for 99.999999999% durability); basically, if your data has been successfully written to S3, you'd have to be really unlucky to lose it.
And having redundancy would defeat the cost-saving purpose anyway.
The point of failure, where you could lose data, becomes an OS crash, where data in memory hasn't yet been written to stable storage.
Part of that is handled by ZFS with the ZIL for synchronous writes: the data is stored there before it is actually written out to the pool's vdevs, so the sync write can be acknowledged as soon as possible. The ZIL is not a write cache; it's almost never read, except after a crash, to replay data that hadn't been committed to the vdevs.
By default, the ZIL is embedded in the pool itself. So, with S3 as a backend, data would be written twice to the pool, at the speed of HTTP. This would probably give poor performance for synchronous writes (NFS does synchronous writes by default).
The answer is a SLOG, a vdev dedicated to the ZIL. When synchronous writes are required, ZFS writes them to the fast SSD and acknowledges them, while queuing them in memory for the actual write to the backend. This improves the performance of a ZFS-on-S3 system.
If you can't bear to lose even the in-memory asynchronous writes that would be lost in a crash, you can set sync=always on the pool, meaning that async writes are treated like sync ones. But it hurts performance (by how much, I don't know).
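For reference, a minimal sketch of toggling this on an existing pool (POOLNAME is a placeholder; it can also be set per dataset):
# zfs set sync=always POOLNAME
# zfs get sync POOLNAME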
Next are reads. You can also add an L2ARC, an adaptive cache that ZFS uses to smartly cache frequently used data that isn't important enough to be kept in the ARC (RAM).
So my setup is:
If there were really a need for a GUI, I think Webmin would do the job.
I'm thinking about putting a lifecycle policy on the bucket, so that blocks which aren't often (re)written get migrated to cheaper S3 storage classes, but I'm not sure it's worth the performance hit, or whether it would even work.
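Purely as a sketch of what such a policy could look like (the bucket name and the 30-day threshold are made up, and whether s3backer copes well with transitioned blocks is untested), a rule moving objects to STANDARD_IA could be applied with the AWS CLI:
# aws s3api put-bucket-lifecycle-configuration --bucket MY-S3BACKER-BUCKET \
    --lifecycle-configuration '{"Rules":[{"ID":"cool-old-blocks","Status":"Enabled","Filter":{"Prefix":""},"Transitions":[{"Days":30,"StorageClass":"STANDARD_IA"}]}]}'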
The 50 GB gp2 volume is sliced into 3 partitions of equal size (around 17 GB each):
Format /dev/xvdb1 as XFS and mount it as /mnt/cache:
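For instance, with default mkfs options:
# mkfs.xfs /dev/xvdb1
# mkdir -p /mnt/cache
# mount /dev/xvdb1 /mnt/cache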
Start s3backer with a config file:
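Purely as an illustration of the shape of such a setup (bucket name, block size, volume size and pool name are placeholders; s3backer exposes the backing store as a single file, named "file" by default, under its mount point):
# s3backer --blockSize=1M --size=1T --listBlocks MY-S3BACKER-BUCKET /mnt/s3backer
# zpool create POOLNAME /mnt/s3backer/file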
Can add "-O sync=always" if you want to secure all writes to the SLOG, even async ones (cost is a performance hit for async I/O loads)
Add L2ARC:
# zpool add POOLNAME cache /dev/xvdb2
Add SLOG for the ZIL:
# zpool add POOLNAME log /dev/xvdb3
Verify:
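Typically with the standard ZFS status commands (output omitted):
# zpool status POOLNAME
# zpool iostat -v POOLNAME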
With regular trimming, you can reclaim the blocks of deleted files that autotrim didn't catch, so you only use (and pay for) the space you actually need.
With the recent fixes, performance is good even while a trim is running.
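For reference, the relevant commands (POOLNAME is a placeholder; autotrim needs a reasonably recent ZFS):
# zpool set autotrim=on POOLNAME
# zpool trim POOLNAME
# zpool status -t POOLNAME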
The only "grip" I have is that the --listBlocks is taking ages once you have a lot of data in your pool, but it's actually necessary if you want the first trim to not kill your system. So, best case, for now, is to restart your s3backer as seldom as possible.
If people have more insights about this, they are welcome to give them.
It also serves as a testimonial, to thank @archiecobbs for this software and his responsiveness in fixing bugs and even adding new features within a few hours! If I had a way, I'd buy you a beer (or a juice if you don't drink alcohol).
And I wish a Merry Xmas to all and their families!