@Kirin-kun , thanks for writing all this down.
I think this would be more widely seen in a wiki page. OK with you to create one with this info?
@archiecobbs That's fine by me.
Great, thanks.
I put it here: https://github.com/archiecobbs/s3backer/wiki/Case-Study:-Setting-up-NAS-using-ZFS-over-s3backer
Thanks to both @Kirin-kun and @archiecobbs for this!
At @ahmgithubahm's request, I'm publishing my setup for a NAS on S3 here. It's meant to be used for some backups and for file sharing between developers.
By no means do I think it's the best setup, or even the least costly, but it should reduce the price compared to a full EBS solution (even, in some cases, compared to sc1 EBS, which has weak performance). I haven't run the numbers yet.
We had a SoftNAS, but I discovered that it actually uses s3backer+ZFS under the hood for its "Cloud disks". I got interested and dug further. The version of s3backer they used was not recent and didn't support server-side encryption with a custom key, and the ZFS in there didn't support trimming.
It may be great for people who just want to click and be done with it, but I'm a command-line guy, so I thought "we don't need to pay for that!". So I tossed aside the idea of renewing our subscription (all the more so since they changed their licensing model when they got bought out by another company).
So, finally, my requirements were:
So, ZFS on s3backer.
s3fs, s3ql and others can't hold a candle to it, because it gives you the choice to use the filesystem you want.
A ZFS vdev on s3backer is easily extensible. You just have to restart s3backer with a bigger size and --force, and the pool will expand automatically (or you can expand it manually); see the sketch below. s3backer prints a dire warning, but nothing special actually happens if you're only growing the volume.
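As an illustration only (POOLNAME, the bucket name and the mount point are placeholders, and the sizes are arbitrary), growing the volume might look like this:
# s3backer --size=2T --force [other options] MY-S3BACKER-BUCKET /mnt/s3backer
# zpool online -e POOLNAME /mnt/s3backer/file
Alternatively, with "zpool set autoexpand=on POOLNAME" set on the pool, the expansion may happen without the explicit "zpool online -e" step.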
No redundancy in the pool: redundancy is for the case where a hardware device fails. S3 is not supposed to fail (it's designed for 99.999999999% durability); basically, if your data has been successfully written to S3, you'd have to be really unlucky to lose it.
And having redundancy would defeat the cost-saving purpose anyway.
The point of failure, where you could lose data, becomes an OS crash, where data in memory hasn't yet been written to stable storage.
Part of that is handled by ZFS with the ZIL for synchronous writes: the data is stored there before it is actually written out to the pool's vdevs, so the sync write can be acknowledged as soon as possible. The ZIL is not a write cache; it's almost never read, except after a crash, to replay data that hadn't been committed to the vdevs.
By default, the ZIL is embedded in the pool itself. So, with S3 as a backend, data would be written twice to the pool, at the speed of HTTP. This would probably give poor performance for synchronous writes (NFS does synchronous writes by default).
The answer is a SLOG, a vdev dedicated to the ZIL. When synchronous writes are required, ZFS writes them to the fast SSD and acknowledges them, while queuing them in memory for the actual write to the backend. This improves the performance of a ZFS-on-S3 system.
If you can't bear to lose even the in-memory asynchronous writes that would be lost in a crash, you can set sync=always on the pool, meaning that async writes are treated like sync ones. But it hurts performance (by how much, I don't know).
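For reference, a minimal sketch of toggling this on an existing pool (POOLNAME is a placeholder; it can also be set per dataset):
# zfs set sync=always POOLNAME
# zfs get sync POOLNAME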
Next are reads. You can also add an L2ARC, an adaptive cache that ZFS uses to smartly cache frequently used data that isn't important enough to be kept in the ARC (RAM).
So my setup is:
If there were really a need for a GUI, I think Webmin would do the job.
I'm thinking about putting a lifecycle policy on the bucket, so that blocks which aren't often (re)written get migrated to cheaper S3 storage classes, but I'm not sure it's worth the performance hit, or whether it would even work.
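Purely as a sketch of what such a policy could look like (the bucket name and the 30-day threshold are made up, and whether s3backer copes well with transitioned blocks is untested), a rule moving objects to STANDARD_IA could be applied with the AWS CLI:
# aws s3api put-bucket-lifecycle-configuration --bucket MY-S3BACKER-BUCKET \
    --lifecycle-configuration '{"Rules":[{"ID":"cool-old-blocks","Status":"Enabled","Filter":{"Prefix":""},"Transitions":[{"Days":30,"StorageClass":"STANDARD_IA"}]}]}'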
The 50 GB gp2 volume is sliced into 3 partitions of equal size (around 17 GB each):
Format /dev/xvdb1 as XFS and mount it as /mnt/cache:
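For instance, with default mkfs options:
# mkfs.xfs /dev/xvdb1
# mkdir -p /mnt/cache
# mount /dev/xvdb1 /mnt/cache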
Start s3backer with a config file:
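Purely as an illustration of the shape of such a setup (bucket name, block size, volume size and pool name are placeholders; s3backer exposes the backing store as a single file, named "file" by default, under its mount point):
# s3backer --blockSize=1M --size=1T --listBlocks MY-S3BACKER-BUCKET /mnt/s3backer
# zpool create POOLNAME /mnt/s3backer/file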
Can add "-O sync=always" if you want to secure all writes to the SLOG, even async ones (cost is a performance hit for async I/O loads)
Add L2ARC:
# zpool add POOLNAME cache /dev/xvdb2
Add SLOG for the ZIL:
# zpool add POOLNAME log /dev/xvdb3
Verify:
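Typically with the standard ZFS status commands (output omitted):
# zpool status POOLNAME
# zpool iostat -v POOLNAME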
With regular trimming, you can reclaim the blocks of deleted files that autotrim didn't catch, so you only use (and pay for) the space you actually need.
With the recent fixes, performance is good even while a trim is running.
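For reference, the relevant commands (POOLNAME is a placeholder; autotrim needs a reasonably recent ZFS):
# zpool set autotrim=on POOLNAME
# zpool trim POOLNAME
# zpool status -t POOLNAME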
The only "grip" I have is that the --listBlocks is taking ages once you have a lot of data in your pool, but it's actually necessary if you want the first trim to not kill your system. So, best case, for now, is to restart your s3backer as seldom as possible.
If people have more insights about this, they are welcome to give them.
It also serves as a testimonial, to thank @archiecobbs for this software and his responsiveness in fixing bugs and even adding new features within a few hours! If I had a way, I'd buy you a beer (or a juice if you don't drink alcohol).
And I wish a Merry Xmas to all and their families!