Open-CAS / open-cas-linux

Open CAS Linux
https://open-cas.com
BSD 3-Clause "New" or "Revised" License

Open CAS write-back caching for a home NAS with large media files on a multi-tier (NVMe+SSD+HDD) setup #1487

Open · TheLinuxGuy opened this issue 1 month ago

TheLinuxGuy commented 1 month ago

Question

Looking to ensure that I am using the correct settings to achieve my goal: always read from and write to the NVMe disks, and promote data from the HDDs as soon as files are accessed.

Motivation

I'm comparing bcache to Open CAS to see whether it fits my needs. I have some notes in one of my repositories here.

Background:

My goal, restated: every read and write should be served from the NVMe tier first, and data sitting on the HDDs should be promoted into the cache as soon as files are accessed.
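For reference, a minimal casadm sketch of that intent, using the device names from the lsblk output below. This is an illustration, not a verbatim transcript of my setup; flag spellings are per casadm --help, and the promotion line is redundant (always is the default policy) and shown only to make the intent explicit.

# Start the cache on the NVMe device in write-back mode
casadm --start-cache --cache-device /dev/nvme1n1 --cache-mode wb --cache-id 1

# Attach the RAID5 array as the core device; this exposes /dev/cas1-1
casadm --add-core --cache-id 1 --core-device /dev/md127

# Promote data into the cache on first access ("always" is the default policy)
casadm --set-param --name promotion --cache-id 1 --policy always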

Your Environment

lsblk

# lsblk
NAME         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda            8:0    0  12.7T  0 disk
sdb            8:16   0  16.4T  0 disk
sdc            8:32   0   9.1T  0 disk
├─sdc1         8:33   0   7.3T  0 part
│ └─md127      9:127  0  21.8T  0 raid5
│   └─cas1-1 250:0    0  21.8T  0 disk  /mnt/btrfs
└─sdc2         8:34   0   1.8T  0 part
  └─md126      9:126  0   1.8T  0 raid1
sdd            8:48   0   9.1T  0 disk
├─sdd1         8:49   0   7.3T  0 part
│ └─md127      9:127  0  21.8T  0 raid5
│   └─cas1-1 250:0    0  21.8T  0 disk  /mnt/btrfs
└─sdd2         8:50   0   1.8T  0 part
  └─md126      9:126  0   1.8T  0 raid1
sde            8:64   0   7.3T  0 disk
└─sde1         8:65   0   7.3T  0 part
  └─md127      9:127  0  21.8T  0 raid5
    └─cas1-1 250:0    0  21.8T  0 disk  /mnt/btrfs
sdf            8:80   0   7.3T  0 disk
└─sdf1         8:81   0   7.3T  0 part
  └─md127      9:127  0  21.8T  0 raid5
    └─cas1-1 250:0    0  21.8T  0 disk  /mnt/btrfs
zd0          230:0    0    32G  0 disk
├─zd0p1      230:1    0    31G  0 part
├─zd0p2      230:2    0     1K  0 part
└─zd0p5      230:5    0   975M  0 part
zd16         230:16   0    32G  0 disk
zd32         230:32   0    32G  0 disk
├─zd32p1     230:33   0    31G  0 part
├─zd32p2     230:34   0     1K  0 part
└─zd32p5     230:37   0   975M  0 part
nvme1n1      259:0    0 931.5G  0 disk
nvme0n1      259:1    0 931.5G  0 disk
nvme2n1      259:2    0 119.2G  0 disk
├─nvme2n1p1  259:3    0  1007K  0 part
├─nvme2n1p2  259:4    0     1G  0 part
└─nvme2n1p3  259:5    0   118G  0 part

casadm -P

# casadm -P -i 1
Cache Id                  1
Cache Size                241643190 [4KiB Blocks] / 921.80 [GiB]
Cache Device              /dev/nvme1n1
Exported Object           -
Core Devices              1
Inactive Core Devices     0
Write Policy              wb
Cleaning Policy           nop
Promotion Policy          always
Cache line size           4 [KiB]
Metadata Memory Footprint 10.6 [GiB]
Dirty for                 695 [s] / 11 [m] 35 [s]
Status                    Running

╔══════════════════╤═══════════╤═══════╤═════════════╗
║ Usage statistics │   Count   │   %   │   Units     ║
╠══════════════════╪═══════════╪═══════╪═════════════╣
║ Occupancy        │      5356 │   0.0 │ 4KiB Blocks ║
║ Free             │ 241637834 │ 100.0 │ 4KiB Blocks ║
║ Clean            │         2 │   0.0 │ 4KiB Blocks ║
║ Dirty            │      5354 │   0.0 │ 4KiB Blocks ║
╚══════════════════╧═══════════╧═══════╧═════════════╝

╔══════════════════════╤══════════╤═══════╤══════════╗
║ Request statistics   │  Count   │   %   │ Units    ║
╠══════════════════════╪══════════╪═══════╪══════════╣
║ Read hits            │ 29676031 │  67.7 │ Requests ║
║ Read partial misses  │        0 │   0.0 │ Requests ║
║ Read full misses     │       40 │   0.0 │ Requests ║
║ Read total           │ 29676071 │  67.7 │ Requests ║
╟──────────────────────┼──────────┼───────┼──────────╢
║ Write hits           │ 10558462 │  24.1 │ Requests ║
║ Write partial misses │        0 │   0.0 │ Requests ║
║ Write full misses    │  3628633 │   8.3 │ Requests ║
║ Write total          │ 14187095 │  32.3 │ Requests ║
╟──────────────────────┼──────────┼───────┼──────────╢
║ Pass-Through reads   │        0 │   0.0 │ Requests ║
║ Pass-Through writes  │        0 │   0.0 │ Requests ║
║ Serviced requests    │ 43863166 │ 100.0 │ Requests ║
╟──────────────────────┼──────────┼───────┼──────────╢
║ Total requests       │ 43863166 │ 100.0 │ Requests ║
╚══════════════════════╧══════════╧═══════╧══════════╝

╔══════════════════════════════════╤═══════════╤═══════╤═════════════╗
║ Block statistics                 │   Count   │   %   │   Units     ║
╠══════════════════════════════════╪═══════════╪═══════╪═════════════╣
║ Reads from core(s)               │       264 │ 100.0 │ 4KiB Blocks ║
║ Writes to core(s)                │         0 │   0.0 │ 4KiB Blocks ║
║ Total to/from core(s)            │       264 │ 100.0 │ 4KiB Blocks ║
╟──────────────────────────────────┼───────────┼───────┼─────────────╢
║ Reads from cache                 │  74297900 │  63.3 │ 4KiB Blocks ║
║ Writes to cache                  │  43010161 │  36.7 │ 4KiB Blocks ║
║ Total to/from cache              │ 117308061 │ 100.0 │ 4KiB Blocks ║
╟──────────────────────────────────┼───────────┼───────┼─────────────╢
║ Reads from exported object(s)    │  74298164 │  63.3 │ 4KiB Blocks ║
║ Writes to exported object(s)     │  43009897 │  36.7 │ 4KiB Blocks ║
║ Total to/from exported object(s) │ 117308061 │ 100.0 │ 4KiB Blocks ║
╚══════════════════════════════════╧═══════════╧═══════╧═════════════╝

╔════════════════════╤═══════╤═════╤══════════╗
║ Error statistics   │ Count │  %  │ Units    ║
╠════════════════════╪═══════╪═════╪══════════╣
║ Cache read errors  │     0 │ 0.0 │ Requests ║
║ Cache write errors │     0 │ 0.0 │ Requests ║
║ Cache total errors │     0 │ 0.0 │ Requests ║
╟────────────────────┼───────┼─────┼──────────╢
║ Core read errors   │     0 │ 0.0 │ Requests ║
║ Core write errors  │     0 │ 0.0 │ Requests ║
║ Core total errors  │     0 │ 0.0 │ Requests ║
╟────────────────────┼───────┼─────┼──────────╢
║ Total errors       │     0 │ 0.0 │ Requests ║
╚════════════════════╧═══════╧═════╧══════════╝

casadm -L

# casadm -L
type    id   disk           status    write policy   device
cache   1    /dev/nvme1n1   Running   wb             -
└core   1    /dev/md127     Active    -              /dev/cas1-1
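One thing worth flagging in the casadm -P output above: with Write Policy wb and Cleaning Policy nop, dirty lines are only written back to the RAID5 array on eviction or an explicit flush, so the Dirty count will keep growing. A hedged sketch of two ways to drain it, assuming cache id 1 as above (verify the flags against your casadm build):

# One-off: flush all dirty cache lines back to the core device
casadm --flush-cache --cache-id 1

# Ongoing: switch to a background cleaning policy (alru) so dirty data drains automatically
casadm --set-param --name cleaning --cache-id 1 --policy alru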
TheLinuxGuy commented 1 month ago

Also, an important question: the btrfs filesystem does not seem to be on the supported filesystems list: https://open-cas.com/guide_system_requirements.html

Btrfs seems to be working okay in my testbench with Open CAS... is the Open CAS team testing btrfs, or planning to support btrfs or other advanced filesystems like zfs? Ext4 did give me better benchmarks, but I'd rather use btrfs at a minimum... this is not an issue with bcache.

robertbaldyga commented 1 month ago

@TheLinuxGuy Technically Open CAS should be able to handle any filesystem, as it conforms to the standard Linux bdev interface, so btrfs and zfs almost certainly work just fine. What "supported" means in our case is that we actually test the listed filesystems. Open CAS has quite an extensive set of functional tests which we execute for each release. I'm not sure how much the bcache developers test it with various filesystems - I was not able to find this information - but extending our test set is certainly possible. So far we have not considered adding tests for other filesystems because no one asked for them.
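To make that concrete (an illustration, not something from the thread): because the exported object /dev/cas1-1 is a regular block device, putting btrfs on it is the ordinary mkfs/mount flow, exactly the topology the reporter's lsblk already shows.

# The exported object behaves like any other block device
mkfs.btrfs /dev/cas1-1
mount /dev/cas1-1 /mnt/btrfs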

We'll try to evaluate how much it would cost to include btrfs and zfs in our testing scope. For context, the full execution of the Open CAS functional tests currently takes about a week (day and night), so the cost is not negligible. We value the stability of the project, and as much as we'd like to support every single configuration and scenario, we first need to make sure that whatever we decide to support, we can do at excellent quality over a long period of time.

TheLinuxGuy commented 4 weeks ago

> currently the full execution of Open CAS functional tests takes about a week (day and night), so the cost is not negligible. We value stability of the project, and as much as we'd like to support every single configuration and scenario, we first need to make sure that whatever we decide to support, we are able to do it in excellent quality over long period of time.

Understood, thank you for the detailed explanation and for the consideration.

XFS and ext4 are reliable and great - but snapshotting, checksumming, and btrfs-send/zfs-send are modern filesystem features that I feel are in demand. IIRC Facebook uses btrfs across their entire production fleet. https://facebookmicrosites.github.io/btrfs/docs/btrfs-facebook.html