enterJazz opened this issue 1 year ago
For fio, we can try to use the same configuration as Spool (ATC'20)
Maybe we also want to measure it with 2MB to see the huge page effect (if any)
TODO:
Here is a diagram of the Linux storage stack: https://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram
We should use `--direct` (meaning using O_DIRECT) so that we measure the performance of the disk I/O, not the memory cache, and `O_SYNC` as well to ensure synchronous completion and avoid tail latency (as in https://blog.cloudflare.com/speeding-up-linux-disk-encryption/):
O_DIRECT (since Linux 2.4.10) Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT.
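A minimal fio sketch along these lines (the file path, size, and runtime are placeholders, not the Spool configuration):

```sh
# 4k random writes against a test file, opened with O_DIRECT (--direct=1) and
# O_SYNC (--sync=1) so the page cache is bypassed and completions are synchronous.
# Path, size, and runtime are illustrative only.
fio --name=direct-sync-randwrite \
    --filename=/mnt/test/fio.dat --size=4G \
    --rw=randwrite --bs=4k \
    --ioengine=psync \
    --direct=1 --sync=1 \
    --runtime=60 --time_based \
    --group_reporting
```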
About dm-crypt: we probably want to use the no_read_workqueue/no_write_workqueue options to avoid asynchronous queueing and get better performance (https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/dm-crypt.html)
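A sketch of how those flags can be set from userspace, assuming a LUKS2 device at /dev/nvme0n1p2 (placeholder) and cryptsetup >= 2.3.4, which exposes them as --perf-* options:

```sh
# Open the dm-crypt mapping with the kernel workqueues disabled for reads and
# writes; --persistent stores the flags in the LUKS2 header so later opens keep them.
cryptsetup open /dev/nvme0n1p2 cryptbench \
    --perf-no_read_workqueue \
    --perf-no_write_workqueue \
    --persistent
```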
Here is a diagram of the Linux storage stack: https://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram
* device mapper (dm-crypt, dm-verity, etc.) works on the block layer
* O_DIRECT is a file system option; usually it means the fs does not cache (https://man7.org/linux/man-pages/man2/open.2.html)
* we should use fio with `--direct` (meaning using O_DIRECT) so that we measure performance of the disk I/O, not the memory cache
* also note that aio requires O_DIRECT, as otherwise it is likely to become blocking I/O (cf. https://lse.sourceforge.net/io/aio.html)
this is directly related to the comments in P5, right?
Yes it is; I also believe other research papers (e.g., Spool) use direct I/O for measurements.
For Integrity:
TODO: investigate what is suitable for our case
Both dm-verity and dm-crypt provide block level integrity protection.
dm-verity provides block level integrity protection for read-only file systems, while dm-crypt provides block level integrity protection, with minimum penalty, for filesystems requiring full disk encryption.
dm-integrity provides a lighter weight read-write block level integrity protection for file systems not requiring full disk encryption, but which do require writability.
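A rough command sketch of the read-only vs. read-write options (device names are placeholders):

```sh
# dm-verity: read-only integrity via a Merkle tree over the data device.
veritysetup format /dev/vdb /dev/vdc        # builds the hash tree, prints the root hash
veritysetup open /dev/vdb verified /dev/vdc <root-hash-printed-by-format>

# dm-integrity: standalone read-write integrity (no encryption).
integritysetup format /dev/vdd
integritysetup open /dev/vdd protected
mkfs.ext4 /dev/mapper/protected
```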
- [Data integrity protection with cryptsetup tools](https://archive.fosdem.org/2018/schedule/event/cryptsetup/attachments/slides/2506/export/events/attachments/cryptsetup/slides/2506/fosdem18_cryptsetup_aead.pdf)
- [use dm-crypt + dm-integrity + dm-raid](https://gist.github.com/MawKKe/caa2bbf7edcc072129d73b61ae7815fb)
- [AN INTRODUCTION TO DM-VERITY IN EMBEDDED DEVICE SECURITY](https://www.starlab.io/blog/dm-verity-in-embedded-device-security)
- https://ieeexplore.ieee.org/abstract/document/10070924/
dm-crypt also offers integrity checking of read-only filesystems where the entire block device is verified at once. This approach is particularly time-consuming and thus is typically used only during device startup [6], [44]. dm-verity [6] uses a software-maintained Merkle tree structure to compute and validate hashes of read-only data blocks against pre-computed hashes. In contrast, dm-integrity keeps individual hashes for each data block during runtime, which allows verification for read/write systems. However, it cannot detect physical attacks such as reordering the blocks within the same device due to the lack of a secure root of trust in the system.
spdk:
I just finished creating the base benchmark runner - see #10
To execute it:
cd ./tools/storage-io-bm-runner
nix-shell
source .venv/bin/activate
cd ./bm-runner-tool
mkdir -p resources
python3 main.py --name my-bm --stack=native-io --storage-level=file-level --measurement-type=io-average-latency --resource-dir=./resources
More options can be viewed using `python3 main.py --help`.
All parameters of P6 are currently implemented; the others are still lacking.
From the TDX Linux Guest Kernel Security Specification:
The virtIO subsystem is also highly configurable with different options possible for the virtual queue’s types, transportation, etc. For the virtual queues, currently the only mode that was hardened (by performing code audit and fuzzing activities outlined in Intel® Trust Domain Extension Guest Linux Kernel Hardening Strategy) is a split virtqueue without indirect descriptor support, so this mode is the only one recommended for the secure virtio communication.
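A hedged QEMU sketch of that recommendation for a virtio-blk disk (the image path and machine options are placeholders; the packed / indirect_desc properties assume a reasonably recent QEMU):

```sh
# Split virtqueue (packed ring off) without indirect descriptor support.
qemu-system-x86_64 -machine q35,accel=kvm -m 4G -smp 4 \
    -drive file=disk.img,if=none,id=disk0,format=raw \
    -device virtio-blk-pci,drive=disk0,packed=off,indirect_desc=off
```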
TODO:
We also want to have CPU time breakdown figures like bifrost
Here is a script to measure breakdown (for SEV)
We can control swiotlb via kernel command-line parameters:
swiotlb= [ARM,IA-64,PPC,MIPS,X86]
Format: { <int> [,<int>] | force | noforce }
<int> -- Number of I/O TLB slabs
<int> -- Second integer after comma. Number of swiotlb
areas with their own lock. Will be rounded up
to a power of 2.
force -- force using of bounce buffers even if they
wouldn't be automatically used by the kernel
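For example, setting it via GRUB (values are illustrative only; 262144 slabs * 2 KiB per slab = 512 MiB of bounce buffers):

```sh
# /etc/default/grub: force swiotlb bounce buffers and enlarge the pool.
GRUB_CMDLINE_LINUX="swiotlb=262144,force"

# Regenerate the GRUB config afterwards (distro-dependent), e.g.:
sudo update-grub
```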
Other than block-level encryption, there are also fs-level encryption methods.
This shows some benchmarking results comparing dm-crypt, eCryptfs, and fscrypt:
TODO
NOTE
Summary:
NOTE: splitting data from file metadata may be key to optimization - so fs-level is not out of the race
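On the fs-level side, a minimal fscrypt sketch for comparison runs (ext4; the device, mount point, and directory are placeholders, and this assumes the Google fscrypt userspace tool):

```sh
sudo tune2fs -O encrypt /dev/nvme0n1p3   # enable the ext4 "encrypt" feature
sudo fscrypt setup                       # one-time global setup (/etc/fscrypt.conf)
sudo fscrypt setup /mnt/data             # per-filesystem metadata
fscrypt encrypt /mnt/data/bench          # protect the benchmark directory with a new policy
```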
Integrity is now implemented as well :checkered_flag:
Adding VM execution support next; as the VM setup is pretty complex and requires changing things in the BIOS / management interface etc, I will NOT include the VM setup in the tool.
Instead, to get VM results, one will execute the tool inside of the VM. If this approach is not enough to ensure reproducibility, we can maybe come up with a hybrid solution, e.g. one passes the tool parameters over ssh into the VM, where the tool itself executes the BMs.
vmsh and ushell did a similar thing for the automated test (#3) https://github.com/TUM-DSE/ushell/blob/main/misc/tests/qemu.py
For now some manual work is fine, but we want to have this automated in the future.
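A sketch of the hybrid approach (host name, user, and the tool's checkout path inside the guest are placeholders):

```sh
# Pass the benchmark invocation over ssh so the tool itself runs inside the guest.
ssh benchuser@cvm-guest 'cd ~/storage-io-bm-runner/bm-runner-tool && \
    source ../.venv/bin/activate && \
    mkdir -p resources && \
    python3 main.py --name my-bm --stack=native-io --storage-level=file-level \
        --measurement-type=io-average-latency --resource-dir=./resources'
```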
fio: write to the device directly instead of to a file (related to P5). With the following config, we receive unexpected results (investigated more closely on bw tests, however also present on iops and alat):

loop=5, size=4G:

loop=1, size=1G:
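For reference, a hedged sketch of such a raw-device run (the device path and job parameters are placeholders; this destroys the device's contents, so use a dedicated benchmark disk):

```sh
# Sequential-write bandwidth against the raw block device instead of a file.
sudo fio --name=raw-bw --filename=/dev/nvme0n1 \
    --rw=write --bs=128k \
    --ioengine=libaio --iodepth=32 --direct=1 \
    --loops=5 --size=4G \
    --group_reporting
```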
TODO:
vhost configuration
(When using a new disk.) Before running any benchmarks, we should fill the NVMe disk with random data.
Intel
The usual practice is to keep writing to the disk after formatting, filling it up and making it stable. Take the Intel SSD DC P3700 800GB as an example. Usually, it is sequentially written with 4KB block size for two hours, and then randomly written for one hour. In addition, during the test, the ramp_time in the fio parameter can be set larger to avoid an initial unreasonably high value being calculated in the final result. https://www.intel.com/content/www/us/en/developer/articles/technical/evaluate-performance-for-storage-performance-development-kit-spdk-based-nvme-ssd.html
AMD
It is recommended to run the following workloads with twice the advertised capacity of the SSD to guarantee that all available memory is filled with data including the factory provisioned area.
- Secure erase the SSD
- Fill SSD with 128k sequential data twice
- Fill the drive with 4k random data

https://www.amd.com/en/server-docs/nvme-ssd-performance-evaluation-guide-for-windows-server-2016-and-red-hat-enterprise
Micron
Precondition: Following SNIA’s Performance Test Specification for workload-independent precondition, we write the drive with 128KB sequential transfers aligned to 4K boundaries over 2X the drive’s advertised capacity https://www.micron.com/-/media/client/global/documents/products/technical-marketing-brief/brief_ssd_performance_measure.pdf
Reference
A “purge” of the SSD before preconditioning is desirable: https://www.micron.com/-/media/client/global/documents/products/technical-marketing-brief/brief_ssd_performance_measure.pdf
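A hedged preconditioning sketch following the guidance above (the device path is a placeholder; this irreversibly overwrites the drive):

```sh
# "Purge" step (or use a vendor secure-erase tool instead).
sudo blkdiscard /dev/nvme0n1

# Workload-independent precondition: 128k sequential writes over ~2x the capacity.
sudo fio --name=precond-seq --filename=/dev/nvme0n1 \
    --rw=write --bs=128k --direct=1 \
    --ioengine=libaio --iodepth=32 --loops=2

# Then steady-state random writes (roughly the Intel guidance of ~1 hour).
sudo fio --name=precond-rand --filename=/dev/nvme0n1 \
    --rw=randwrite --bs=4k --direct=1 \
    --ioengine=libaio --iodepth=32 \
    --runtime=3600 --time_based
```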
TODO
For real world benchmarks we can use:
In this issue, I catalog the different storage-IO `fio` benchmark variants. Additionally, I list the completion status of the benchmarks in question and where to find them.

Parameters

In this section, we list the different benchmark parameters, such as the storage IO software stack (virtio, SPDK, ...) or the guest type (VM, CVM). The Cartesian product of all these parameters produces the total set of benchmarks.
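Purely as an illustration of how that parameter space expands, a hypothetical wrapper loop (the --stack values other than native-io and the loop dimensions are assumptions, not flags the tool necessarily supports):

```sh
# Every (guest type x storage stack x encryption) combination becomes one run;
# the guest/encryption dimensions are only baked into the run name here.
for guest in native vm cvm; do
  for stack in native-io virtio-blk vhost-spdk; do
    for crypt in plain dm-crypt; do
      python3 main.py --name "bm-${guest}-${stack}-${crypt}" \
          --stack="${stack}" --storage-level=file-level \
          --measurement-type=io-average-latency --resource-dir=./resources
    done
  done
done
```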
Environment Specific Params
Parameter 1 (P1): Guest Type
P2: Guest Configuration / Resources
Non-Env Specific Params
P3: Storage IO Software Stack
The following params only concern VMs:
- virtio-(blk|nvme|scsi)
- vhost-spdk w/ polling

P4: Encryption
- dm-crypt (on / off) (R: Kernel IO)
- dm-verity in combination
- spdk encryption (if available; otherwise, we need to write our own encryption)

P4.5: Integrity
- dm-verity (on / off) - also block level (inside kernel)
- spdk - own integrity (maybe we also need to implement this)

P5: Storage Level
P6: Measurement Type
Non-Param Config
Further Optimization Areas (non-params)