LinearTapeFileSystem / ltfs

Reference implementation of the LTFS format Spec for stand alone tape drive
BSD 3-Clause "New" or "Revised" License

Nonlinear performance scaling when writing to multiple drives #65

Closed richard42 closed 6 years ago

richard42 commented 6 years ago

Over the past several years I have developed software for my employer for high-performance tar and LTFS tape reading and writing on LTO tape drives. This software runs primarily under macOS, but we also support some Linux installations. It is used in large-scale media production, and it's not uncommon for us to write 10-20 terabytes of source content per day on LTO 6/7/8 tapes for periods of a month or more. We have observed non-linear performance scaling when writing to many tape drives at the same time and would like to understand and eliminate this phenomenon to maximize our tape write performance. We speculate that there may be some underlying architectural feature of the LTFS driver which is causing the performance degradation, because all of the other pieces of our system should be able to support the total aggregate bandwidth.

I have attached a spreadsheet which shows some performance numbers from a recent test series. These were taken on a system with 4 LTO6 drives. We typically write to mirrored (duplicate) tape sets. Our software works by having a Reader thread which reads data from a filesystem and places buffers of this input data into separate queues, one per tape drive. There is also one Writer thread per tape drive, which dequeues these buffers and writes the data to the LTFS filesystem via POSIX C open/write functions. OS filesystem caching is disabled (on both the read and write sides) to avoid polluting the RAM with unnecessary disk cache data.
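
Conceptually, each Writer thread's inner loop is equivalent to something like the sketch below. This is only an illustration, not our production code: the chunk size and path are arbitrary, and F_NOCACHE is shown here as the usual macOS way to keep writes out of the unified buffer cache.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE (1024 * 1024)          /* illustrative buffer size per write */

/* Write total_bytes to a file on the LTFS mount, bypassing the OS cache. */
static int write_uncached(const char *path, size_t total_bytes)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return -1; }

#ifdef F_NOCACHE
    /* macOS: ask the kernel not to cache pages for this file descriptor. */
    if (fcntl(fd, F_NOCACHE, 1) < 0) perror("fcntl(F_NOCACHE)");
#endif

    char *buf = malloc(CHUNK_SIZE);
    memset(buf, 0, CHUNK_SIZE);           /* stands in for a dequeued data buffer */

    for (size_t done = 0; done < total_bytes; done += CHUNK_SIZE) {
        ssize_t n = write(fd, buf, CHUNK_SIZE);   /* plain POSIX write() */
        if (n < 0) { perror("write"); break; }
    }

    free(buf);
    close(fd);
    return 0;
}

int main(void)
{
    /* Hypothetical path on an LTFS mount point; writes 1 GiB of zeros. */
    return write_uncached("/Volumes/LTFS/testfile.bin", 1024UL * CHUNK_SIZE);
}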

It is understandable that the performance will go down when comparing the 1-drive case to the 2-drive (mirrored set) case. Each individual drive will naturally have a varying write speed, due to the physical characteristics of the drive and defects in the tape medium. When writing a mirrored set, we are coupling the drives together because we only read the data once (and the Writer queue size is finite), causing the overall streaming speed to be limited to the minimum of the speeds of the 2 drives (within a certain window). It's like a 2-car train; one car cannot go faster than the other. To decouple the drive performance as much as possible and prevent "rubber-banding", our Writer queue size is very large (1GB per drive). The attached spreadsheet shows that we are achieving 150 MB/s average with a single drive, but only 133 MB/s with 2 drives in a mirrored configuration. This is expected.

The unexpected result is the further performance decrease when we write to 4 drives. In this case, we are writing 2 mirrored sets, and each mirrored set is a totally independent operation with its own threads. Our Mac Pro machines are 12-core 2.8 GHz with 64GB of RAM, and the filesystems from which we're reading the data are capable of 600 - 1500 MB/s of read bandwidth, so the input side should not be a problem. You can see from the results that one mirrored set (Group 02) is writing at a nearly nominal rate (115-155 MB/s) but the other is much more degraded (85 - 111 MB/s). After Group 02 finishes, the slow Group 01 goes much faster (the last 2 packages are 151-159 MB/s).

Can you think of any shared resources in the LTFS or FUSE drivers that could explain this performance issue?

Single_Double_Quad_Tape_Test.xlsx

piste-jp commented 6 years ago

No, I don't have any information at this time. Basically, we haven't paid much attention to multiple LTFS instances on one machine or their performance.

But the behavior you describe is a little bit interesting to me, so I will investigate a little in my spare time.

I would like to know the environment first. Can you check my assumption below?

richard42 commented 6 years ago

Thanks for offering to look into this. My manager said that he met you at NAB a few years ago and thought that you would be willing to help. Here are the specs for the machine on which these tests were run:

We have observed this performance degradation on older versions of OS X as well, going back to at least Mavericks (10.9). We've seen it with different versions of LTFS and FUSE. We could also run these tests under Linux if that would be useful for you; our application is cross-platform. I can also give you numbers for the read performance degradation; I believe it is even larger than in the write case.

piste-jp commented 6 years ago

Thanks. I will investigate whether the current LTFS (on this GitHub repository) can handle 4 streams or not in my spare time.

Before reporting the results, I would like to say a few things.

  1. I have never visited the NAB show, so the person your manager met is someone else in IBM Japan. (Of course, I think I may know him.)
  2. The current LTFS in this repository does not support HP drives at all, and I don't think HP's LTFS is built on this repository. So I don't know whether my results will be applicable to HP's LTFS.
  3. In my understanding, you are connecting 4 drives to 1 SAS port with a 1-to-4 port fan-out cable. Is that correct? If so, the 4 drives share a single 6 Gbps link, which is 600 MB/s of payload (with 8b/10b encoding) before SCSI overhead. In that case some performance degradation is natural: SCSI overhead is typically estimated at 10%-20%, so the practical limit is about 480-540 MB/s per SAS port.

richard42 commented 6 years ago

Yes, we have had some difficulties due to the fact that the HP and IBM drivers don't support the drives from the other manufacturer, and cannot both be installed at the same time. A few years ago I took a tagged source code version of the IBM driver (before it was put on github) and modified the drive table to add the HP drives to the supported list. I had an IBM 24-slot 2-drive LTO5 changer and 2 HP standalone decks connected to my Mac Pro. It worked fine with this modified IBM LTFS driver.

I was under the impression that the HP and IBM LTFS software packages had a common source ancestor and were mostly the same. If not, I could always get the latest source code here and add the HP drives to the supported table again, then re-run the tests. Most of our drives are HP; I think the only IBM drives we have are that LTO5 changer and some new LTO8 drives.

The H680 card has 8 separate SAS ports, each supporting 6Gb/s (https://www.atto.com/products/adapters/sas-sata/6gb-pcie-30/ESAS-H680-000). The physical configuration is two external connectors, each wired up with 4 SAS ports. We use SFF-8088 fanout cables, and each drive gets its own 6Gb SAS port. So this should not be a bottleneck. We could also test to validate that we get the expected throughput scaling when writing to Tar tapes with this hardware configuration, because we also support the old style of tape backup.

piste-jp commented 6 years ago

Yes, you are right: the HP and IBM LTFS software packages have a common source ancestor. But I don't know whether HP's code follows IBM's latest code or not. My wish is to support HP drives in this code tree, but I don't know whether adding HP's Product ID to the support table is the only thing needed. I think it would be safer to add an sg-hptape backend based on the sg-ibmtape backend.

I added some functions for investigating performance issues to IBM's code, a profiler and a dummy I/O mode, but I don't know whether they are available in HP's code.

I think it is good to start with a 4-stream write on Mac using the dummy I/O file backend (filebackend-dummy-io), and dig deeper only if there is a performance issue in that environment first, because the filebackend-dummy-io environment eliminates the noise of the devices (both disk and tape).

One way to reduce device noise on your side is to use sparse files. Could you measure the performance using sparse files in your environment for me?

Of course, drive compression should be on. I believe the filesystem will never read real data from the disk, and the tape drive will compress the (all-zero) data maximally. As a result, we can transfer a huge amount of data with minimal disk access and minimal tape motion.
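
For example, a sparse source file could be created with something like the sketch below. This is only an illustration: the path and size are arbitrary, and how little space the file actually occupies depends on the source filesystem (APFS, for example, leaves the unwritten region unallocated).

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/sparse_source.bin";   /* example path only */
    off_t size = 100LL * 1024 * 1024 * 1024;       /* 100 GiB logical size */

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Extend the file without writing any data; reads of the hole return
     * zeros, which the drive compresses to almost nothing when compression
     * is enabled. */
    if (ftruncate(fd, size) < 0) { perror("ftruncate"); close(fd); return 1; }

    close(fd);
    return 0;
}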

richard42 commented 6 years ago

We never use compression with our tapes; all of them are formatted with the "-c" option to mkltfs to disable compression. You are suggesting writing sparse files and enabling compression in order to maximize the data volume going through the LTFS driver and SCSI system while minimizing disk and tape I/O?

I could modify our software to just write all zeros and not even read anything from disk. But I'm sure that the input file reading is not the bottleneck, because we can easily read at 1000+ MB/s from our storage while doing other operations.

Also, the performance scaling problem seems to be even worse when we are reading data from tape. We support a "verify" operation in which the data are read from the tape and checksummed with a hashing algorithm without ever being written back to disk. I will ask our QA person to run tests and gather performance metrics for this operation.
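
For context, the verify pass is conceptually equivalent to the sketch below. This is not our actual code: CRC-32 from zlib stands in for the real hashing algorithm, the chunk size is arbitrary, and F_NOCACHE is shown as the usual macOS way to keep the reads out of the page cache. Build with -lz.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <zlib.h>

#define CHUNK_SIZE (1024 * 1024)           /* arbitrary read chunk size */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file-on-ltfs-mount>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
#ifdef F_NOCACHE
    fcntl(fd, F_NOCACHE, 1);               /* macOS: bypass the buffer cache */
#endif

    unsigned char *buf = malloc(CHUNK_SIZE);
    if (!buf) { close(fd); return 1; }

    uLong crc = crc32(0L, Z_NULL, 0);      /* stand-in for the real hash */
    ssize_t n;
    while ((n = read(fd, buf, CHUNK_SIZE)) > 0)
        crc = crc32(crc, buf, (uInt)n);    /* checksum only, never written back */

    printf("crc32 = %08lx\n", crc);
    free(buf);
    close(fd);
    return n < 0 ? 1 : 0;
}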

piste-jp commented 6 years ago

I just want to minimize the side effects of the hardware, even though we know the disk can provide a sufficient data rate.

  1. Sparse files: to minimize disk I/O.
  2. Zero data with compression on: to minimize tape motion. (The tape drive receives uncompressed zero data, but it is compressed inside the drive, so tape motion is minimized even though the host interface processes the full amount of data.)

piste-jp commented 6 years ago

I made a quick sniff test with the attached test script, and LTFS can write 80 GB in 35 seconds across 4 processes (a 20 GB write with dd, times 4 dds), which is roughly 2.3 GB/s aggregate. I don't think LTFS is the bottleneck at all.

Take a look at the attached script and please let me know if you have any questions.

perf-4-writes.tar.gz

richard42 commented 6 years ago

Thanks for looking into this. Your test results indicate that the LTFS filesystem itself can support a high bandwidth with 4 streams writing at once. But you're using the file backend instead of real tape drives, so this does not test the iokit SCSI backend.

I'll set up another test here to drill down further, and we will compare LTFS versus TAR, reading only, with 1, 2, and 4 simultaneous operations.

piste-jp commented 6 years ago

I think the possibility is low that the iokit backend is a bottleneck, because LTFS passes the pointer to the data buffer down to the backend (not only the iokit backend but the other backends too), and the iokit backend just passes it to IOKit almost directly, as shown below.

https://github.com/LinearTapeFileSystem/ltfs/blob/0027fd8d5a65aea2cc14b5feb1513b8a76915abb/src/tape_drivers/osx/iokit-ibmtape/iokit_ibmtape.c#L1286-L1382

https://github.com/LinearTapeFileSystem/ltfs/blob/0027fd8d5a65aea2cc14b5feb1513b8a76915abb/src/tape_drivers/osx/iokit-ibmtape/iokit_scsi.c#L153-L272

One possibility is that IOKit itself is the bottleneck, but I cannot be of much help if IOKit really is the bottleneck.

Anyway, I'm now trying to find 4 LTO drives (IBM drives, of course) in our lab. I will do a deeper analysis when they are available.

piste-jp commented 6 years ago

I tried to write and read data with 4 streams on 4 physical drives, using an ATTO H644 plus a SAS fan-out (1-to-4 port) cable.

  1. IBM LTO5 HH (L5 tape)
  2. IBM LTO6 HH (L5 tape)
  3. IBM LTO7 HH (L6 tape)
  4. IBM LTO5 HH (L7 tape)

I modified the script to use real drives, as below. The written data is fetched from /dev/zero and the read data is dumped to /dev/null, which means no physical device other than the tape drives is involved.

#!/bin/bash

MKLTFS='/usr/local/bin/mkltfs -f '
LTFS='/usr/local/bin/ltfs -o sync_type=unmount '
STREAMS=3    # stream indices 0..3, i.e. 4 drives

for i in `seq 0 ${STREAMS}`; do
    echo "Formatting ${i}"
    ${MKLTFS} -d ${i}
    echo "Mounting ${i}"
    mkdir -p ./sde${i}                 # make sure the mount point exists
    ${LTFS} -o devname=${i} ./sde${i} &
    sleep 2
done

wait

echo "The test will be started after 2 secs"
sleep 2

sync && purge

SECONDS=0
for i in `seq 0 ${STREAMS}`; do
    echo "Starting write stream${i}"
    dd if=/dev/zero of=./sde${i}/data bs=256k count=80000 &
done

echo "Waiting completion"
wait

echo "The test is finished. Duration (W) = ${SECONDS}"

for i in `seq 0 ${STREAMS}`; do
    echo "Dummy read for next test stream${i}"
    dd of=/dev/null if=./sde${i}/data bs=256k count=1 &
done
wait

SECONDS=0
for i in `seq 0 ${STREAMS}`; do
    echo "Starting read stream${i}"
    dd of=/dev/null if=./sde${i}/data bs=256k count=80000 &
done

echo "Waiting completion"
wait

echo "The test is finished. Duration (R) = ${SECONDS}"

for i in `seq 0 ${STREAMS}`; do
    echo "Unmounting ${i} with sudo"
    sudo umount ./sde${i} &
    sleep 2
done

wait

piste-jp commented 6 years ago

From the results, I can't find any performance degradation in this environment. LTFS and the HBA can transfer data at about 270 MiB/s (W) and 210 MiB/s (R) per stream with 4 concurrent streams, as shown below.

The read-side degradation (270 MiB/s (W) vs 210 MiB/s (R)) is an expected result. LTFS has a buffer, 50 MB by default, only for writes, and write requests are processed by a dedicated thread. On the other hand, there is only a 512 KB buffer for reads, so LTFS needs to issue a READ command whenever the requested block crosses the boundary of the current block. We would need some kind of read-ahead architecture to solve this problem (but it may be a big architectural change).
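
Just to illustrate the idea, a double-buffered read-ahead could look roughly like the sketch below. This is a conceptual sketch only, not LTFS code: the block size and the read_block() callback are placeholders, and initialization (allocating the two buffers, initializing the mutex and condition variable, starting the thread) and shutdown are omitted.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define BLOCK_SIZE (512 * 1024)              /* assumed transfer block size */

typedef long (*read_block_fn)(size_t block_no, void *buf);

struct readahead {
    char *bufs[2];            /* two BLOCK_SIZE buffers, used alternately */
    size_t next_block;        /* block number the prefetch thread fetches next */
    bool ready;               /* true when the prefetched buffer holds valid data */
    pthread_mutex_t lock;
    pthread_cond_t cond;
    read_block_fn read_block; /* backend read, e.g. one READ command per block */
};

/* Prefetch thread: always keep one block fetched ahead of the consumer.
 * Error handling, EOF and shutdown are omitted for brevity. */
static void *prefetch_thread(void *arg)
{
    struct readahead *ra = arg;
    for (;;) {
        pthread_mutex_lock(&ra->lock);
        while (ra->ready)                    /* wait until the consumer took it */
            pthread_cond_wait(&ra->cond, &ra->lock);
        size_t block = ra->next_block;
        pthread_mutex_unlock(&ra->lock);

        ra->read_block(block, ra->bufs[block % 2]);   /* fetch ahead of demand */

        pthread_mutex_lock(&ra->lock);
        ra->ready = true;
        pthread_cond_signal(&ra->cond);
        pthread_mutex_unlock(&ra->lock);
    }
    return NULL;
}

/* Consumer: returns the next block's buffer; calling this again implies the
 * previously returned buffer is no longer needed, so it can be reused. */
static char *readahead_next(struct readahead *ra)
{
    pthread_mutex_lock(&ra->lock);
    while (!ra->ready)
        pthread_cond_wait(&ra->cond, &ra->lock);
    char *buf = ra->bufs[ra->next_block % 2];
    ra->next_block++;
    ra->ready = false;
    pthread_cond_signal(&ra->cond);
    pthread_mutex_unlock(&ra->lock);
    return buf;
}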

So I think we can conclude that the drives themselves cause the performance degradation. In my experience, native drive performance is a little bit sensitive to the following factors. I recommend digging deeper on the drive side if you are eager to solve this.

  1. Internal buffer size of the drive
  2. Power of reel motor
  3. The time between the previous write finishing and the next write starting (and the same on the read side)

Write side perf

20971520000 bytes transferred in 73.967070 secs (283525088 bytes/sec)
20971520000 bytes transferred in 74.073209 secs (283118827 bytes/sec)
20971520000 bytes transferred in 74.221951 secs (282551452 bytes/sec)
20971520000 bytes transferred in 76.595445 secs (273795916 bytes/sec)

Read side perf

20971520000 bytes transferred in 94.368952 secs (222229023 bytes/sec)
20971520000 bytes transferred in 94.376731 secs (222210706 bytes/sec)
20971520000 bytes transferred in 94.855077 secs (221090116 bytes/sec)
20971520000 bytes transferred in 95.146789 secs (220412273 bytes/sec)

piste-jp commented 6 years ago

Please reopen this if you find any new facts about this.

richard42 commented 6 years ago

Thank you very much for all of the investigation you did for this issue. I will run these tests on our hardware and report the results here. I'm sorry for the delay, but all of our tape decks have been out in the field being used on shows for the past few weeks. As soon as we get 4 of them back in the lab I'll test.