TraceMachina / nativelink

NativeLink is an open source high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and specialized hardware.
https://nativelink.com
Apache License 2.0

Sparse files get de-sparsified #409

Closed lukts30 closed 6 months ago

lukts30 commented 9 months ago

After execution, when output files are copied from the worker to the filesystem CAS directory, they lose their sparseness and get fully de-sparsified. This obviously increases disk usage, and the entire process of copying the file and hashing it takes considerably more time than a simple move followed by a sha256sum calculation on a sparse file.

(The file is copied byte by byte rather than moved or copied via reflink/copy_file_range.)
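A quick way to see this, besides the filefrag output below, is to compare a file's apparent size with the bytes actually allocated on disk. A minimal standalone Rust sketch (not NativeLink code; the path argument is just a placeholder):

use std::os::unix::fs::MetadataExt;

fn main() -> std::io::Result<()> {
    // Placeholder path: point this at the genrule output or at the CAS copy.
    let path = std::env::args().nth(1).expect("usage: sparse-check <path>");
    let meta = std::fs::metadata(&path)?;
    // st_blocks is always reported in 512-byte units, independent of the fs block size.
    let allocated = meta.blocks() * 512;
    println!("{path}: apparent {} bytes, allocated {} bytes", meta.len(), allocated);
    // For a sparse file, allocated is far smaller than the apparent size;
    // after the byte-by-byte copy into the CAS the two are roughly equal.
    Ok(())
}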

https://github.com/TraceMachina/native-link/blob/f989e612715a7fe645e69c4c78a50e9b7262ad17/config/examples/basic_cas.json

genrule(
    name = "create_sparse_file",
    outs = ["re_large_file"],
    cmd = "fallocate -l 5G $(OUTS)",
)

lukas@PC6061B ~/Downloads/turbo-cache $ filefrag -v tmp/turbo_cache/work/30d5d89de2794c8b81c8ceac8c02d0cc96085888988a72075d39e8b939202650/bazel-out/k8-fastbuild/bin/re_large_file
Filesystem type is: 9123683e
File size of tmp/turbo_cache/work/30d5d89de2794c8b81c8ceac8c02d0cc96085888988a72075d39e8b939202650/bazel-out/k8-fastbuild/bin/re_large_file is 5368709120 (1310720 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 1863472064..1863537599:  65536:             unwritten
   1:    65536..  196607: 1872450496..1872581567: 131072: 1863537600: unwritten
   2:   196608..  327679: 1881560000..1881691071: 131072: 1872581568: unwritten
   3:   327680..  458751: 1882974994..1883106065: 131072: 1881691072: unwritten
   4:   458752..  524287: 1883657152..1883722687:  65536: 1883106066: unwritten
   5:   524288..  589823: 1883741638..1883807173:  65536: 1883722688: unwritten
   6:   589824..  655359: 1890763862..1890829397:  65536: 1883807174: unwritten
   7:   655360..  720895: 2028229568..2028295103:  65536: 1890829398: unwritten
   8:   720896..  917503: 2028884928..2029081535: 196608: 2028295104: unwritten
   9:   917504.. 1114111: 2030457792..2030654399: 196608: 2029081536: unwritten
  10:  1114112.. 1310719: 2030982080..2031178687: 196608: 2030654400: last,unwritten,eof
tmp/turbo_cache/work/30d5d89de2794c8b81c8ceac8c02d0cc96085888988a72075d39e8b939202650/bazel-out/k8-fastbuild/bin/re_large_file: 11 extents found

File size of tmp/turbo_cache/data-worker-test/content_path-cas/7f06c62352aebd8125b2a1841e2b9e1ffcbed602f381c3dcb3200200e383d1d5-5368709120 is 5368709120 (1310720 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   65535: 2035009969..2035075504:  65536:            
   1:    65536..  131071: 2035592031..2035657566:  65536: 2035075505:
   2:   131072..  196607: 2037188178..2037253713:  65536: 2035657567:
   3:   196608..  262143: 2037273536..2037339071:  65536: 2037253714:
   4:   262144..  327679: 2038713195..2038778730:  65536: 2037339072:
   5:   327680..  458751: 2047055695..2047186766: 131072: 2038778731:
   6:   458752..  655359: 2047235008..2047431615: 196608: 2047186767:
   7:   655360..  851967: 2047497152..2047693759: 196608: 2047431616:
   8:   851968..  858862: 2047726528..2047733422:   6895: 2047693760:
   9:   858863..  859572: 2048257283..2048257992:    710: 2047733423:
  10:   859573..  859584: 1548714348..1548714359:     12: 2048257993:
  11:   859585..  859589: 1548714158..1548714162:      5: 1548714360:
  12:   859590..  925125: 2046972864..2047038399:  65536: 1548714163:
  13:   925126..  990661: 2047759296..2047824831:  65536: 2047038400:
  14:   990662.. 1187269: 2053067711..2053264318: 196608: 2047824832:
  15:  1187270.. 1310719: 2055361472..2055484921: 123450: 2053264319: last,eof
MarcusSorealheis commented 9 months ago

@lukts30 We're looking into this one.

allada commented 9 months ago

I'm trying to focus on the root problem and on how we can support this kind of feature without breaking other API requirements (for example, NFS or other filesystems/stores that don't support sparse files). I want to pose the problem back to you to see if this is the underlying issue:

Problem: When the Worker & CAS are on the same machine, it requires a full copy of the data.

If that is the problem, a possible solution: make the Worker aware of when it is uploading to a local filesystem CAS and hardlink the file when it is on the same filesystem. The copy would then only cost a hardlink, which is extremely cheap.
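Roughly, the idea looks like the sketch below (hypothetical helper names, not the actual implementation): try a hardlink first and fall back to a plain copy when the CAS is on a different filesystem. This assumes CAS entries are treated as immutable, since a hardlink shares the underlying inode.

use std::fs;
use std::io;
use std::path::Path;

fn link_or_copy(src: &Path, dst: &Path) -> io::Result<()> {
    match fs::hard_link(src, dst) {
        Ok(()) => Ok(()),
        // A cross-device link (EXDEV) or similar error means we cannot hardlink,
        // so fall back to a regular byte copy.
        Err(_) => fs::copy(src, dst).map(|_| ()),
    }
}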

As a side note, you may also be interested in using dedup_store and/or compression_store, which can be combined to reduce on-disk size for files that are often identical between builds.

lukts30 commented 9 months ago

Thanks for looking into this issue.

When the Worker & CAS are on the same machine, it requires a full copy of the data.

That is an accurate summary of this issue.

Worker aware of when it is uploading to a local filesystem CAS and hardlink the file if it is on the same filesystem

Indeed, hardlinking or moving the file, or "copying" it via reflink/copy_file_range (on filesystems that support it), would all similarly address the expensive copy operation.
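For reference, the copy_file_range path could look roughly like this (a sketch assuming the libc crate, not NativeLink code). On filesystems such as Btrfs or XFS the kernel can satisfy the call with a reflink, so no data is duplicated; elsewhere it still avoids shuttling bytes through userspace.

use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

fn copy_range_all(src: &File, dst: &File, mut remaining: u64) -> io::Result<()> {
    while remaining > 0 {
        let n = unsafe {
            libc::copy_file_range(
                src.as_raw_fd(),
                std::ptr::null_mut(), // let the kernel advance the source offset
                dst.as_raw_fd(),
                std::ptr::null_mut(), // let the kernel advance the destination offset
                remaining as usize,
                0,
            )
        };
        if n < 0 {
            return Err(io::Error::last_os_error());
        }
        if n == 0 {
            break; // reached EOF
        }
        remaining -= n as u64;
    }
    Ok(())
}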

I am wondering if the suggested hardlink approach would also work in a scenario where the Workers are distributed, but the storage for both the CAS and worker filesystems is managed through the same shared NFS/SMB file system share?

allada commented 9 months ago

I am wondering if the suggested hardlink approach would also work in a scenario where the Workers are distributed, but the storage for both the CAS and worker filesystems is managed through the same shared NFS/SMB file system share?

NFS/SMB as a shared medium for distributed workers to write to is not currently supported. This is because we evict items, and we currently don't have a way for external sources to notify us of changes in the FilesystemStore.

Instead of using NFS, the preferred model is to use a remote store and transfer the data through some kind of pipe (e.g. TCP). Adding network filesystem support sounds like it would be very difficult to write and would likely add a lot of technical debt.

Given that, we do currently plan on supporting a FUSE filesystem that will materialize files on demand. This would allow compression of data over the network and deduplication, at the cost of latency. In the case of NFS I would expect the latency to be about the same, so it "might" be what you are looking for.

lukts30 commented 6 months ago

Commit https://github.com/TraceMachina/nativelink/commit/a0788fabc1831714e39fa5047e0a385a2c62234f did not change anything in this regard, right? It still copies the file and does not use a move/hardlink to relocate it.


I tested again a bit, and RE through nativelink is still noticeably slower than what I would hope for.

My testing used a buck2 BUILD file: first a build of :random_data, then a separate build of :instant_copy, which reuses the cached file for a reflink copy (the copy itself takes only 5ms). But :instant_copy took 70 seconds to complete, with more than 99% of the time spent "transferring" the file to the CAS. The observed I/O was a read of the entire $OUT file at about 300MB/s, followed by simultaneous reading and writing, each at 300MB/s (monitored through iotop; this does not include the time/data for the RE client to fetch the result artifact).

Because the file is read twice, the effective speed is in practice only 150MB/s. But even 300MB/s is rather slow compared to other tools. For comparison, both curl and wget can download a file from a local HTTP server (caddy) at 800MB/s without any special parameters, and the built-in HTTP download rule in buck2 even achieves 1.2GB/s by default. Additionally, since I configured the hash to be BLAKE3, I also ran b3sum with a single thread on the 10GB file; that took 4-6s (an average read speed above 1.6GB/s).

genrule(
    name = "random_data",
    out = "gen_large_file",
    cmd =  "dd if=/dev/urandom of=$OUT bs=1G count=10 iflag=fullblock",
)

genrule(
    name = "instant_copy",
    out = "gen_large_file",
    cmd =  "cp --reflink=always $(location :random_data) $OUT",
)
allada commented 6 months ago

Yes, the commit I made was just lining things up to support special filesystem calls like sparse copy.

Can you confirm that you compiled with --release (-O3-ish)? There is a significant performance difference if you don't compile in release mode.

I saw that you are using basic_cas.json, but can you confirm that you did not make any changes to it (that would impact these results)?

lukts30 commented 6 months ago

I am using a release build and the tests were done with commit https://github.com/TraceMachina/nativelink/commit/2a89ce6384b428869e21219af303c753bd3087b5 because of https://github.com/TraceMachina/nativelink/issues/665.

I used basic_cas.json and only changed the output paths, set the hash to blake3, and tried removing the memory store, but that did not change much.

allada commented 6 months ago

I did some local testing, and yes, I do see it taking much longer than it should.

Here are my local results:

real 24.834s = time dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
real 3.31s   = cat ./dummy_foo | time b3sum --no-mmap --num-threads 1
real 8.040s  = time cp ./dummy_foo ./dummy_foo2

Nativelink results (modified source to capture timing):

24.854s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
47.4s   = computing hash
25.312s = uploading file (copy file)

If I change this line to 1 MiB:

24.751s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
7.128s  = computing hash
10.7408s = uploading file (copy file)

I then wanted to see if I put the hashing function onto a different spawn (thread) than the spawn (thread) that reads the file contents how might it improve:

24.7814s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
3.5946s  = computing hash
10.677s  = uploading file (copy file)

I then checked to see if I put the uploading/copying part onto different spawns/threads if it would help:

24.8469s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
3.2696s  = computing hash
7.2525s  = uploading file (copy file)

So the obvious low-hanging fruit here is to simply increase the default DEFAULT_READ_BUFF_SIZE to a larger value and possibly make it configurable in the global config.

I'm torn on whether we should support multiple spawns/threads in this section of code, since we intentionally try not to create any spawns per gRPC connection in order to keep each user on a single thread. This keeps extreme parallelism fairly cooperative; otherwise one user could create millions of small files and use lots of threads computing digests and such, starving everyone else. I'll think about this one a bit more.
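To make the two experiments concrete, here is a rough sketch of the shape of the change (hypothetical names, not the actual nativelink code): read the file in large, configurable chunks and hash on a separate task so reading and digest computation overlap.

use tokio::io::AsyncReadExt;
use tokio::sync::mpsc;

// Stand-in for DEFAULT_READ_BUFF_SIZE bumped to 1 MiB.
const READ_BUFF_SIZE: usize = 1024 * 1024;

async fn hash_file(path: &str) -> std::io::Result<blake3::Hash> {
    let mut file = tokio::fs::File::open(path).await?;
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(4);

    // Hashing runs on its own task, fed by the reader through a bounded channel.
    let hasher_task = tokio::spawn(async move {
        let mut hasher = blake3::Hasher::new();
        while let Some(chunk) = rx.recv().await {
            hasher.update(&chunk);
        }
        hasher.finalize()
    });

    loop {
        let mut buf = vec![0u8; READ_BUFF_SIZE];
        let n = file.read(&mut buf).await?;
        if n == 0 {
            break;
        }
        buf.truncate(n);
        if tx.send(buf).await.is_err() {
            break;
        }
    }
    drop(tx); // close the channel so the hashing task can finish

    Ok(hasher_task.await.expect("hashing task panicked"))
}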

lukts30 commented 6 months ago

I can confirm that using a 1 MiB buffer helped significantly and brought build times down from 17m45.5s to 10m50.7s (including a negligible few seconds for the RE client to fetch artifacts). Local builds take 6m52.4s.

Based on these findings I would recommend making DEFAULT_READ_BUFF_SIZE configurable.

Also is there a way to log how long execution, hashing & uploading took?

allada commented 6 months ago

Yes. @aaronmondal or @blakehatch, do one of you want to make this DEFAULT_READ_BUFF_SIZE a global config? (We also need to do some code searching to see if there are other places that could use the same config.)

As for the threading, I think we should optimize it. Right now we are already paying a very high cost because tokio actually uses the synchronous filesystem API in a different threadpool. Since we are already paying that cost, we should instead stream the data from the synchronous filesystem API. One big advantage of doing it this way is that we could easily wire this up to mmap later, which would give even higher throughput.

We do not currently separate out hashing time from other time in the worker/running_actions_manager... currently it's all lumped into upload_results. We could separate this out, I think. @aaronmondal this might be an easy one for you :-)

As an FYI, @lukts30, we are currently working on some tooling around using ReClient/Chromium as a benchmark metric that will help us understand where we need to improve.

allada commented 6 months ago

I spent a little more time on this. I have a local change I'll push up soon that fixes the hash time completely. It should bring the total hash time into parity with b3sum in single-threaded mode.

I'm currently looking at optimizing upload time. In doing so, there's a very high chance I'll also implement it as a file move. This should make local execution overhead nearly zero.

allada commented 6 months ago

How about this: https://github.com/allada/nativelink-fork/commit/cf25b316cf3e06f217d314200e26ec288eaa19d8

It still needs some cleanup, and I decided to clean up some code along the way, so it'll be multiple PRs before it's in.

It appears that rayon + mmap and plain mmap are about the same speed on my computer, so I may disable multi-threading.

24.7308s = running dd if=/dev/urandom of=/tmp/dummy_foo bs=1G count=10 iflag=fullblock
0.56695s = computing hash
0.00005s = uploading file (copy file)

I'm going to bet that we need to optimize localhost data xfer though, so this is likely only one step towards extreme speeds :smile:
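For anyone who wants to reproduce the mmap comparison, here is a small standalone benchmark sketch (assuming the blake3 crate with its "mmap" and "rayon" features enabled; /tmp/dummy_foo is the 10G file created by the dd command above):

use std::time::Instant;

fn main() -> std::io::Result<()> {
    let path = "/tmp/dummy_foo";

    let start = Instant::now();
    let mut hasher = blake3::Hasher::new();
    hasher.update_mmap(path)?; // single-threaded hash over a memory map
    println!("mmap, 1 thread: {:?} ({})", start.elapsed(), hasher.finalize());

    let start = Instant::now();
    let mut hasher = blake3::Hasher::new();
    hasher.update_mmap_rayon(path)?; // same, but split across the rayon thread pool
    println!("mmap + rayon:   {:?} ({})", start.elapsed(), hasher.finalize());

    Ok(())
}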

allada commented 6 months ago

I did not use the rayon hashing, but if it would help your use case a lot, could you open a new ticket? We can possibly make it something that can be enabled via a config.