yarikoptic opened 9 months ago
Here are preliminary results for a simple "download a subtree of a zarr" benchmark using `aws s3` and `rclone` directly against S3, versus access through dandidav's `/dandisets/` and then `/zarrs/` endpoints: we see that the `/dandisets/` endpoint is quite slow, but the manifest-based, shorter-path `/zarrs/` one is quite performant (well -- still twice as slow, but that is just that -- twice! ;))
(dev3) yoh@typhon:~/proj/dandi/zarr-benchmarking/tools$ PART=0/0/0/0/ CONCUR=10 ./simple_parallel_get.sh
Downloading part 0/0/0/0/ within zarr 0d5b9be5-e626-4f6a-96da-b6b602954899 asking for up to 10 processes
---------------
get_aws_s3: /tmp/zarr-bm/get_aws_s3
22.49user 2.03system 0:29.91elapsed 82%CPU (0avgtext+0avgdata 119156maxresident)k
0inputs+24064outputs (0major+110906minor)pagefaults 0swaps
checksum 742c8d77baf7240437d33117f1d063fb-3008--1684480
---------------
get_rclone_s3: /tmp/zarr-bm/get_rclone_s3
3.64user 0.94system 0:16.16elapsed 28%CPU (0avgtext+0avgdata 85516maxresident)k
0inputs+24064outputs (0major+5286minor)pagefaults 0swaps
checksum 742c8d77baf7240437d33117f1d063fb-3008--1684480
---------------
get_rclone_dandisets: /tmp/zarr-bm/get_rclone_dandisets
4.63user 1.99system 3:16.25elapsed 3%CPU (0avgtext+0avgdata 74280maxresident)k
0inputs+24064outputs (0major+5663minor)pagefaults 0swaps
checksum 742c8d77baf7240437d33117f1d063fb-3008--1684480
---------------
get_rclone_zarr_manifest: /tmp/zarr-bm/get_rclone_zarr_manifest
3.47user 1.55system 0:32.48elapsed 15%CPU (0avgtext+0avgdata 74440maxresident)k
0inputs+24064outputs (0major+5382minor)pagefaults 0swaps
checksum 742c8d77baf7240437d33117f1d063fb-3008--1684480
On serial access we are on par with `aws` on `/zarrs/`, but rclone beats us:
(dev3) yoh@typhon:~/proj/dandi/zarr-benchmarking/tools$ PART=0/0/0/0/ CONCUR=1 ./simple_parallel_get.sh
Downloading part 0/0/0/0/ within zarr 0d5b9be5-e626-4f6a-96da-b6b602954899 asking for up to 1 processes
---------------
get_aws_s3: /tmp/zarr-bm/get_aws_s3
20.58user 1.82system 3:06.44elapsed 12%CPU (0avgtext+0avgdata 109096maxresident)k
0inputs+24064outputs (0major+97814minor)pagefaults 0swaps
checksum 742c8d77baf7240437d33117f1d063fb-3008--1684480
---------------
get_rclone_s3: /tmp/zarr-bm/get_rclone_s3
4.69user 1.75system 2:43.25elapsed 3%CPU (0avgtext+0avgdata 81728maxresident)k
0inputs+24064outputs (0major+6314minor)pagefaults 0swaps
checksum 742c8d77baf7240437d33117f1d063fb-3008--1684480
---------------
get_rclone_dandisets: /tmp/zarr-bm/get_rclone_dandisets
4.86user 2.44system 12:44.14elapsed 0%CPU (0avgtext+0avgdata 71256maxresident)k
0inputs+24064outputs (0major+3675minor)pagefaults 0swaps
checksum 742c8d77baf7240437d33117f1d063fb-3008--1684480
---------------
get_rclone_zarr_manifest: /tmp/zarr-bm/get_rclone_zarr_manifest
4.45user 2.20system 3:04.90elapsed 3%CPU (0avgtext+0avgdata 71740maxresident)k
0inputs+24064outputs (0major+4791minor)pagefaults 0swaps
checksum 742c8d77baf7240437d33117f1d063fb-3008--1684480
The `/dandisets/` path remains many times slower -- we should look into optimizing there. But I guess the first thing would be to establish an efficient dandi-archive API per the relevant issue.
@yarikoptic Yes, a dedicated dandi-archive endpoint would definitely help with efficiency.
Although there is no %wait, I wonder if logging (#69) is already somehow affecting it.
@yarikoptic Did you build the `dandidav` binary with `--release`? If not, you should.
@yarikoptic - just a note that testing throughput could be broken into two parts (reading, and reading + verifying). The former is what any zarr reader would do; it wouldn't try to check that it received the correct bytes. The latter is what dandi-cli would do to ensure that we do receive the correct bytes. The tools used above do various things depending on flags. Also, for parallel reads from S3, s5cmd is significantly faster than most other tools (https://github.com/peak/s5cmd).
@satra thanks for the note. In the script/output above I report the zarr checksum we compute for all downloads -- all good as long as they match. If it is not a partial download of a zarr (unlike those above), the script would explicitly compare against the target overall zarr checksum. Verification is not counted as part of the benchmark. If you have some Python / any other code you would like to use to benchmark zarrs -- please share.
Re s5cmd, I will add it then, probably in favor of it over `aws s3`, which consistently shows itself to be slow.
FWIW @jwodder said that he already looked into scalability a little but has not found a resolution yet. Given that rclone generally seems able to reach s5cmd performance when accessing S3, I think solving the scalability of the `/zarr/` endpoint should provide a comparable solution altogether.
FWIW I have pushed the current version of the helper to https://github.com/dandi/zarr-benchmarking/blob/master/tools/simple_parallel_get.sh
@yarikoptic To be clear: You're concluding that `dandidav` can only support a maximum of about 20 requests based on the fact that the total wallclock runtime to process concurrent requests stops decreasing at 20, correct? If so, that interpretation seems backwards: if `dandidav` can't handle more than 20 concurrent requests, then sending more requests would result in some not being handled immediately, causing the total runtime to increase.
Correct. I don't think so, though, since the overall number of requests is constant (the number of files in that zarr path) and the only thing changing is how many we submit in parallel. So it is really a question of throughput. My simplified thinking/explanation assumes the submitter does not wait for responses (unlike what we actually do), but the overall reasoning is the same: if dandidav can handle at most 20 requests per second, the time to process 100 requests would remain a constant 5 seconds, regardless of how many requests (above 20) I submit at once in parallel.
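To make that reasoning concrete, here is a toy sketch (the per-worker request rate and the 20 requests/second cap are illustrative assumptions, not measurements):

```rust
// Toy model of the reasoning above; all numbers are illustrative assumptions.
// Each client worker is assumed to issue roughly one request per second, and
// the server is assumed to complete at most `server_rps` requests per second.
fn wall_time_secs(total_requests: f64, workers: f64, server_rps: f64) -> f64 {
    let client_rps = workers; // ~1 request/second per worker (assumption)
    total_requests / client_rps.min(server_rps)
}

fn main() {
    for workers in [5.0, 10.0, 20.0, 30.0, 50.0] {
        println!(
            "{:>4} workers -> ~{:.1}s for 100 requests",
            workers,
            wall_time_secs(100.0, workers, 20.0)
        );
    }
}
```

Above the server's capacity the wall time stays flat, which is the "constant 5 seconds" in the reasoning above.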
@yarikoptic I think I misunderstood what you were doing with concurrent processes. I thought you were running, say, 20 copies of `rclone` against the same endpoint in parallel, but it's actually just one `rclone` process that internally divides its requests into concurrent batches of 20.
Correct! Indeed it might also be valuable to test with multiple clients in parallel bombarding the same/different zarrs.
@yarikoptic I'm going to ask about request limits on a Rust forum, and I need to know how many CPUs (not cores) the machine `dandidav` is running on has. I believe running `python3 -c "import os; print(os.cpu_count())"` should give the correct value.
@yarikoptic Also, could you run `for c in 1 5 10 20 30 40 50; do CONCUR=$c ...` using the `get_rclone_dandisets` method instead of `get_rclone_zarr_manifest`? I want to confirm that both endpoints exhibit the 20-request limit with a "release" build.
> I believe running `python3 -c "import os; print(os.cpu_count())"` should give the correct value.

32
> @yarikoptic Also, could you run `for c in 1 5 10 20 30 40 50; do CONCUR=$c ...` using the `get_rclone_dandisets` method instead of `get_rclone_zarr_manifest`?

Sure, doing it against LOCAL:
*$> for c in 1 5 10 20 30 40 50; do CONCUR=$c METHODS=get_rclone_zarr_manifest RCLONE_DANDI_WEBDAV=DANDI-WEBDAV-LOCAL tools/simple_parallel_get.sh; done
Downloading part 0/0/0/0/0/ within zarr 0d5b9be5-e626-4f6a-96da-b6b602954899 asking for up to 1 processes
---------------
get_rclone_zarr_manifest: /tmp/zarr-bm/get_rclone_zarr_manifest
TIME: 0:13.37 real 0.29 user 0.22 sys 3% CPU
checksum f494a7ab20c6906e4c176e7a0f08c29d-188--105280
Downloading part 0/0/0/0/0/ within zarr 0d5b9be5-e626-4f6a-96da-b6b602954899 asking for up to 5 processes
---------------
get_rclone_zarr_manifest: /tmp/zarr-bm/get_rclone_zarr_manifest
TIME: 0:02.47 real 0.25 user 0.17 sys 17% CPU
checksum f494a7ab20c6906e4c176e7a0f08c29d-188--105280
Downloading part 0/0/0/0/0/ within zarr 0d5b9be5-e626-4f6a-96da-b6b602954899 asking for up to 10 processes
---------------
get_rclone_zarr_manifest: /tmp/zarr-bm/get_rclone_zarr_manifest
TIME: 0:02.15 real 0.36 user 0.10 sys 21% CPU
checksum f494a7ab20c6906e4c176e7a0f08c29d-188--105280
Downloading part 0/0/0/0/0/ within zarr 0d5b9be5-e626-4f6a-96da-b6b602954899 asking for up to 20 processes
---------------
get_rclone_zarr_manifest: /tmp/zarr-bm/get_rclone_zarr_manifest
TIME: 0:02.10 real 0.30 user 0.18 sys 23% CPU
checksum f494a7ab20c6906e4c176e7a0f08c29d-188--105280
Downloading part 0/0/0/0/0/ within zarr 0d5b9be5-e626-4f6a-96da-b6b602954899 asking for up to 30 processes
---------------
get_rclone_zarr_manifest: /tmp/zarr-bm/get_rclone_zarr_manifest
TIME: 0:02.09 real 0.39 user 0.12 sys 24% CPU
checksum f494a7ab20c6906e4c176e7a0f08c29d-188--105280
Downloading part 0/0/0/0/0/ within zarr 0d5b9be5-e626-4f6a-96da-b6b602954899 asking for up to 40 processes
---------------
get_rclone_zarr_manifest: /tmp/zarr-bm/get_rclone_zarr_manifest
TIME: 0:02.14 real 0.36 user 0.14 sys 23% CPU
checksum f494a7ab20c6906e4c176e7a0f08c29d-188--105280
Downloading part 0/0/0/0/0/ within zarr 0d5b9be5-e626-4f6a-96da-b6b602954899 asking for up to 50 processes
---------------
get_rclone_zarr_manifest: /tmp/zarr-bm/get_rclone_zarr_manifest
TIME: 0:02.10 real 0.34 user 0.15 sys 23% CPU
checksum f494a7ab20c6906e4c176e7a0f08c29d-188--105280
shub@falkor:~$ dandidav-test/dandidav --version
dandidav 0.2.0 (commit: 4998bd8)
shub@falkor:~$ file dandidav-test/dandidav
dandidav-test/dandidav: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=825b59e00515dad24f6ae389e6a79c65765232dd, for GNU/Linux 3.2.0, with debug_info, not stripped
If needed next time -- to expedite, just run the script on e.g. drogon: it is now on GitHub. I can also make an account for you on either typhon (where I ran it originally) or falkor (the webserver), just let me know.
Hi @jwodder, anything from the Rust community on the scalability issue? What should we look for?
@yarikoptic No, my post got no attention. The only thing I've learned is that axum (the web framework `dandidav` uses) does not impose any request limits by default, so the bottleneck isn't coming from there.
But what is it then -- a limited number of threads?
@yarikoptic `tokio` (the async runtime) does limit the number of worker threads to the number of CPUs by default, but you said that's 32, which wouldn't explain why we're seeing bottlenecks at 20. Also, since the code is async, it's possible for more than 32 requests to be processed at once anyway.
Is it tunable? If we double it and it doesn't have an effect -- it must be something else. If it does -- we could figure out why 20 and not 32 ;-)
@yarikoptic Yes, it's tunable. If you set the environment variable `TOKIO_WORKER_THREADS` to an integer when running the server, that'll be how many worker threads are used.
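For reference, the same knob can also be set programmatically on tokio's runtime builder; a minimal sketch (not dandidav's actual startup code, and the thread count is just an example):

```rust
// Sketch: build a multi-threaded tokio runtime with an explicit worker-thread
// count instead of relying on the TOKIO_WORKER_THREADS environment variable.
fn main() {
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(64) // example value; the default is the number of CPUs
        .enable_all()
        .build()
        .expect("failed to build tokio runtime");

    runtime.block_on(async {
        // the server / application code would run here
        println!("runtime started");
    });
}
```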
If you're really paranoid, you can check what Rust thinks the number of CPUs is as follows:

1. Create a new directory on the server machine and `cd` into it.

2. Create a `Cargo.toml` file with the following contents:

       [package]
       name = "cpu-count"
       version = "0.1.0"
       edition = "2021"

       [dependencies]
       num_cpus = "1.16.0"

3. Create a `src/main.rs` file with the following contents:

       fn main() {
           println!("Number of CPUs: {}", num_cpus::get());
       }

4. Run `cargo run`.
FWIW.
So there is indeed more to it.
@yarikoptic This is largely a shot in the dark, but on the server that `dandidav` is running on, are 20 of the CPU cores performance cores and the rest efficiency cores? (I don't know how you'd check this.) I don't know how core scheduling works with efficiency cores, but since all of `dandidav`'s work is relatively short-lived, maybe only the performance cores are being used?
I had no clue that there even are different kinds of cores. FWIW, to rule out a cores effect we could check scalability while running on some other server. Could you please run on typhon, which has 32 cores? ndoli has 40.
@yarikoptic

> I even had no clue that there are different kinds of cores.

It's a new thing that ARM came out with in 2011, and then Intel made its own version in 2022-ish. (Thus, if the server's CPUs are from 2021 or earlier, then they're just "normal" cores and my theory is wrong.) I only found out about the different types last week or the week before when I was looking at the specs for new MacBooks.

> Could you please run on typhon which has 32 cores?

I don't think you ever gave me access to typhon, and I don't know how to access it. (At the very least, typhon.datalad.org doesn't resolve for me.)
@yarikoptic I ran both `dandidav` and the timing script on typhon, and the 20-request limit was still present:

`get_rclone_dandisets`:
Processes | Time | User | Sys | CPU |
---|---|---|---|---|
1 | 0:38.22 | 0.35 | 0.17 | 1% |
5 | 0:10.30 | 0.42 | 0.16 | 5% |
10 | 0:09.48 | 0.39 | 0.12 | 5% |
15 | 0:08.83 | 0.37 | 0.22 | 6% |
20 | 0:09.43 | 0.38 | 0.17 | 6% |
30 | 0:09.24 | 0.42 | 0.16 | 6% |
40 | 0:10.11 | 0.53 | 0.19 | 7% |
50 | 0:09.68 | 0.53 | 0.20 | 7% |
`get_rclone_zarr_manifest`:
Processes | Time | User | Sys | CPU |
---|---|---|---|---|
1 | 0:10.67 | 0.35 | 0.22 | 5% |
5 | 0:02.28 | 0.32 | 0.10 | 18% |
10 | 0:02.14 | 0.35 | 0.17 | 24% |
15 | 0:02.10 | 0.42 | 0.10 | 25% |
20 | 0:02.10 | 0.27 | 0.10 | 18% |
30 | 0:02.12 | 0.57 | 0.11 | 32% |
40 | 0:02.10 | 0.42 | 0.14 | 27% |
50 | 0:02.09 | 0.37 | 0.13 | 24% |
@yarikoptic I wrote https://github.com/jwodder/batchdav for measuring WebDAV traversal time using a given number of workers. Unlike `rclone`, rather than spending time downloading files from S3, the only requests it makes for the actual files are `HEAD` requests that don't follow redirects.
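For illustration, a probe of that kind could look roughly like the following reqwest-based sketch (this is not batchdav's actual code; the URL is just the `/zarrs/` collection root as a placeholder):

```rust
// Sketch: a HEAD request with redirect following disabled, so the S3 redirect
// target is reported but never followed or downloaded.
use reqwest::{redirect::Policy, Client};

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = Client::builder()
        .redirect(Policy::none()) // do not follow the redirect to S3
        .build()?;

    // placeholder URL; the real file URLs would be discovered by traversing
    // the WebDAV collection listings
    let url = "https://webdav.dandiarchive.org/zarrs/";
    let response = client.head(url).send().await?;
    println!("{} -> {}", url, response.status());
    Ok(())
}
```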
I've run it at various points, and the times from different sessions have varied by about an order of magnitude, which I suspect is due to differing amounts of load on Heroku. Some of the fastest times I've ever gotten are:
Workers | Time (mean) | Time (stddev) |
---|---|---|
1 | 8.399 | 0.361 |
5 | 1.670 | 0.129 |
10 | 1.041 | 0.109 |
15 | 0.713 | 0.062 |
20 | 0.751 | 0.110 |
30 | 0.795 | 0.102 |
40 | 0.726 | 0.081 |
50 | 0.713 | 0.079 |
Make of that what you will.
Just to recall/summarize what we have so far:

- dandidav (the `/zarr` endpoints) seems to be (just) slightly over twice as slow as the fast s5cmd (or rclone) when actually accessing some sample zarr, per my much earlier investigation.
@yarikoptic I'm starting to think that the "20 workers ceiling" is just a statistical illusion resulting from the combination of two factors:

1. As the number of concurrent requests to an axum server grows, the average response time increases. (Based on my observations, this happens regardless of whether the number of concurrent requests is greater than or less than the number of threads used by the server.)
2. As the number of concurrent requests made by a client grows, assuming the response time remains constant, the overall runtime decreases.

It seems that the decrease from (2) is dominant until some point in the 20-30 concurrent requests range, after which the increase from (1) is dominant, thereby producing a minimum runtime around 20 workers.
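A toy model of that interaction (with invented constants, not fitted to the measurements) illustrates how such a minimum can appear:

```rust
// Two competing effects: per-request latency grows with concurrency (assumed
// here to grow faster than linearly, which is what produces a minimum rather
// than a plateau), while more workers split the total work further.
fn response_time_secs(workers: f64) -> f64 {
    0.10 + 0.0002 * workers * workers
}

fn total_runtime_secs(total_requests: f64, workers: f64) -> f64 {
    // each worker processes its share of the requests sequentially
    (total_requests / workers) * response_time_secs(workers)
}

fn main() {
    for w in [1.0, 5.0, 10.0, 20.0, 25.0, 30.0, 40.0, 50.0] {
        println!("{:>4} workers -> {:>6.2}s", w, total_runtime_secs(500.0, w));
    }
}
```

With these made-up constants the runtime bottoms out around 20-25 workers and slowly climbs afterwards.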
It could indeed be the case. But why, and by how much, does the response time increase for (1)?
@yarikoptic I'm not entirely sure why it happens, but you can see the increase in the following statistical summary of the individual request times per number of client workers when using batchdav from my MacBook Pro to traverse [1]:
The effect also happens when both the `dandidav` server and the `batchdav` client are run on smaug:
Interesting. So indeed the Avg pretty much doubles going from 20 to 40, but not before; i.e., before 20 it grows more slowly than the factor in the number of workers (e.g. it grows only twofold going from 1 to 10). Just for academic purposes -- could you produce those Avg times only for our dandidav and for smaug, but with step 1? Then I would like to look at an estimate of Avg[n]/Avg[n/k] (k==2 or some other value) -- is it jumping rapidly higher at some specific value, is that value the same for both, and where does it cross the horizontal line for k -- i.e. where increasing the number of workers would no longer help.

So indeed your observation was probably spot on that response time is what keeps us from "going faster", but the question remains: why such rapid response-time growth? It could again be that there is some hardcoded size of some queue somewhere that processes only N requests in parallel; that would increase the average time, since the N+1st request would need to wait its turn first, which would add to the response time.
@yarikoptic

> Just for academic purposes -- could you produce those Avg times only for our dandidav and smaug only but with step 1?

Over what range of worker values?
How academic do you want to go? ;) If not much -- up to 50; if more -- up to 100 ;)
@yarikoptic

> then I would like to look at estimate of Avg[n]/Avg[n/k] (k==2 or some other)

I'm not sure what formula you're trying to describe here. If `k` is a fixed number, then the given expression will always be equal to `k`, since `Avg[n/k] = Avg[n]/k`, so `Avg[n]/Avg[n/k] = Avg[n]/(Avg[n]/k) = k`.
The statistics for 1 through 50 workers against webdav.dandiarchive.org are:
and the statistics for 1 through 50 workers against dandidav run on smaug are:
Would you like me to provide the raw data, containing each of the individual request times?
Here is what I get for k=2:

and for smaug:

Assuming that excursion is noise (in particular for smaug at 15), the ratio remains under 2, so in principle we should still scale up, although very slowly; and indeed the ratio plateaus at around 12-15 in both cases, i.e. somewhere at 24-30 parallel requests the response time at N becomes 1.8x the response time at N/2. If we zoom in on the smaug one, we see that it grows pretty much linearly until then. So there is IMHO some inherent factor, possibly dependent on the instance (what is the # of processors/cores on Heroku?), but not clearly so, which just caps scalability there.
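For reference, the ratio plotted above can be computed with a small helper along these lines (hypothetical code, illustrative numbers only):

```rust
// Sketch of the ratio being discussed: given avg[n] = average per-request
// response time measured with n workers, compute avg[n] / avg[n/2].
// A ratio below 2 means doubling the workers still shortens the overall
// runtime; a ratio of 2 means it no longer helps.
use std::collections::BTreeMap;

fn doubling_ratios(avg: &BTreeMap<u32, f64>) -> BTreeMap<u32, f64> {
    avg.iter()
        .filter_map(|(&n, &t)| {
            let half = avg.get(&(n / 2))?;
            Some((n, t / *half))
        })
        .collect()
}

fn main() {
    // illustrative numbers only, not measurements
    let avg: BTreeMap<u32, f64> =
        [(2, 0.12), (4, 0.14), (8, 0.18), (16, 0.27), (32, 0.52)].into();
    for (n, ratio) in doubling_ratios(&avg) {
        println!("Avg[{}]/Avg[{}] = {:.2}", n, n / 2, ratio);
    }
}
```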
BTW -- while running on smaug, what do you see for the CPU load on dandidav? (Since no actual download is involved, I guess it could differ from what was stated before.)
@yarikoptic The load steadily increased as the number of workers increased, capping out at around 4.2 for the 5-second load average.
What about CPU%? I.e., is that task becoming somehow IO-bound vs CPU-bound?
@yarikoptic I ran `watch -n1 'ps -q "$(pgrep -d, dandidav)" -o cuu,args'` while running `batchdav` against `dandidav` on smaug, and within a few seconds the CPU percentage was at 99.999%, where it remained for the rest of `batchdav`'s run.
So maybe the ceiling is due to a bottleneck in the implementation stack somewhere :-/ Are there convenient profiling tools in Rust land to see what it is mostly "thinking about"?
@yarikoptic There are a number of profilers. I'm starting out with flamegraph, but I'm getting the error:
Error:
Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
perf_event_paranoid setting is 3:
-1: Allow use of (almost) all events by all users
Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
failed to sample program
For the record, flamegraph's docs suggest setting `perf_event_paranoid` to -1.
Should be all set (but not for every user, and not persistent across reboots): you were added to the `perf_users` group, so either re-login or start a new shell using `newgrp perf_users`.
@yarikoptic Side note: From the flamegraph, I've discovered that the functionality for logging how much memory `dandidav` uses (added for #118) is taking up a lot of time; when I remove the functionality, requests to `dandidav` speed up by an order of magnitude. However, response times still increase with the number of concurrent requests.
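As a generic illustration (not the actual dandidav code) of why per-request memory measurement is costlier than periodic sampling -- reading /proc/self/statm here is just one Linux-specific way to get the resident set size:

```rust
// Hypothetical illustration: querying resident memory on every request adds
// filesystem I/O and parsing to the hot path, whereas a background task that
// samples it on a fixed interval keeps that cost off the request path.
use std::{fs, thread, time::Duration};

fn resident_bytes() -> Option<u64> {
    // /proc/self/statm: the second field is resident pages (Linux-specific)
    let statm = fs::read_to_string("/proc/self/statm").ok()?;
    let pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    Some(pages * 4096) // assumes 4 KiB pages
}

fn main() {
    // background sampler instead of per-request measurement
    thread::spawn(|| loop {
        if let Some(rss) = resident_bytes() {
            println!("resident memory: {rss} bytes");
        }
        thread::sleep(Duration::from_secs(10));
    });
    thread::sleep(Duration::from_secs(30)); // stand-in for the server's lifetime
}
```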
@yarikoptic I ended up posting a question about this on axum's repository: https://github.com/tokio-rs/axum/discussions/2772
@satra did just a quick & dirty test using a zarr viewer, comparing access directly via the S3 URL against our testbed dandi.centerforopenneuroscience.org instance (on a zarr in 000108 under `/dandisets/`, not the manifest-based endpoint, since no manifest for it was there yet), so there are no quantified characteristics available. We need to come up with some benchmarks and compare the different access paths to identify how much each component contributes and whether we could improve any aspect (e.g. parallel access etc.). I hope that it is not all simply due to redirects.

Later we need to create some nice "versioned zarr" with a few versions to use for benchmarks, and also to benchmark on the `/zarr` endpoints.