Open ocaisa opened 1 year ago
There is some discussion ongoing for this in Slack, and the (unsurprising) conclusion is that the closer you are to the S1 you use, the faster things are. In the case of AWS, we have an S1 in the same zone so we get nice fast speeds. My Magic Castle instance has to make plenty of hops to get to RUG...and there may be limitations being imposed by the network.
The Alliance configuration uses a CDN for cases like Magic Castle, and we should probably do something similar. We may even want multiple CDNs: one for use inside Azure, one for AWS, one for everyone else (Cloudflare). Managing CDNs will help us control any associated costs (and boost speed where we can).
Some timings. AWS VM to AWS S1:
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 0m20.198s
user 0m21.628s
sys 0m3.143s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 0m19.488s
user 0m21.364s
sys 0m3.317s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 0m19.732s
user 0m21.638s
sys 0m3.138s
AWS VM trying to talk to RUG:
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 8m52.507s
user 0m22.010s
sys 0m3.103s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 19m14.079s
user 0m21.985s
sys 0m3.117s
Nethogs here reports 10KB/s:
NetHogs version 0.8.7-23-gf281ca3
PID USER PROGRAM DEV SENT RECEIVED
27525 cvmfs /usr/bin/cvmfs2 ens5 0.564 10.464 KB/s
29331 ec2-us.. sshd: ec2-user@pts/1 ens5 0.252 0.103 KB/s
? root unknown TCP 0.000 0.000 KB/s
TOTAL 0.816 10.567 KB/s
Targeting S0 (also RUG) we also see poor performance. This is after 5m and the cache has been populated with 20MB...
[ec2-user@ip-172-31-1-106 ~]$ cvmfs_config stat
Running /usr/bin/cvmfs_config stat cvmfs-config.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 27078 48 31996 22 1 1 20735 10240000 0 130560 0 2 33.333 43 318 http://cernvmfs.gridpp.rl.ac.uk/cvmfs/cvmfs-config.cern.ch DIRECT 1
Running /usr/bin/cvmfs_config stat pilot.eessi-hpc.org:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 27525 40 39020 492 1 3 20735 10240000 43 130560 0 280 47.350 9882 55 http://cvmfs-s0.eessi-infra.org/cvmfs/pilot.eessi-hpc.org DIRECT 1
Result from RUG S0 (after S1 tests):
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 50m41.673s
user 0m21.844s
sys 0m3.271s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 27525 93 41144 492 3 42 796371 10240001 11 130560 0 3208 12.031 211025 63 http://cvmfs-s0.eessi-infra.org/cvmfs/pilot.eessi-hpc.org DIRECT 1
Tests from the UiB S1 show consistent results:
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 27525 96 39948 492 3 2 23402 10240000 11 130560 0 267 83.806 12568 3863 http://bgo-no.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org DIRECT 1
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 2m10.015s
user 0m21.881s
sys 0m3.107s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 2m9.182s
user 0m22.165s
sys 0m2.852s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 2m11.770s
user 0m22.078s
sys 0m2.867s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 27525 104 41316 492 1 42 796371 10240001 11 130560 0 3208 12.031 211017 1642 http://bgo-no.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org DIRECT 1
so it does appear to point to something shaping the network traffic at RUG
The total cache needed to run the Tensorflow example is about 800MB
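As a rough sanity check on these numbers, the wall-clock time of a cold run should be about RX(K) divided by SPEED(K/S). A minimal sketch, assuming SPEED(K/S) is an average over the whole transfer (`est_minutes` is a hypothetical helper, not part of CVMFS):

```shell
# Estimate transfer time in minutes from the RX(K) and SPEED(K/S) columns
# of `cvmfs_config stat`. Assumes SPEED is an average over the whole transfer.
est_minutes() {
    rx_kb=$1      # RX(K): kilobytes received
    speed_kbs=$2  # SPEED(K/S): average kilobytes per second
    awk -v rx="$rx_kb" -v sp="$speed_kbs" 'BEGIN { printf "%.1f\n", rx / sp / 60 }'
}

est_minutes 211025 63     # values from the RUG S0 run above
est_minutes 211017 1642   # values from the UiB S1 run above
```

This prints 55.8 and 2.1, in the same ballpark as the 50m41s and roughly 2m10s measured above, so the download time appears to account for nearly all of the difference between the runs.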
@bedroge We need to identify what is causing the traffic issues at RUG as this (likely) also impacts the speed of updates to our S1. Also another reason to push eessi.io so we can start configuring a CDN.
@ocaisa Can you also mention how you enforce using a particular Stratum-1, just in case others want to do some testing too?
Basically, I am editing the EESSI configuration to only give one option and then reconfiguring CVMFS:
# Edit the config file to point to a single S1 option, e.g.,
# CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo vi /etc/cvmfs/domain.d/eessi-hpc.org.conf
# Reconfigure CVMFS
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
# Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
You can then check the (averaged) bandwidth and cache usage with
cvmfs_config stat pilot.eessi-hpc.org | column -t
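If you only want one number out of that table (for logging, say), the column can be picked by header name. A small sketch, assuming the header line is followed directly by the data row and no field contains spaces (`stat_field` is a hypothetical helper):

```shell
# Print the value of a named column from `cvmfs_config stat` output on stdin.
# Lines before the header (e.g. the "Running ..." line) are skipped.
stat_field() {
    awk -v name="$1" '
        col == 0 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        { print $col; exit }
    '
}

# Example (needs a mounted repository):
#   cvmfs_config stat pilot.eessi-hpc.org | stat_field "SPEED(K/S)"
#   cvmfs_config stat pilot.eessi-hpc.org | stat_field HOST
```

Looking the column up by name rather than by position should keep this working if a CVMFS version adds or reorders columns.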
I was wondering whether we could add this to some monitoring. I'll see what I can come up with, but it's tricky when the jobs take tens of minutes within a container, and putting timeouts on them may be non-trivial.
We could just do this in the eessi-demo repo with GitHub Actions: one job per S1, run the example three times, and give the jobs a time limit (say 12 minutes for 3 runs). We would run that every couple of days.
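Independent of where the job runs, the time limit itself can be sketched with coreutils `timeout`. This is only an illustration of the "three runs under one limit" idea; `run_three_with_limit` is a hypothetical name:

```shell
# Run a command three times under one overall time limit.
# Returns 124 (the exit status of coreutils `timeout`) if the limit is hit,
# and 1 if any of the runs fails.
run_three_with_limit() {
    limit=$1
    shift
    timeout "$limit" sh -c 'for i in 1 2 3; do "$@" || exit 1; done' sh "$@"
}

# Example: three runs of the demo within 12 minutes in total
#   run_three_with_limit 12m ./run.sh
```

To get cold-cache timings, the command passed in would need to be a small wrapper script that wipes the cache before calling the example.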
I did look at that, but from https://docs.github.com/en/site-policy/github-terms/github-terms-for-additional-products-and-features#5-actions-and-packages:
Actions and any elements of the Actions product or service may not be used in violation of the Agreement, the GitHub Acceptable Use Polices, or the GitHub Actions service limitations set forth in the Actions documentation. Additionally, regardless of whether an Action is using self-hosted runners, Actions should not be used for:
My emphasis. This includes:
- any activity that places a burden on our servers, where that burden is disproportionate to the benefits provided to users (for example, don't use Actions as a content delivery network or as part of a serverless application, but a low benefit Action could be ok if it’s also low burden); or
- if using GitHub-hosted runners, any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used.
See https://medium.com/average-coder/can-you-use-github-actions-for-monitoring-e9c6cfe79ef4 for a neat idea though.
Probably a bit cleaner/easier than editing the .conf file directly: you can also make a local config file, in this case that would be /etc/cvmfs/domain.d/eessi-hpc.org.local (instead of .conf), where you can override that server list parameter (or anything else).
Update on the instructions to reproduce:
# Edit the config file to point to a single S1 option, e.g.,
# CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
# Reconfigure CVMFS
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
# Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
You can then check the (averaged) bandwidth and cache usage with
cvmfs_config stat pilot.eessi-hpc.org | column -t
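To compare several Stratum 1s in one go, the steps above can be wrapped in a loop. A sketch only, assuming bash (for `time`), sudo rights, and that `./run.sh` from eessi-demo is in the current directory; the function names are hypothetical:

```shell
# Print the override line for one Stratum 1 host.
server_url_line() {
    printf 'CVMFS_SERVER_URL="http://%s/cvmfs/@fqrn@"\n' "$1"
}

# For each host given as an argument: pin it in the .local override,
# reconfigure CVMFS, time three cold-cache runs, then show the stats.
benchmark_s1s() {
    for host in "$@"; do
        echo "== $host =="
        server_url_line "$host" | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local > /dev/null
        sudo cvmfs_config setup
        for run in 1 2 3; do
            sudo cvmfs_config wipecache > /dev/null 2>&1
            time ./run.sh > /dev/null
        done
        cvmfs_config stat pilot.eessi-hpc.org | column -t
    done
}

# Example:
#   benchmark_s1s rug-nl.stratum1.cvmfs.eessi-infra.org bgo-no.stratum1.cvmfs.eessi-infra.org
```

Remember to remove the .local file afterwards, otherwise the client stays pinned to the last host tested.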
Something appears to have changed at RUG today and I am no longer seeing performance issues on either S0 or S1 (indeed performance is significantly better than previous best case scenarios):
[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://ssr4cc.hpc.rug.nl/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
CVMFS_SERVER_URL="http://ssr4cc.hpc.rug.nl/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m14.691s
user 0m21.103s
sys 0m3.339s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m14.965s
user 0m21.537s
sys 0m2.890s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m14.755s
user 0m21.270s
sys 0m3.228s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 26515 6 39500 492 2 42 796371 10240001 11 130560 0 3208 12.031 211017 3186 http://ssr4cc.hpc.rug.nl/cvmfs/pilot.eessi-hpc.org DIRECT 1
[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m18.020s
user 0m21.655s
sys 0m2.996s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m9.987s
user 0m21.722s
sys 0m2.761s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 1m12.603s
user 0m21.473s
sys 0m3.354s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 26515 12 41504 492 1 42 796304 10239934 11 130560 0 3208 12.031 211017 3275 http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org DIRECT 1
A previous result for the RUG S1 was taken just an hour before these:
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 2m3.672s
user 0m21.503s
sys 0m3.282s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
real 26m44.860s
user 0m21.737s
sys 0m3.181s
so natural to assume that something changed on the RUG side when the issue was raised with the network team there...just need to figure out what.
EDIT: Assumption about networking at RUG seems to be wrong, leaving us in the unfortunate position of not having a clue why things improved 😞
As it happens, there seems to have been a DDoS attack on CVMFS services around the time we were seeing reduced performance, which may be connected.
@ocaisa You saw the bad performance again today though, right? So it wasn't a temporary fluke?
Yes, the problem reoccurred today.
PR for this open in https://github.com/EESSI/eessi-demo/pull/24 (not sure if it's the right location, but it works for now)
Last week I gave an EESSI tutorial and running the examples on a vanilla instance on AWS was lightning fast from a cold start. In contrast, my runs inside a fresh Magic Castle cluster I brought up today were very slow: it took 10 minutes for the initial run of Tensorflow (and 36s when repeating the run).
The main difference I can think of is the response time from the different S1s. Is there any way we can do a speed check for our Stratum 1s to make sure they are operating as fast as we expect them to?
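One cheap way to compare response times, as a sketch: fetch the repository manifest (`.cvmfspublished`) from each Stratum 1 with curl and print the total request time. The manifest is small, so this mostly measures latency rather than sustained throughput; the function names and the 30 second cap are assumptions for illustration:

```shell
# Build the URL of the repository manifest on a Stratum 1.
manifest_url() {
    printf 'http://%s/cvmfs/%s/.cvmfspublished\n' "$1" "$2"
}

# Usage: check_s1s <repository> <host>...
# Prints "<host> <seconds>" per host, or "<host> failed" if curl errors out.
check_s1s() {
    repo=$1
    shift
    for host in "$@"; do
        t=$(curl -s -o /dev/null --max-time 30 -w '%{time_total}' \
            "$(manifest_url "$host" "$repo")") || t="failed"
        echo "$host $t"
    done
}

# Example:
#   check_s1s pilot.eessi-hpc.org rug-nl.stratum1.cvmfs.eessi-infra.org bgo-no.stratum1.cvmfs.eessi-infra.org
```

A throughput check would need a larger download, for example the cold-cache run of a demo as used elsewhere in this thread.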