EESSI / filesystem-layer

Filesystem layer of the EESSI project
https://eessi.github.io/docs/filesystem_layer
GNU General Public License v2.0
7 stars 17 forks

Speed-test/health-check for Stratum-1's #151

Open ocaisa opened 1 year ago

ocaisa commented 1 year ago

Last week I gave an EESSI tutorial, and running the examples on a vanilla instance on AWS was lightning fast from a cold start. In contrast, my runs inside a fresh Magic Castle cluster I brought up today were very slow: it took 10 minutes for the initial run of Tensorflow (and 36s when repeating the run).

The main difference I can think of is the response time from the different S1s. Is there any way we can do a speed check for our Stratum-1s to make sure they are operating as fast as we expect them to?
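One rough way to sketch such a check: time a small request against each Stratum-1 and pick the fastest responder. The helper below is hypothetical (not part of any EESSI tooling); the hostnames and the idea of probing the repository manifest (`.cvmfspublished`, which every CVMFS repository serves at its root) are illustrative. This measures latency rather than sustained bandwidth, so a real health check would also want a larger download.

```shell
# Hypothetical helper: given lines of "host seconds" on stdin,
# print the host with the lowest timing.
fastest_s1() {
    sort -k2,2g | head -n 1 | cut -d' ' -f1
}

# Example with fixed timings; in practice the numbers would come from
# something like:
#   curl -s -o /dev/null -w '%{time_total}' \
#       "http://${S1}/cvmfs/pilot.eessi-hpc.org/.cvmfspublished"
printf '%s\n' \
    'rug-nl.stratum1.cvmfs.eessi-infra.org 0.182' \
    'bgo-no.stratum1.cvmfs.eessi-infra.org 0.054' |
    fastest_s1
# prints: bgo-no.stratum1.cvmfs.eessi-infra.org
```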

ocaisa commented 1 year ago

There is some discussion ongoing about this in Slack, and the (unsurprising) conclusion is that the closer you are to the S1 you use, the faster things are. In the case of AWS, we have an S1 in the same zone, so we get nice fast speeds. My Magic Castle instance has to make plenty of hops to get to RUG... and there may be limitations being imposed by the network.

The Alliance configuration uses a CDN for cases like Magic Castle, and we should probably do something similar. We may even want multiple CDNs: one for use inside Azure, one for AWS, one for everyone else (Cloudflare). Managing CDNs will help us control any associated costs (and boost speed where we can).

terjekv commented 1 year ago

Some timings. AWS VM to AWS S1:

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    0m20.198s
user    0m21.628s
sys     0m3.143s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    0m19.488s
user    0m21.364s
sys     0m3.317s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    0m19.732s
user    0m21.638s
sys     0m3.138s

AWS VM trying to talk to RUG:

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    8m52.507s
user    0m22.010s
sys     0m3.103s

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    19m14.079s
user    0m21.985s
sys     0m3.117s

Nethogs here reports 10KB/s:

 NetHogs version 0.8.7-23-gf281ca3

    PID USER     PROGRAM               DEV         SENT      RECEIVED      
  27525 cvmfs    /usr/bin/cvmfs2       ens5        0.564      10.464 KB/s
  29331 ec2-us.. sshd: ec2-user@pts/1  ens5        0.252       0.103 KB/s
      ? root     unknown TCP           0.000       0.000 KB/s

  TOTAL                                      0.816      10.567 KB/s

Targeting the S0 (also at RUG), we see similarly poor performance. This is after 5 minutes, and the cache has only been populated with 20MB...

[ec2-user@ip-172-31-1-106 ~]$ cvmfs_config stat

Running /usr/bin/cvmfs_config stat cvmfs-config.cern.ch:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 27078 48 31996 22 1 1 20735 10240000 0 130560 0 2 33.333 43 318 http://cernvmfs.gridpp.rl.ac.uk/cvmfs/cvmfs-config.cern.ch DIRECT 1

Running /usr/bin/cvmfs_config stat pilot.eessi-hpc.org:
VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2.10.1.0 27525 40 39020 492 1 3 20735 10240000 43 130560 0 280 47.350 9882 55 http://cvmfs-s0.eessi-infra.org/cvmfs/pilot.eessi-hpc.org DIRECT 1
ocaisa commented 1 year ago

Result from RUG S0 (after S1 tests):

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    50m41.673s
user    0m21.844s
sys     0m3.271s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)   SPEED(K/S)  HOST                                                       PROXY   ONLINE
2.10.1.0  27525  93         41144   492       3           42          796371       10240001     11       130560   0        3208    12.031      211025  63          http://cvmfs-s0.eessi-infra.org/cvmfs/pilot.eessi-hpc.org  DIRECT  1
ocaisa commented 1 year ago

Tests from the UiB S1 show consistent results:

[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)  SPEED(K/S)  HOST                                                                    PROXY   ONLINE
2.10.1.0  27525  96         39948   492       3           2           23402        10240000     11       130560   0        267     83.806      12568  3863        http://bgo-no.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org  DIRECT  1
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    2m10.015s
user    0m21.881s
sys     0m3.107s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    2m9.182s
user    0m22.165s
sys     0m2.852s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    2m11.770s
user    0m22.078s
sys     0m2.867s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)   SPEED(K/S)  HOST                                                                    PROXY   ONLINE
2.10.1.0  27525  104        41316   492       1           42          796371       10240001     11       130560   0        3208    12.031      211017  1642        http://bgo-no.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org  DIRECT  1

So it does appear to point to something shaping the network traffic at RUG.

ocaisa commented 1 year ago

The total cache needed to run the Tensorflow example is about 800MB.

ocaisa commented 1 year ago

@bedroge We need to identify what is causing the traffic issues at RUG, as this (likely) also impacts the speed of updates to our S1s. It's also another reason to push eessi.io so we can start configuring a CDN.

boegel commented 1 year ago

@ocaisa Can you also mention how you enforce using a particular Stratum-1, just in case others want to do some testing too?

ocaisa commented 1 year ago

Basically, I am editing the EESSI configuration to only give one option and then reconfiguring CVMFS:

# Edit the config file to point to a single S1 option, e.g.,
# CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo vi /etc/cvmfs/domain.d/eessi-hpc.org.conf
# Reconfigure CVMFS 
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
# Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

You can then check the (averaged) bandwidth and cache usage with

cvmfs_config stat pilot.eessi-hpc.org | column -t
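If you want to pull a single figure (say the SPEED(K/S) column) out of that stat output for scripting, a small awk helper works, since `cvmfs_config stat` prints a header line followed by a matching line of values. The helper name below is hypothetical; the sample input is one of the stat lines captured earlier in this issue.

```shell
# Hypothetical helper: extract one named column from two-line
# header/values output such as `cvmfs_config stat <repo>`.
stat_field() {
    # $1 = column name, stdin = header line + values line
    awk -v col="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) n = i }
        NR == 2 { if (n) print $n }
    '
}

# Example using a (trimmed) stat line from this thread:
printf '%s\n%s\n' \
    'VERSION PID SPEED(K/S) HOST' \
    '2.10.1.0 27525 55 http://cvmfs-s0.eessi-infra.org/cvmfs/pilot.eessi-hpc.org' |
    stat_field 'SPEED(K/S)'
# prints: 55
# In practice: cvmfs_config stat pilot.eessi-hpc.org | stat_field 'SPEED(K/S)'
```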
terjekv commented 1 year ago

I was wondering if we could add this to some monitoring. I'll see what I can come up with, but it's tricky when the jobs take tens of minutes inside a container; they may be non-trivial to put timeouts on.

ocaisa commented 1 year ago

We could just do this in the eessi-demo repo with GitHub Actions: one job per S1, run the example three times, and give the jobs a time limit (say 12 minutes for 3 runs). We'd run that every couple of days.
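The time-limit part of that idea can be sketched with coreutils `timeout`: run the batch under a wall-clock budget and flag the Stratum-1 as unhealthy if it is exceeded. The function name and the `sleep` stand-ins below are hypothetical; in a real job the command would be the three wipecache + run.sh cycles against one Stratum-1, with a 12-minute budget.

```shell
# Illustrative pattern: run a command under a wall-clock budget.
run_with_budget() {
    # $1 = budget (seconds, or e.g. "12m"), rest = command to run
    budget="$1"; shift
    if timeout "$budget" "$@"; then
        echo "within budget"
    else
        echo "exceeded budget or failed"
    fi
}

run_with_budget 2 sleep 1   # fast-server analogue: prints "within budget"
run_with_budget 1 sleep 5   # slow-server analogue: killed after 1s,
                            # prints "exceeded budget or failed"
```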

terjekv commented 1 year ago

I did look at that, but from https://docs.github.com/en/site-policy/github-terms/github-terms-for-additional-products-and-features#5-actions-and-packages:

Actions and any elements of the Actions product or service may not be used in violation of the Agreement, the GitHub Acceptable Use Polices, or the GitHub Actions service limitations set forth in the Actions documentation. Additionally, regardless of whether an Action is using self-hosted runners, Actions should not be used for:

My emphasis. This includes:

  • any activity that places a burden on our servers, where that burden is disproportionate to the benefits provided to users (for example, don't use Actions as a content delivery network or as part of a serverless application, but a low benefit Action could be ok if it’s also low burden); or
  • if using GitHub-hosted runners, any other activity unrelated to the production, testing, deployment, or publication of the software project associated with the repository where GitHub Actions are used.

See https://medium.com/average-coder/can-you-use-github-actions-for-monitoring-e9c6cfe79ef4 for a neat idea though.

bedroge commented 1 year ago

> Basically, I am editing the EESSI configuration to only give one option and then reconfiguring CVMFS:
>
> # Edit the config file to point to a single S1 option, e.g.,
> # CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
> [EESSI pilot 2021.12] $ sudo vi /etc/cvmfs/domain.d/eessi-hpc.org.conf
> # Reconfigure CVMFS
> [EESSI pilot 2021.12] $ sudo cvmfs_config setup
> # Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
> [EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null
>
> You can then check the (averaged) bandwidth and cache usage with
>
> cvmfs_config stat pilot.eessi-hpc.org | column -t

Probably a bit cleaner/easier: you can also make a local config file, in this case that would be /etc/cvmfs/domain.d/eessi-hpc.local (instead of .conf), where you can override that server list parameter (or anything else).

ocaisa commented 1 year ago

Update on the instructions to reproduce:

# Edit the config file to point to a single S1 option, e.g.,
# CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
# Reconfigure CVMFS 
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
# Wipe the cache and run the example (I used Tensorflow from github.com/EESSI/eessi-demo)
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

You can then check the (averaged) bandwidth and cache usage with

cvmfs_config stat pilot.eessi-hpc.org | column -t
ocaisa commented 1 year ago

Something appears to have changed at RUG today, and I am no longer seeing performance issues on either the S0 or the S1 (indeed, performance is significantly better than previous best-case scenarios):

[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://ssr4cc.hpc.rug.nl/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
CVMFS_SERVER_URL="http://ssr4cc.hpc.rug.nl/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m14.691s
user    0m21.103s
sys     0m3.339s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m14.965s
user    0m21.537s
sys     0m2.890s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m14.755s
user    0m21.270s
sys     0m3.228s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)   SPEED(K/S)  HOST                                                PROXY   ONLINE
2.10.1.0  26515  6          39500   492       2           42          796371       10240001     11       130560   0        3208    12.031      211017  3186        http://ssr4cc.hpc.rug.nl/cvmfs/pilot.eessi-hpc.org  DIRECT  1
[EESSI pilot 2021.12] $ echo 'CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"' | sudo tee /etc/cvmfs/domain.d/eessi-hpc.org.local
CVMFS_SERVER_URL="http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/@fqrn@"
[EESSI pilot 2021.12] $ sudo cvmfs_config setup
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m18.020s
user    0m21.655s
sys     0m2.996s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m9.987s
user    0m21.722s
sys     0m2.761s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    1m12.603s
user    0m21.473s
sys     0m3.354s
[EESSI pilot 2021.12] $ cvmfs_config stat pilot.eessi-hpc.org | column -t
VERSION   PID    UPTIME(M)  MEM(K)  REVISION  EXPIRES(M)  NOCATALOGS  CACHEUSE(K)  CACHEMAX(K)  NOFDUSE  NOFDMAX  NOIOERR  NOOPEN  HITRATE(%)  RX(K)   SPEED(K/S)  HOST                                                                    PROXY   ONLINE
2.10.1.0  26515  12         41504   492       1           42          796304       10239934     11       130560   0        3208    12.031      211017  3275        http://rug-nl.stratum1.cvmfs.eessi-infra.org/cvmfs/pilot.eessi-hpc.org  DIRECT  1

A previous result for the RUG S1 was taken just an hour before these:

[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    2m3.672s
user    0m21.503s
sys     0m3.282s
[EESSI pilot 2021.12] $ sudo cvmfs_config wipecache >& /dev/null; ./run.sh > /dev/null

real    26m44.860s
user    0m21.737s
sys     0m3.181s

So it's natural to assume that something changed on the RUG side when the issue was raised with the network team there... just need to figure out what.

EDIT: The assumption about networking at RUG seems to be wrong, leaving us in the unfortunate position of not having a clue why things improved 😞

ocaisa commented 1 year ago

As it happens, there seems to have been a DDoS attack on CVMFS services around the time we were seeing reduced performance; this may be connected.

boegel commented 1 year ago

@ocaisa You saw the bad performance again today though, right? So it wasn't a temporary fluke?

ocaisa commented 1 year ago

Yes, the problem recurred today.

ocaisa commented 1 year ago

A PR for this is open at https://github.com/EESSI/eessi-demo/pull/24 (not sure if it's the right location, but it works for now).