dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Evaluate performance of CouchDB in host vs container mode #11567

Closed: amaltaro closed this issue 4 months ago

amaltaro commented 1 year ago

Impact of the new feature: WMAgent

Is your feature request related to a problem? Please describe. As part of running WMAgent in a container environment, composed with database containers as well, we need to perform load/stress tests to evaluate the performance of the CouchDB container.

Describe the solution you'd like Come up with a reliable and meaningful setup to evaluate the performance (latency, throughput, etc.) of CouchDB in two deployment modes:

- host mode (CouchDB installed and running directly on the node)
- container mode (CouchDB running in a docker container)

To be provided with this issue:

Describe alternatives you've considered None

Additional context Depends on: https://github.com/dmwm/WMCore/issues/11312 Part of the following meta issue: https://github.com/dmwm/WMCore/issues/11314

vkuznet commented 5 months ago

For the tests below I used the stock CouchDB release obtained from the official site and Docker Hub; the version is 3.3.3.

Docker setup
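
The concrete docker commands were collapsed in the original comment. A minimal sketch of how the stock CouchDB 3.3.3 container is typically started; the admin credentials, data path and container name are illustrative:

    # pull the official CouchDB image from Docker Hub
    docker pull couchdb:3.3.3

    # start it with admin credentials, the standard port exposed,
    # and a host directory mounted for the database files
    docker run -d --name couchdb \
        -e COUCHDB_USER=admin -e COUCHDB_PASSWORD=password \
        -p 5984:5984 \
        -v /wma/vk/CouchDB/data:/opt/couchdb/data \
        couchdb:3.3.3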

CouchDB initialization
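
The initialization steps were also collapsed; based on the later note about setting up `_users` and other databases, this is roughly what a single-node initialization over the HTTP API looks like (credentials are the same illustrative ones as above):

    # create the system databases required by a single-node CouchDB
    curl -X PUT http://admin:password@localhost:5984/_users
    curl -X PUT http://admin:password@localhost:5984/_replicator
    curl -X PUT http://admin:password@localhost:5984/_global_changes

    # create the database used by the load/stress tests below
    curl -X PUT http://admin:password@localhost:5984/test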

Load/stress tests

perform load/stress test

    hey -n 200 -c 50 -m POST -H "Content-Type: application/json" -D /path/wm.json -disable-keepalive

    /afs/cern.ch/user/v/valya/public/hey_linux -n 200 -c 50 -m POST -H "Content-Type: application/json" -D /wma/vk/CouchDB/wm.json -disable-keepalive -disable-compression http://admin:password@localhost:5984/test 2>&1 1>& log


Results are the following:

    69 requests done.
    182 requests done.
    All requests done.

    Summary:
      Total:        1.0729 secs
      Slowest:      0.4308 secs
      Fastest:      0.0258 secs
      Average:      0.2296 secs
      Requests/sec: 186.4051
      Total data:   19000 bytes
      Size/request: 95 bytes

    Status code distribution:
      [201] 200 responses

    Response time histogram:
      0.026 [1]  |∎
      0.066 [31] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.107 [0]  |
      0.147 [12] |∎∎∎∎∎∎∎∎∎∎∎
      0.188 [8]  |∎∎∎∎∎∎∎
      0.228 [44] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.269 [36] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.309 [17] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.350 [17] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.390 [27] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.431 [7]  |∎∎∎∎∎∎

    Latency distribution:
      10% in 0.0424 secs
      25% in 0.1867 secs
      50% in 0.2357 secs
      75% in 0.3330 secs
      90% in 0.3828 secs
      95% in 0.3859 secs
      99% in 0.4307 secs


### Local CouchDB setup

installation instructions

    sudo yum install -y yum-utils
    sudo yum-config-manager --add-repo https://couchdb.apache.org/repo/couchdb.repo
    sudo dnf config-manager --set-enabled crb
    sudo dnf install epel-release epel-next-release
    sudo yum install -y mozjs78
    sudo yum install -y couchdb

initial setup, see

https://docs.couchdb.org/en/latest/install/unix.html#installation-using-the-apache-couchdb-convenience-binary-packages

enable admin login name and password

sudo vim /opt/couchdb/etc/local.ini
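
A sketch of the relevant edit, assuming the admin user name and password are illustrative (CouchDB hashes the plain-text password on its first start):

    # append an admin account to the CouchDB configuration
    sudo tee -a /opt/couchdb/etc/local.ini > /dev/null <<'EOF'
    [admins]
    admin = password
    EOF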

start local couchdb

sudo -i -u couchdb /opt/couchdb/bin/couchdb
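
A quick way to verify the server is up is to hit the root endpoint, which returns the welcome document with the server version:

    # expected output is something like {"couchdb":"Welcome","version":"3.3.3",...}
    curl http://localhost:5984/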

At this step I performed the `CouchDB initialization` steps to set up `_users` and the other databases.
Finally, I repeated the steps listed in `Load/stress tests` and got the following results:

    All requests done.

    Summary:
      Total:        0.0543 secs
      Slowest:      0.0227 secs
      Fastest:      0.0045 secs
      Average:      0.0123 secs
      Requests/sec: 3683.9033
      Total data:   19000 bytes
      Size/request: 95 bytes

    Status code distribution:
      [201] 200 responses

    Response time histogram:
      0.004 [1]  |∎
      0.006 [9]  |∎∎∎∎∎∎∎∎∎
      0.008 [20] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.010 [32] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.012 [37] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.014 [18] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.015 [42] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.017 [28] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.019 [2]  |∎∎
      0.021 [3]  |∎∎∎
      0.023 [8]  |∎∎∎∎∎∎∎∎

    Latency distribution:
      10% in 0.0077 secs
      25% in 0.0093 secs
      50% in 0.0118 secs
      75% in 0.0151 secs
      90% in 0.0163 secs
      95% in 0.0206 secs
      99% in 0.0225 secs



### Results comparison
As can be seen from the benchmark results above:
- local CouchDB
  - almost 3700 requests per second for POST HTTP calls
- docker CouchDB
  - below 200 requests per second for POST HTTP calls

### TODO
- set up both instances, i.e. local CouchDB and docker CouchDB, and run the tests from outside the node, i.e. run the hey tool from another node to add network latency (a sketch of such a remote run is shown after this list)
- try different HTTP benchmarks, i.e. POST and GET
- vary the number of concurrent calls
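
As a sketch of the first TODO item, running the same benchmark from another node only requires pointing the client at the CouchDB host instead of localhost (the host name below is a placeholder):

    # run the hey client from a different node against the CouchDB host
    /afs/cern.ch/user/v/valya/public/hey_linux -n 200 -c 50 -m POST \
        -H "Content-Type: application/json" \
        -D wm.json \
        -disable-keepalive -disable-compression \
        http://admin:password@couchdb-host.cern.ch:5984/test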

### References
[1] https://hub.docker.com/_/couchdb/
[2] https://github.com/rakyll/hey, or the vkuznet fork patched to support X509: https://github.com/vkuznet/hey
amaltaro commented 5 months ago

Scary results! However, I would highly recommend using the products that we will actually be using, instead of the upstream ones.

That said, I would suggest testing the current COMP RPM couchdb package against the wmagent-couchdb in CMSKubernetes. Ideally these tests should be performed in the very same environment as well (including the node); otherwise the comparison is hard to digest.

vkuznet commented 5 months ago

I set up CouchDB (3.2.2) on one of my VMs using COMP RPMs; in fact, I simply used rsync to copy the /data/srv area from one of the WMAgent CERN nodes (a sketch of that copy is shown below). Then I reran the hey test against the local CouchDB; the result was 2318 req/sec. Then, using the wmagent-couchdb image, I got a similar result of 2379 req/sec.
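
For the record, that copy amounts to something along the following lines, with the source node name as a placeholder:

    # copy the deployed CouchDB area from an existing WMAgent node
    rsync -av wmagent-node.cern.ch:/data/srv/ /data/srv/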

To sum up:

At this step, I do not know whether the slowness of the stock CouchDB image (3.3.3) is due to running it on RH9 or due to the image content itself. But I'm happy to see comparable performance on CC7 between the local CouchDB installed via RPM and our wmagent-couchdb docker image.

vkuznet commented 5 months ago

Another test performed with the wmagent-couchdb image on an RH9 node:

  1. docker image without host network
    
    # CouchDB server
    docker run -it -e COUCHDB_USER=admin -e COUCHDB_PASSWORD=password -p 5984:5984 --volume /wma/vk/CouchDB/data:/opt/couchdb/data -v /wma/vk/secrets:/data/admin/wmagent registry.cern.ch/cmsweb/wmagent-couchdb

hey client

/afs/cern.ch/user/v/valya/public/hey_linux -n 200 -c 50 -m POST -H "Content-Type: application/json" -D /wma/vk/CouchDB/wm.json -disable-keepalive -disable-compression http://admin:password@localhost:5984/test

Results (3 iterations of the hey client):
- 1282 req/sec
- 1620 req/sec
- 1640 req/sec

Average: 1514 req/sec

2. docker image with host network (use the `--network=host` option):

CouchDB server

docker run --network=host -it -e COUCHDB_USER=admin -e COUCHDB_PASSWORD=password -p 5984:5984 --volume /wma/vk/CouchDB/data:/opt/couchdb/data -v /wma/vk/secrets:/data/admin/wmagent registry.cern.ch/cmsweb/wmagent-couchdb

hey client

/afs/cern.ch/user/v/valya/public/hey_linux -n 200 -c 50 -m POST -H "Content-Type: application/json" -D /wma/vk/CouchDB/wm.json -disable-keepalive -disable-compression http://admin:password@localhost:5984/test


Results (3 iterations of the hey client):
- 1657 req/sec
- 1953 req/sec
- 1832 req/sec

Average: 1814 req/sec

### Summary
There is a slight performance difference (based on HTTP POST requests) between the container network and the host network, where the latter provides more throughput. But in both cases the performance is quite decent, above 1500 req/sec, and I doubt the difference (around 300 req/sec higher in the host-network case) will have any impact on WM operations.

vkuznet commented 5 months ago

@amaltaro do you have any other suggestions for testing based on the provided results?

amaltaro commented 5 months ago

That is very good, thanks Valentin! Do I understand it right that you tested both GET and POST calls to CouchDB?

I think it's important to keep track of this evaluation and the results in our wmcore-docs repository. Please also be explicit with the:

anpicci commented 5 months ago

@vkuznet I would suggest keeping an eye on the progress of issue #11635, to check whether it has an impact on the tests documented in this issue.

vkuznet commented 5 months ago

Here is another summary in table format, where concurrency `-n 200 -c 50` means 200 requests using 50 concurrent clients:

using the wmagent-couchdb docker image

| iteration | Couch setup | Linux OS | deployment | Test method | concurrency | Req/sec |
| --- | --- | --- | --- | --- | --- | --- |
| round 1 | No host | RH9 | image | POST | -n 200 -c 50 | 1198 |
| round 1 | No host | RH9 | image | GET | -n 200 -c 50 | 1601 |
| round 2 | No host | RH9 | image | POST | -n 200 -c 50 | 858 |
| round 2 | No host | RH9 | image | GET | -n 200 -c 50 | 1688 |
| round 3 | No host | RH9 | image | POST | -n 200 -c 50 | 1019 |
| round 3 | No host | RH9 | image | GET | -n 200 -c 50 | 1751 |
| average | No host | RH9 | image | POST | -n 200 -c 50 | 1025 |
| average | No host | RH9 | image | GET | -n 200 -c 50 | 1680 |
| round 1 | host network | RH9 | image | POST | -n 200 -c 50 | 1636 |
| round 1 | host network | RH9 | image | GET | -n 200 -c 50 | 3013 |
| round 2 | host network | RH9 | image | POST | -n 200 -c 50 | 2355 |
| round 2 | host network | RH9 | image | GET | -n 200 -c 50 | 3597 |
| round 3 | host network | RH9 | image | POST | -n 200 -c 50 | 2390 |
| round 3 | host network | RH9 | image | GET | -n 200 -c 50 | 2908 |
| average | host network | RH9 | image | POST | -n 200 -c 50 | 2127 |
| average | host network | RH9 | image | GET | -n 200 -c 50 | 3172 |

using RPM deployment on CC7

| iteration | Couch setup | Linux OS | deployment | Test method | concurrency | Req/sec |
| --- | --- | --- | --- | --- | --- | --- |
| round 1 | host | CC7 | RPM | POST | -n 200 -c 50 | 2131 |
| round 1 | host | CC7 | RPM | GET | -n 200 -c 50 | 2545 |
| round 2 | host | CC7 | RPM | POST | -n 200 -c 50 | 2246 |
| round 2 | host | CC7 | RPM | GET | -n 200 -c 50 | 3343 |
| round 3 | host | CC7 | RPM | POST | -n 200 -c 50 | 2715 |
| round 3 | host | CC7 | RPM | GET | -n 200 -c 50 | 3593 |
| average | host | CC7 | RPM | POST | -n 200 -c 50 | 2364 |
| average | host | CC7 | RPM | GET | -n 200 -c 50 | 3160 |

### Summary

Based on the provided results, we see little difference in the average numbers between the RPM deployment and the docker image using host-network deployment. But using the docker image without the host network degrades the performance of both POST and GET requests by a factor of 2 on the RH9 host.

### References

Here is the shell script used to generate all the tests above:

#!/bin/bash
curl -X DELETE http://login:password@127.0.0.1:5984/test
curl -X PUT http://login:password@127.0.0.1:5984/test
file=/afs/cern.ch/user/v/valya/public/wm.json

# insert one document and get its document id (jq -r strips the surrounding quotes so the id can be used directly in the GET URL below)
did=`curl -s -X POST http://login:password@127.0.0.1:5984/test -H "Content-Type: application/json" -d@$file | jq -r '.id'`

# perform POST tests
echo "POST test"
/afs/cern.ch/user/v/valya/public/hey_linux \
    -n 200 -c 50 -m POST \
    -H "Content-Type: application/json" \
    -D $file \
    -disable-keepalive \
    -disable-compression \
    http://login:password@localhost:5984/test 2>&1 1>& couch-test-post.log
grep "Requests/sec" couch-test-post.log

# perform get tests
echo ""
echo "GET test with id=$did"
/afs/cern.ch/user/v/valya/public/hey_linux \
    -n 200 -c 50 -m GET \
    -disable-keepalive \
    -disable-compression \
    "http://login:password@localhost:5984/test/$did" 2>&1 1>& couch-test-get.log
grep "Requests/sec" couch-test-get.log
vkuznet commented 5 months ago

Alan, I addressed all your requests in the table above; do you need anything else here?

amaltaro commented 5 months ago

Thanks Valentin.

As mentioned in this comment: https://github.com/dmwm/WMCore/issues/11567#issuecomment-2058072321, let us now persist it in the official wmcore-docs documentation, so that this precious information does not get lost in github tickets.

vkuznet commented 5 months ago

Done, please see https://gitlab.cern.ch/dmwm/wmcore-docs/-/merge_requests/30

amaltaro commented 4 months ago

From the table above, the performance difference is close to 0 for GET requests, while the RPM-based deployment is around 10% faster for POST requests. My conclusion is that containerized CouchDB will have no performance impact. Thank you for the documentation and evaluation, Valentin. Closing this out.