dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Evaluate performance of CouchDB in host vs container mode #11567

Closed: amaltaro closed this issue 4 months ago

amaltaro commented 1 year ago

Impact of the new feature: WMAgent

Is your feature request related to a problem? Please describe. As part of running WMAgent in a container environment, composed with database containers as well, we need to perform load/stress tests to evaluate the performance of the CouchDB container.

Describe the solution you'd like Come up with a reliable and meaningful setup to evaluate the performance (latency, throughput, etc.) of CouchDB in two deployment modes:

- host mode (CouchDB installed and running directly on the node)
- container mode (CouchDB running in a docker container)

To be provided with this issue:

Describe alternatives you've considered None

Additional context Depends on: https://github.com/dmwm/WMCore/issues/11312 Part of the following meta issue: https://github.com/dmwm/WMCore/issues/11314

vkuznet commented 5 months ago

For the tests below I used the stock CouchDB release obtained from the official site and Docker Hub; the version is 3.3.3.

Docker setup
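
The concrete docker commands were collapsed in the original comment. A minimal sketch of how the stock CouchDB 3.3.3 container is typically started; the admin credentials, data path and container name are illustrative:

    # pull the official CouchDB image from Docker Hub
    docker pull couchdb:3.3.3

    # start it with admin credentials, the standard port exposed,
    # and a host directory mounted for the database files
    docker run -d --name couchdb \
        -e COUCHDB_USER=admin -e COUCHDB_PASSWORD=password \
        -p 5984:5984 \
        -v /wma/vk/CouchDB/data:/opt/couchdb/data \
        couchdb:3.3.3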

CouchDB initialization
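
The initialization steps were also collapsed; based on the later note about setting up `_users` and other databases, this is roughly what a single-node initialization over the HTTP API looks like (credentials are the same illustrative ones as above):

    # create the system databases required by a single-node CouchDB
    curl -X PUT http://admin:password@localhost:5984/_users
    curl -X PUT http://admin:password@localhost:5984/_replicator
    curl -X PUT http://admin:password@localhost:5984/_global_changes

    # create the database used by the load/stress tests below
    curl -X PUT http://admin:password@localhost:5984/test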

Load/stress tests

perform load/stress test

    hey -n 200 -c 50 -m POST -H "Content-Type: application/json" -D /path/wm.json -disable-keepalive

    /afs/cern.ch/user/v/valya/public/hey_linux -n 200 -c 50 -m POST -H "Content-Type: application/json" -D /wma/vk/CouchDB/wm.json -disable-keepalive -disable-compression http://admin:password@localhost:5984/test 2>&1 1>& log


Results are the following:

    69 requests done.
    182 requests done.
    All requests done.

    Summary:
      Total:        1.0729 secs
      Slowest:      0.4308 secs
      Fastest:      0.0258 secs
      Average:      0.2296 secs
      Requests/sec: 186.4051
      Total data:   19000 bytes
      Size/request: 95 bytes

    Status code distribution:
      [201] 200 responses

    Response time histogram:
      0.026 [1]  |∎
      0.066 [31] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.107 [0]  |
      0.147 [12] |∎∎∎∎∎∎∎∎∎∎∎
      0.188 [8]  |∎∎∎∎∎∎∎
      0.228 [44] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.269 [36] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.309 [17] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.350 [17] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.390 [27] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.431 [7]  |∎∎∎∎∎∎

    Latency distribution:
      10% in 0.0424 secs
      25% in 0.1867 secs
      50% in 0.2357 secs
      75% in 0.3330 secs
      90% in 0.3828 secs
      95% in 0.3859 secs
      99% in 0.4307 secs


### Local CouchDB setup

installation instructions

    sudo yum install -y yum-utils
    sudo yum-config-manager --add-repo https://couchdb.apache.org/repo/couchdb.repo
    sudo dnf config-manager --set-enabled crb
    sudo dnf install epel-release epel-next-release
    sudo yum install -y mozjs78
    sudo yum install -y couchdb

initial setup, see

https://docs.couchdb.org/en/latest/install/unix.html#installation-using-the-apache-couchdb-convenience-binary-packages

enable admin login name and password

sudo vim /opt/couchdb/etc/local.ini
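
A sketch of the relevant edit, assuming the admin user name and password are illustrative (CouchDB hashes the plain-text password on its first start):

    # append an admin account to the CouchDB configuration
    sudo tee -a /opt/couchdb/etc/local.ini > /dev/null <<'EOF'
    [admins]
    admin = password
    EOF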

start local couchdb

sudo -i -u couchdb /opt/couchdb/bin/couchdb
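
A quick way to verify the server is up is to hit the root endpoint, which returns the welcome document with the server version:

    # expected output is something like {"couchdb":"Welcome","version":"3.3.3",...}
    curl http://localhost:5984/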

At this step I performed the `CouchDB initialization` steps to set up `_users` and the other databases.
Finally, I repeated the steps listed in `Load/stress tests` and got the following results:

    All requests done.

    Summary:
      Total:        0.0543 secs
      Slowest:      0.0227 secs
      Fastest:      0.0045 secs
      Average:      0.0123 secs
      Requests/sec: 3683.9033
      Total data:   19000 bytes
      Size/request: 95 bytes

    Status code distribution:
      [201] 200 responses

    Response time histogram:
      0.004 [1]  |∎
      0.006 [9]  |∎∎∎∎∎∎∎∎∎
      0.008 [20] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.010 [32] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.012 [37] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.014 [18] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.015 [42] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.017 [28] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
      0.019 [2]  |∎∎
      0.021 [3]  |∎∎∎
      0.023 [8]  |∎∎∎∎∎∎∎∎

    Latency distribution:
      10% in 0.0077 secs
      25% in 0.0093 secs
      50% in 0.0118 secs
      75% in 0.0151 secs
      90% in 0.0163 secs
      95% in 0.0206 secs
      99% in 0.0225 secs



### Results comparison
As can be seen from the benchmark results above:
- local CouchDB
  - almost 3700 requests per second for POST HTTP calls
- docker CouchDB
  - below 200 requests per second for POST HTTP calls

### TODO
- set up both instances, i.e. local CouchDB and docker CouchDB, and run the tests from outside the node, i.e. run the hey tool from another node to add network latency (a sketch of such a remote run is shown after this list)
- try different HTTP benchmarks, i.e. POST and GET
- vary the number of concurrent calls
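
As a sketch of the first TODO item, running the same benchmark from another node only requires pointing the client at the CouchDB host instead of localhost (the host name below is a placeholder):

    # run the hey client from a different node against the CouchDB host
    /afs/cern.ch/user/v/valya/public/hey_linux -n 200 -c 50 -m POST \
        -H "Content-Type: application/json" \
        -D wm.json \
        -disable-keepalive -disable-compression \
        http://admin:password@couchdb-host.cern.ch:5984/test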

### References
[1] https://hub.docker.com/_/couchdb/
[2] https://github.com/rakyll/hey, or the vkuznet fork patched to support X509: https://github.com/vkuznet/hey
amaltaro commented 5 months ago

Scary results! However, I would highly recommend using the products that we will actually be using, instead of the upstream ones.

That said, I would suggest testing the current COMP RPM couchdb package against the wmagent-couchdb in CMSKubernetes. Ideally these tests should be performed in the very same environment as well (including the node); otherwise the comparison is hard to digest.

vkuznet commented 5 months ago

I set up CouchDB (3.2.2) on one of my VMs using COMP RPMs; in fact, I simply used rsync to copy the /data/srv area from one of the WMAgent CERN nodes (a sketch of that copy is shown below). Then I reran the hey test against the local CouchDB; the result was 2318 req/sec. Then, using the wmagent-couchdb image, I got a similar result of 2379 req/sec.
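
For the record, that copy amounts to something along the following lines, with the source node name as a placeholder:

    # copy the deployed CouchDB area from an existing WMAgent node
    rsync -av wmagent-node.cern.ch:/data/srv/ /data/srv/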

To sum up:

At this step, I do not know whether the slowness of the stock CouchDB image (3.3.3) is due to running it on RH9 or due to the image content itself. But I'm happy to see comparable performance on CC7 between the local CouchDB installed via RPM and our wmagent-couchdb docker image.

vkuznet commented 5 months ago

Another test performed with the wmagent-couchdb image on an RH9 node:

  1. docker image without host network
    
    # CouchDB server
    docker run -it -e COUCHDB_USER=admin -e COUCHDB_PASSWORD=password -p 5984:5984 --volume /wma/vk/CouchDB/data:/opt/couchdb/data -v /wma/vk/secrets:/data/admin/wmagent registry.cern.ch/cmsweb/wmagent-couchdb

hey client

/afs/cern.ch/user/v/valya/public/hey_linux -n 200 -c 50 -m POST -H "Content-Type: application/json" -D /wma/vk/CouchDB/wm.json -disable-keepalive -disable-compression http://admin:password@localhost:5984/test

Results (3 iterations of the hey client):
- 1282 req/sec
- 1620 req/sec
- 1640 req/sec

Average: 1514 req/sec

2. docker image with host network (use the `--network=host` option):

CouchDB server

docker run --network=host -it -e COUCHDB_USER=admin -e COUCHDB_PASSWORD=password -p 5984:5984 --volume /wma/vk/CouchDB/data:/opt/couchdb/data -v /wma/vk/secrets:/data/admin/wmagent registry.cern.ch/cmsweb/wmagent-couchdb

hey client

/afs/cern.ch/user/v/valya/public/hey_linux -n 200 -c 50 -m POST -H "Content-Type: application/json" -D /wma/vk/CouchDB/wm.json -disable-keepalive -disable-compression http://admin:password@localhost:5984/test


Results (3 iterations of the hey client):
- 1657 req/sec
- 1953 req/sec
- 1832 req/sec

Average: 1814 req/sec

### Summary
There is a slight performance difference (based on HTTP POST requests) between the container network and the host network, where the latter provides more throughput. But in both cases the performance is quite decent, above 1500 req/sec, and I doubt the difference (around 300 req/sec higher in the host-network case) will have any impact on WM operations.

vkuznet commented 5 months ago

@amaltaro do you have any other suggestions for testing based on the provided results?

amaltaro commented 5 months ago

That is very good, thanks Valentin! Do I understand it right that you tested both GET and POST calls to CouchDB?

I think it's important to keep track of this evaluation and the results in our wmcore-docs repository. Please also be explicit with the:

anpicci commented 5 months ago

@vkuznet I would suggest keeping an eye on the progress of issue #11635, to check whether it has an impact on the tests documented in this issue.

vkuznet commented 5 months ago

Here is another summary in table format, where concurrency `-n 200 -c 50` means 200 requests using 50 concurrent clients:

using the wmagent-couchdb docker image

| iteration | Couch setup | Linux OS | deployment | Test method | concurrency | Req/sec |
| --- | --- | --- | --- | --- | --- | --- |
| round 1 | No host | RH9 | image | POST | -n 200 -c 50 | 1198 |
| round 1 | No host | RH9 | image | GET | -n 200 -c 50 | 1601 |
| round 2 | No host | RH9 | image | POST | -n 200 -c 50 | 858 |
| round 2 | No host | RH9 | image | GET | -n 200 -c 50 | 1688 |
| round 3 | No host | RH9 | image | POST | -n 200 -c 50 | 1019 |
| round 3 | No host | RH9 | image | GET | -n 200 -c 50 | 1751 |
| average | No host | RH9 | image | POST | -n 200 -c 50 | 1025 |
| average | No host | RH9 | image | GET | -n 200 -c 50 | 1680 |
| round 1 | host network | RH9 | image | POST | -n 200 -c 50 | 1636 |
| round 1 | host network | RH9 | image | GET | -n 200 -c 50 | 3013 |
| round 2 | host network | RH9 | image | POST | -n 200 -c 50 | 2355 |
| round 2 | host network | RH9 | image | GET | -n 200 -c 50 | 3597 |
| round 3 | host network | RH9 | image | POST | -n 200 -c 50 | 2390 |
| round 3 | host network | RH9 | image | GET | -n 200 -c 50 | 2908 |
| average | host network | RH9 | image | POST | -n 200 -c 50 | 2127 |
| average | host network | RH9 | image | GET | -n 200 -c 50 | 3172 |

using RPM deployment on CC7

| iteration | Couch setup | Linux OS | deployment | Test method | concurrency | Req/sec |
| --- | --- | --- | --- | --- | --- | --- |
| round 1 | host | CC7 | RPM | POST | -n 200 -c 50 | 2131 |
| round 1 | host | CC7 | RPM | GET | -n 200 -c 50 | 2545 |
| round 2 | host | CC7 | RPM | POST | -n 200 -c 50 | 2246 |
| round 2 | host | CC7 | RPM | GET | -n 200 -c 50 | 3343 |
| round 3 | host | CC7 | RPM | POST | -n 200 -c 50 | 2715 |
| round 3 | host | CC7 | RPM | GET | -n 200 -c 50 | 3593 |
| average | host | CC7 | RPM | POST | -n 200 -c 50 | 2364 |
| average | host | CC7 | RPM | GET | -n 200 -c 50 | 3160 |

### Summary

Based on the provided results, we see little difference in the average numbers between the RPM deployment and the docker image using host-network deployment. But using the docker image without the host network degrades the performance of both POST and GET requests by a factor of 2 on the RH9 host.

### References

Here is the shell script used to generate all the tests above:

#!/bin/bash
curl -X DELETE http://login:password@127.0.0.1:5984/test
curl -X PUT http://login:password@127.0.0.1:5984/test
file=/afs/cern.ch/user/v/valya/public/wm.json

# insert one document and get its document id (jq -r strips the surrounding quotes so the id can be used directly in the GET URL below)
did=`curl -s -X POST http://login:password@127.0.0.1:5984/test -H "Content-Type: application/json" -d@$file | jq -r '.id'`

# perform POST tests
echo "POST test"
/afs/cern.ch/user/v/valya/public/hey_linux \
    -n 200 -c 50 -m POST \
    -H "Content-Type: application/json" \
    -D $file \
    -disable-keepalive \
    -disable-compression \
    http://login:password@localhost:5984/test 2>&1 1>& couch-test-post.log
grep "Requests/sec" couch-test-post.log

# perform get tests
echo ""
echo "GET test with id=$did"
/afs/cern.ch/user/v/valya/public/hey_linux \
    -n 200 -c 50 -m GET \
    -disable-keepalive \
    -disable-compression \
    "http://login:password@localhost:5984/test/$did" 2>&1 1>& couch-test-get.log
grep "Requests/sec" couch-test-get.log
vkuznet commented 5 months ago

Alan, I addressed all your requests in the table above; do you need anything else here?

amaltaro commented 5 months ago

Thanks Valentin.

As mentioned in this comment: https://github.com/dmwm/WMCore/issues/11567#issuecomment-2058072321, let us now persist it in the official wmcore-docs documentation, so that this precious information does not get lost in github tickets.

vkuznet commented 5 months ago

Done, please see https://gitlab.cern.ch/dmwm/wmcore-docs/-/merge_requests/30

amaltaro commented 4 months ago

From the table above, the performance difference is close to 0 for GET requests, while the RPM-based deployment is around 10% faster for POST requests. My conclusion is that containerized CouchDB will have no performance impact. Thank you for the documentation and evaluation, Valentin. Closing this out.