Another metrics screenshot, this time including memory. The CPU load could be caused by garbage-collection activity. Note that the CC api VMs meanwhile run on 4-core machines (before: 2 cores), which explains why the CPU percentage is now lower: 100% CPU means all cores of the VM are fully used, and one Ruby process can use only a single core (Ruby executes application code single-threaded).
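As a hedged illustration (not part of the original measurements): one way to confirm on an api VM that a single Ruby process saturates one core is to compare per-process CPU with per-core utilization. A single Ruby process pegged near 100% while the 4-core VM shows only ~25% total is consistent with single-core saturation.

```bash
# List the most CPU-hungry processes and filter for Ruby
# (Cloud Controller runs as Ruby processes on the api VM).
ps -eo pid,pcpu,comm --sort=-pcpu | grep -i ruby | head

# Per-core utilization over a 5-second window; requires the sysstat package.
mpstat -P ALL 5 1
```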
We've set up a Cloud Foundry landscape with bits-service on AliCloud to check which load the bits-service can handle. Test setup:
- cf-deployment v12.39.0
- All CF VM sizes are vm_2cpu_4gb (except the Diego cells, which use vm_2cpu_8gb)
- Number of "api" nodes: 4
- Number of "api_workers": 2
- Number of "bits-service" instances: 2
We've pushed a number of test applications which contain larger resource files. The content of the files is randomly generated to make sure that no caching influences the result. Here is the test script:
```bash
#!/bin/bash
# Kill all background push loops when the script exits (e.g. on ctrl-c).
trap "kill 0" EXIT

APP_DOMAIN="<TODO> insert app domain"
NUMBER_APPS=10
RESOURCE_FILE_SIZE_MB=100

# The (...) body runs each push loop in its own subshell.
function push_app() (
  i=$1
  while :
  do
    echo "$i $(date) Pushing cf app test$i ..."
    mkdir -p test$i
    pushd test$i > /dev/null
    touch Staticfile
    echo "Hello, World $i" > index.html
    # Regenerate the resource file with random content on every iteration
    # so that no caching influences the result.
    rm -f resource$i
    dd if=/dev/urandom of=resource$i bs=1M count=$RESOURCE_FILE_SIZE_MB
    cf push test$i --hostname test$i -d $APP_DOMAIN -m 64M > /dev/null
    popd > /dev/null
  done
)

for ((i = 1; i <= NUMBER_APPS; i++))
do
  echo "Starting subprocess $i"
  push_app $i &
done

echo "Waiting for ctrl c"
wait
```
To run the script, log on to CF and create a new space. Insert the domain name for the test apps (see the TODO above), adapt the number of apps and the size of the resource file, and execute the script. It pushes CF apps in NUMBER_APPS parallel loops; an example invocation follows below.
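A minimal preparation sketch, assuming placeholder names (`api.example.com`, org `my-org`, space `load-test`, and a hypothetical file name `push-load-test.sh` for the script above):

```bash
# Log on and create a dedicated space for the load test.
cf login -a https://api.example.com    # API endpoint is a placeholder
cf create-space load-test -o my-org
cf target -o my-org -s load-test

# Run the load test script from above (file name chosen here for illustration).
./push-load-test.sh
```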
Our test results:
NUMBER_APPS | RESOURCE_FILE_SIZE_MB | CC overall requests completed per sec | CC average CPU load | CC average memory |
---|---|---|---|---|
10 | 10 MB | 19 | 12% | 33% |
10 | 100 MB | 10 | 10% | 37% |
30 | 10 MB | 35 | 17% | 37% |
30 | 100 MB | 20 | 21% | 40% |
50 | 10 MB | 45 | 23% | 38% |
50 | 100 MB | 25 | 23% | 40% |
We did not encounter any CC problems during the test runs. All CF pushes were completed successfully. Cloud Controller was not overloaded and the two bits-service instances were also healthy.
We will repeat the test series with CC and fog-aliyun so that we can compare numbers.
We've repeated the test with fog-aliyun; here are the results:
NUMBER_APPS | RESOURCE_FILE_SIZE_MB | CC overall requests completed per sec | CC average CPU load | CC average memory |
---|---|---|---|---|
10 | 10 MB | 5 | 36% | 73% |
10 | 100 MB | 5 | 39% | 80% |
30 | 10 MB | failed (see errors below) | - | - |

With 30 parallel pushes the landscape failed, and the pushes returned errors such as "timeout connecting to log server, no log will be shown", "The app is invalid: VCAP::CloudController::BuildCreate::StagingInProgress", and "cf apps: Server error, status code: 502, error code: 0".
The test results above were measured with fog-aliyun 0.3.10, not yet with the improved 0.3.11 that addresses the memory issues.
We've repeated the load tests with fog-aliyun 0.3.15.
Test setup: CF v12.39.0 with 4 api and 2 cc-worker nodes of type "vm_2cpu_4gb". The load test runs NUMBER_APPS parallel "cf push" processes; each app has a randomly generated resource file of RESOURCE_FILE_SIZE_MB megabytes.
AliCloud landscape with fog-aliyun 0.3.15:
NUMBER_APPS | RESOURCE_FILE_SIZE_MB | CC overall req/sec | CC avg CPU load | CC avg memory | Worker avg CPU | Worker avg memory |
---|---|---|---|---|---|---|
10 | 10 MB | 7 | 43% | 75% | 34% | 72% |
10 | 100 MB | 7 | 38% | 75% | 26% | 72% |
30 | 10 MB | 12 | 66% | 82% | 40% | 72% |
30 | 100 MB | 17 | 79% | 80% | 51% | 74% |
50 | 10 MB | 15 | 75% | 80% | 92% | 75% |
During the last two test runs, the landscape started to fail (CC became unresponsive). In our dashboards we could see the following CC worker jobs queuing up: BlobstoreDelete, DeleteExpiredDropletBlob, DeleteExpiredPackageBlob.
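As a hedged sketch (not how the numbers above were collected): Cloud Controller persists its background jobs via delayed_job in the CCDB, so one way to watch such a backlog is to query the `delayed_jobs` table directly. Connection details and credentials below are placeholders, and the column names follow the delayed_job convention.

```bash
# Count pending (not yet locked, not failed) CC background jobs per queue.
# Host, user, and database name are placeholders for your landscape.
mysql -h ccdb.example.com -u ccadmin -p ccdb -e "
  SELECT queue, COUNT(*) AS pending
  FROM delayed_jobs
  WHERE failed_at IS NULL AND locked_at IS NULL
  GROUP BY queue;"
```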
To get numbers for comparison, we've run the same tests on an AWS landscape with fog-aws 2.0.1:
NUMBER_APPS | RESOURCE_FILE_SIZE_MB | CC overall req/sec | CC avg CPU load | CC avg memory | Worker avg CPU | Worker avg memory |
---|---|---|---|---|---|---|
10 | 10 MB | 17 | 21% | 40% | 19% | 42% |
10 | 100 MB | 12 | 33% | 42% | 17% | 42% |
30 | 10 MB | 43 | 48% | 43% | 23% | 43% |
30 | 100 MB | 30 | 60% | 44% | 18% | 44% |
50 | 10 MB | 55 | 66% | 44% | 22% | 44% |
50 | 100 MB | 50 | 65% | 44% | 22% | 44% |
We can see that the memory usage is almost constant. CPU usage increases with the load, but the landscape remained responsive.
We've repeated the performance tests with fog-aliyun 0.3.17 and the same setup as described in the previous comment:
NUMBER_APPS | RESOURCE_FILE_SIZE_MB | CC overall req/sec | CC avg CPU load | CC avg memory | Worker avg CPU | Worker avg memory |
---|---|---|---|---|---|---|
10 | 10 MB | 17 | 18% | 45% | 16% | 52% |
10 | 100 MB | 11 | 18% | 47% | 15% | 52% |
30 | 10 MB | 11 | 39% | 46% | 17% | 52% |
30 | 100 MB | 27 | 26% | 45% | 25% | 70% |
50 | 10 MB | 58 | 51% | 46% | 18% | 69% |
50 | 100 MB | 50 | 41% | 46% | 17% | 69% |
Worker memory load is a little higher than on AWS, but the other numbers are comparable or even better.
@jochenehret Thanks for your feedback. We will continue to improve its performance, and the next version will resolve the "Worker memory load is a little higher than on AWS" issue.
When we switched the Cloud Controller of a CF landscape with "some usage" from bits-service to fog, we observed a drastic increase in the CPU usage of the Cloud Controller's Ruby processes (see attached monitoring screenshot). We did not observe such an increase on AWS, Azure, or GCP, where we also switched from bits-service to fog. The CF usage (number of cf push operations) should be comparable across the different IaaS providers.