Cloud-and-Distributed-Systems / Erms


CPU usage range #9

Closed SmartHypercube closed 1 year ago

SmartHypercube commented 1 year ago

I noticed that in this file https://github.com/Cloud-and-Distributed-Systems/Erms/blob/main/AE/data_hotel-reserv/offlineTestResult/latencyByPod.csv , the column cpuUsage is often between 10 and 100. I also noticed that you set some parameters related to this range.

However, after running bash AE/scripts/profiling-testing.sh -a hotel-reserv -s Recommendation on our cluster (which, strangely, finished clients 1-20 after several minutes, instead of the 9 hours stated in the readme), the cpuUsage column in latencyByPod.csv is always smaller than 1, which caused the following scripts to fail. Is this CPU usage range expected when using a different cluster? Should we change the parameters to other values?

Nick-LCY commented 1 year ago

I think your CPU usage is too low, and you may need to increase the workload. However, that shouldn't be the reason why the script terminates. Does the script print any error messages when it terminates?

SmartHypercube commented 1 year ago
  1. According to https://github.com/Cloud-and-Distributed-Systems/Erms/blob/612241f1361d82a2cf813804320a49dfe0522c45/AE/scripts/profiling-testing.py , the script is not terminating early. The loops are nested as follows:

    1. for svc in service: Since I only specified one service, this loop runs once.
    2. for repeat in range(repeats): By default repeats = 1.
    3. for cpuInstance in range(1, cpu + 1): By default cpu = 1.
    4. for memoryInstance in range(1, mem + 1): By default mem = 1.
    5. for clientNum in range(1, 21): The script did successfully loop from 1 client through 20 clients. The output looks the same as https://github.com/Cloud-and-Distributed-Systems/Erms/blob/612241f1361d82a2cf813804320a49dfe0522c45/doc/img/profiling-testing.png ; in particular, each iteration took about the same time (less than one minute), so 20 iterations took less than 20 minutes.

    So the script did everything written in it; it didn't "terminate early", yet it finished after several minutes instead of the 9 hours stated in the readme. What do you think went wrong here?

  2. As far as I can see, the workload is the same as yours. This is one row picked from our latencyByPod.csv:

    • microservice: frontend
    • pod: frontend-7fdd676695-4mkzk
    • median: 245.0
    • latency: 407.79999999999995
    • cpuUsage: 0.0314624675549403
    • memUsage: 13.3828125
    • repeat: 0
    • service: Recommendation
    • cpuInter: 0.25
    • memInter: 500
    • targetReqFreq: 56
    • reqFreq: 57.675

    This is one row picked from https://github.com/Cloud-and-Distributed-Systems/Erms/blob/612241f1361d82a2cf813804320a49dfe0522c45/AE/data_hotel-reserv/offlineTestResult/latencyByPod.csv :

    • pod: frontend-587956c9ff-nnqnl
    • median: 395.5
    • latency: 679.25
    • cpuUsage: 28.526623582498967
    • memUsage: 19.158203125
    • repeat: 0
    • service: Recommendation
    • cpuInter: 0.4
    • memInter: 800.0
    • targetReqFreq: 56
    • reqFreq: 57.575

    Note that I picked the same microservice, same service, and same targetReqFreq (so the same workload). I wonder why our cpuUsage is so low compared to yours. Could you please tell me the unit of this value? Is it 0.03 CPU cores in our data and 28.5 CPU cores in yours?

  3. Another question related to the two rows above. I wonder why your cpuInter is 0.4 and memInter is 800. According to https://github.com/Cloud-and-Distributed-Systems/Erms/blob/612241f1361d82a2cf813804320a49dfe0522c45/AE/scripts/profiling-testing.py#L266-L267 these values are impossible. Is the provided data not generated from the open-sourced code?
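
The loop structure in point 1 can be sketched as a quick sanity check (the variable names only approximate those in the script; the counts are the defaults described above):

```python
# Sketch of the nested loops described in point 1 (names approximate
# those in AE/scripts/profiling-testing.py; counts are the defaults).
services = ["Recommendation"]  # only one service was specified
repeats, cpu, mem = 1, 1, 1    # script defaults

iterations = 0
for svc in services:
    for repeat in range(repeats):
        for cpu_instance in range(1, cpu + 1):
            for memory_instance in range(1, mem + 1):
                for client_num in range(1, 21):  # 1..20 clients
                    iterations += 1

# 1 * 1 * 1 * 1 * 20 = 20 test cases, each under a minute, which
# matches a total runtime of "several minutes" rather than 9 hours.
print(iterations)  # → 20
```
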

Nick-LCY commented 1 year ago
  1. I see the problem. The script is prepared for functional evaluation, so it only demonstrates that our code can work. The "9 hours" mentioned in the readme is the average time I spent profiling a service, not the time used by this script; I will correct this information. If you are going to profile a service, I recommend the following configuration:

    • Repeats: At least 3.
    • CPU/Memory interference: 30%, 55%, 80% (of the node's CPU/memory capacity)
    • Clients: 20 is OK.

    So there should be at least 3 (CPU) × 3 (memory) × 3 (repeats) × 20 (clients) × 40 s (duration of a single test case) ≈ 6 hours. Taking into account the time consumed by things like waiting for pod deployment, it will take a total of about 9 hours.

    Besides, in the following code we define the resources consumed by each type of resource interference; you may need to adjust them based on the resources of your physical machines. Note that the maximum CPU/memory size of a single CPU/memory interference instance shouldn't exceed 1 core/4 GB; if you need to consume more resources, you need to increase the number of instances. https://github.com/Cloud-and-Distributed-Systems/Erms/blob/612241f1361d82a2cf813804320a49dfe0522c45/AE/scripts/profiling-testing.py#L199-L202

  2. When doing AE, we conducted our experiment with very limited container resources: 0.1 cores + 100 MB. So you can either increase the workload: https://github.com/Cloud-and-Distributed-Systems/Erms/blob/612241f1361d82a2cf813804320a49dfe0522c45/AE/scripts/profiling-testing.py#L143-L186 or reduce the container resources: https://github.com/Cloud-and-Distributed-Systems/Erms/blob/612241f1361d82a2cf813804320a49dfe0522c45/AE/scripts/profiling-testing.py#L192 to get a higher resource utilization.

  3. This is a mistake; the values should be the same as the resource limits of the interferences (i.e. 0.4 cores and 400 MB), and I'll correct this one. And yes, when profiling the data I didn't use this script; it is only for functional evaluation, so the profiling data is correct. Actually, the related code is already in this repository, but I don't have time to organize it right now.
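
A back-of-the-envelope check of the recommendation in point 1 (the 3×3×3×20×40 s figures come from the comment above; the 16-core/64 GB node is a purely hypothetical example for the instance-count calculation):

```python
import math

# Profiling-time estimate from the recommended configuration above:
# 3 CPU levels x 3 memory levels x 3 repeats x 20 client counts x 40 s.
cpu_levels, mem_levels, repeats = 3, 3, 3
clients, case_seconds = 20, 40
total_hours = cpu_levels * mem_levels * repeats * clients * case_seconds / 3600
print(total_hours)  # → 6.0 hours of test cases, before deployment overhead

# If a single interference instance may not exceed 1 core / 4 GB, a larger
# interference target must be spread across several instances.
# Hypothetical node: 16 cores, 64 GB; highest recommended level: 80%.
node_cores, node_mem_gb = 16, 64
target = 0.80
print(math.ceil(node_cores * target / 1))   # CPU interference instances
print(math.ceil(node_mem_gb * target / 4))  # memory interference instances
```
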

SmartHypercube commented 1 year ago

Thank you for your detailed response! It helps a lot.

Nick-LCY commented 12 months ago

Hi, I have added profiling guides to the readme, and I have also described the way I profiled the AE data (by editing config files). I can't find the config files I used at that time, but that's how I did it.