datahubio / datahub-v2-pm

Project management (issues only)

Analyse frontend and other services CPU/RAM usage #197

Closed: zelima closed this issue 6 years ago

zelima commented 6 years ago

As the datahub team, we want to know whether any of our services (especially frontend and specstore) is exceeding its allocated CPU or RAM, so that we know for sure that services were not going down for that reason, or, if they were, we can adjust the allocation so that it does not happen again.

Acceptance Criteria

Tasks

Analysis

Set up a dashboard with some useful charts, using Google Stackdriver, so that we can analyze how our services are performing.

We have 2 Kubernetes clusters, one for the testing environment and another for production. Each cluster has 3 nodes with 2 vCPUs and 7.5 GB (2 x 3.75 GB) of memory per node, making 22.50 GB of total memory per cluster.

Testing vs. production

Looking at the testing and production graphs separately:

Specstore vs. other services

We have a problem with allocating resources for the specstore service. While it has 1200m of CPU and 2000 Mi of memory allocated, that still does not seem to be enough. It uses >80% of its CPU quite often, and we usually end up with errors while processing datasets. Below is the graph of CPU usage by specstore (three different colors = 3 re-deploys of specstore because it got stuck). As you can see, the other services are barely visible on the graph.

[Graph: GKE container CPU usage for datahub-production]
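
For a quick cross-check outside Stackdriver, the same numbers can be pulled straight from the cluster. A minimal sketch, assuming kubectl access and that cluster metrics (heapster / metrics-server) are enabled:

kubectl top pods                    # current CPU/memory consumption per pod
kubectl top nodes                   # the same per node
kubectl describe nodes | grep -A 8 "Allocated resources"    # requests/limits vs. node capacity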

Another interesting graph: here you can see that disk space usage keeps growing until the service is redeployed for various reasons.

[Graph: GKE container disk usage for datahub-production]

This makes me think that we are saving things to disk and never deleting them for some reason. E.g. processing of this dataset crashed due to lack of memory: https://api.datahub.io/source/sports-data/atp-world-tour-tennis-data/2

Conclusion and solution

  1. Resources for the testing cluster are over-allocated. We can downgrade it to 3 nodes with 1 CPU each (or even 2) and use the freed resources for production.
  2. Dedicate 1 node specifically to specstore (see the sketch after this list):
     2.1 Increase the CPU allocated to that node
     2.2 Increase its memory
  3. Double the replicas for specstore.
  4. Investigate why disk space is constantly increasing (temporary files are not deleted?)
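
For item 2, a rough sketch of how a node could be dedicated to specstore with labels and taints (the node name below is a placeholder, and the specstore deployment would also need a matching nodeSelector and toleration):

# mark one node as reserved for specstore
kubectl label nodes gke-production-default-pool-xxxx dedicated=specstore
kubectl taint nodes gke-production-default-pool-xxxx dedicated=specstore:NoSchedule
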
akariv commented 6 years ago

@zelima agree with 1, 2 & 4. Would wait with 3 until we understand better the rest of the issues.

zelima commented 6 years ago

@akariv OK.

re 1: looking into the Gcloud docs, it seems there's no easy way to decrease the number of cores per node; we can only increase/decrease the number of nodes in the cluster.

If we really want to go that way, we can create a new cluster with 3 nodes of 1 core each, delete the old one, and deploy the services there.

Alternatively, we can enable the Cluster Autoscaling feature and let it take care of efficiently allocating resources. https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler

Kubernetes Engine's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, Kubernetes Engine automatically adds a new node to your cluster if you've created new Pods that don't have enough capacity to run; conversely, if a node in your cluster is underutilized and its Pods can be run on other nodes, Kubernetes Engine can delete the node.
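
If we go this way, enabling it should be roughly a one-liner; a sketch, with the cluster name, node pool and limits as placeholders:

gcloud container clusters update datahub-production \
    --enable-autoscaling --min-nodes 1 --max-nodes 5 \
    --node-pool default-pool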

What do you think?

zelima commented 6 years ago

Analysing disk space usage for specstore:

Total disk space

df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  94.3G     21.6G     72.6G  23% /
tmpfs                     3.7G         0      3.7G   0% /dev
tmpfs                     3.7G         0      3.7G   0% /sys/fs/cgroup
/dev/sda1                94.3G     21.6G     72.6G  23% /dev/termination-log
/dev/sda1                94.3G     21.6G     72.6G  23% /etc/resolv.conf
/dev/sda1                94.3G     21.6G     72.6G  23% /etc/hostname
/dev/sda1                94.3G     21.6G     72.6G  23% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                     3.7G     12.0K      3.7G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                     3.7G         0      3.7G   0% /proc/kcore
tmpfs                     3.7G         0      3.7G   0% /proc/timer_list
tmpfs                     3.7G         0      3.7G   0% /sys/firmware

Size of the /tmp folder

du -h /tmp

...
4.0K    /tmp/tmp4kah28oh
4.0K    /tmp/tmpng5hq1jn
7.5G    /tmp

I did a bit of research on how the /tmp directory gets cleaned up, and it seems this happens only on reboots (see these comments). Since the container is never rebooted after it boots up, I guess the files remain there forever?

The files in the /tmp dir are of all kinds: .zip, .csv, .json and .md. Some of them are as big as >200 MB. Basically, everything we process ends up there.
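
To get a feel for how much of this is stale, something like the following could be run inside the container (the age/size thresholds are arbitrary, and busybox's find may only support a subset of these flags):

find /tmp -type f -mtime +1 -size +50M -exec ls -lh {} \;    # files older than a day and bigger than ~50 MB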

Possible solutions

The places we create temp files:

We can create a new processor, delete.from_path, that will try to erase everything under the given path, and run this processor at the very end of the pipeline.
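
Until such a processor exists, the cleanup it would do boils down to something like the following, which could also run as a cron job or as a final shell step in the container (the path and age threshold are illustrative):

# delete files under the given path that are older than an hour
CLEAN_PATH=/tmp
find "$CLEAN_PATH" -mindepth 1 -mmin +60 -delete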

akariv commented 6 years ago

iirc, when using dpp-runner, it allocates a temp directory and it's supposed to remove it after processing is done. I guess that this is not working... I will research it (let's open a separate issue for that)

akariv commented 6 years ago

re autoscaler - I wouldn't necessarily go there. I'm pretty sure we can add smaller nodes to the existing cluster and then delete the bigger nodes, without having to remove the cluster.
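
Roughly, that migration would look something like this (pool names, machine type and node names are placeholders):

# add a pool of smaller nodes to the existing cluster
gcloud container node-pools create small-pool \
    --cluster datahub-testing --machine-type n1-standard-1 --num-nodes 3

# move workloads off an old node, then drop the old pool
kubectl drain gke-datahub-testing-default-pool-xxxx --ignore-daemonsets --delete-local-data
gcloud container node-pools delete default-pool --cluster datahub-testing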

zelima commented 6 years ago

@akariv OK so

zelima commented 6 years ago

Looking at the node performance right now (while the big dataset is being processed), we still hit the 100% mark for CPU utilization.

Even if this dataset succeeds (which I doubt), I think the problem still remains.

So I suggest moving from 3 x 2-core nodes to 1 x 4-core node + 2 x 1-core nodes. The two small ones are pretty much enough for the rest of the services, and the big one will mainly be used by specstore.

Moving to 1 x 4-core node + 1 x 2-core node for production.

zelima commented 6 years ago

Closing this as FIXED. We pushed the dataset that was always failing due to lack of CPU and it went fine.

For followup re