datahubio / datahub-v2-pm

Project management (issues only)

Analyse frontend and other services CPU/RAM usage #197

Closed: zelima closed this issue 6 years ago

zelima commented 6 years ago

As the datahub team, we want to know whether any of our services (especially frontend and specstore) is exceeding its allocated CPU or RAM, so that we know for sure that services were not going down for that reason, or, if they were, we can adjust the allocation so that it does not happen again.

Acceptance Criteria

Tasks

Analysis

Set up a dashboard with some useful charts, using Google Stackdriver, so that we can analyze how our services are performing.

We have 2 Kubernetes clusters, one for the testing environment and another for production. Each cluster has 3 nodes with 2 vCPUs and 7.5 GB (2 x 3.75 GB) of memory per node, making 22.50 GB of total memory per cluster.

Testing vs. production

Looking at the testing and production graphs separately:

Specstore vs. other services

We have a problem with allocating resources for the specstore service. While it has 1200m of CPU and 2000 Mi of memory allocated, that still does not seem to be enough. It uses >80% of its CPU quite often, and we usually end up with errors while processing datasets. Below is the graph of CPU usage by specstore (three different colors = 3 re-deploys of specstore because it got stuck). As you can see, the other services are barely visible on the graph.

[Graph: GKE container CPU usage for datahub-production]
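
For a quick cross-check outside Stackdriver, the same numbers can be pulled straight from the cluster. A minimal sketch, assuming kubectl access and that cluster metrics (heapster / metrics-server) are enabled:

kubectl top pods                    # current CPU/memory consumption per pod
kubectl top nodes                   # the same per node
kubectl describe nodes | grep -A 8 "Allocated resources"    # requests/limits vs. node capacity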

Another interesting graph: here you can see that disk space usage keeps growing until the service is redeployed for various reasons.

[Graph: GKE container disk usage for datahub-production]

This makes me think that we are saving things to disk and never deleting them for some reason. E.g. processing of this dataset crashed due to lack of memory: https://api.datahub.io/source/sports-data/atp-world-tour-tennis-data/2

Conclusion and solution

  1. Resources for the testing cluster are over-allocated. We can downgrade it to 3 nodes with 1 CPU each (or even 2) and use the freed resources for production.
  2. Dedicate 1 node specifically to specstore (see the sketch after this list):
     2.1 Increase the CPU allocated to that node
     2.2 Increase its memory
  3. Double the replicas for specstore.
  4. Investigate why disk space is constantly increasing (temporary files are not deleted?)
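
For item 2, a rough sketch of how a node could be dedicated to specstore with labels and taints (the node name below is a placeholder, and the specstore deployment would also need a matching nodeSelector and toleration):

# mark one node as reserved for specstore
kubectl label nodes gke-production-default-pool-xxxx dedicated=specstore
kubectl taint nodes gke-production-default-pool-xxxx dedicated=specstore:NoSchedule
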
akariv commented 6 years ago

@zelima agree with 1, 2 & 4. Would wait with 3 until we understand better the rest of the issues.

zelima commented 6 years ago

@akariv OK.

re 1: looking into the Gcloud docs, it seems there's no easy way to decrease the number of cores per node; we can only increase/decrease the number of nodes in the cluster.

If we really want to go that way, we can create a new cluster with 3 nodes of 1 core each, delete the old one, and deploy the services there.

Alternatively, we can enable the Cluster Autoscaling feature and let it take care of efficiently allocating resources. https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler

Kubernetes Engine's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, Kubernetes Engine automatically adds a new node to your cluster if you've created new Pods that don't have enough capacity to run; conversely, if a node in your cluster is underutilized and its Pods can be run on other nodes, Kubernetes Engine can delete the node.
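
If we go this way, enabling it should be roughly a one-liner; a sketch, with the cluster name, node pool and limits as placeholders:

gcloud container clusters update datahub-production \
    --enable-autoscaling --min-nodes 1 --max-nodes 5 \
    --node-pool default-pool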

What do you think?

zelima commented 6 years ago

Analysing disk space usage for specstore:

Total disk space

df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  94.3G     21.6G     72.6G  23% /
tmpfs                     3.7G         0      3.7G   0% /dev
tmpfs                     3.7G         0      3.7G   0% /sys/fs/cgroup
/dev/sda1                94.3G     21.6G     72.6G  23% /dev/termination-log
/dev/sda1                94.3G     21.6G     72.6G  23% /etc/resolv.conf
/dev/sda1                94.3G     21.6G     72.6G  23% /etc/hostname
/dev/sda1                94.3G     21.6G     72.6G  23% /etc/hosts
shm                      64.0M         0     64.0M   0% /dev/shm
tmpfs                     3.7G     12.0K      3.7G   0% /run/secrets/kubernetes.io/serviceaccount
tmpfs                     3.7G         0      3.7G   0% /proc/kcore
tmpfs                     3.7G         0      3.7G   0% /proc/timer_list
tmpfs                     3.7G         0      3.7G   0% /sys/firmware

Size of the /tmp folder

du -h /tmp

...
4.0K    /tmp/tmp4kah28oh
4.0K    /tmp/tmpng5hq1jn
7.5G    /tmp

I did a bit of research on how the /tmp directory gets cleaned up, and it seems this happens only on reboots (see these comments). Since the container is never rebooted after it boots up, I guess the files remain there forever?

The files in the /tmp dir are of all kinds: .zip, .csv, .json and .md. Some of them are as big as >200 MB. Basically, everything we process ends up there.
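
To get a feel for how much of this is stale, something like the following could be run inside the container (the age/size thresholds are arbitrary, and busybox's find may only support a subset of these flags):

find /tmp -type f -mtime +1 -size +50M -exec ls -lh {} \;    # files older than a day and bigger than ~50 MB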

Possible solutions

The places we create temp files:

We can create a new processor, delete.from_path, that will try to erase everything under the given path, and run this processor at the very end of the pipeline.
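
Until such a processor exists, the cleanup it would do boils down to something like the following, which could also run as a cron job or as a final shell step in the container (the path and age threshold are illustrative):

# delete files under the given path that are older than an hour
CLEAN_PATH=/tmp
find "$CLEAN_PATH" -mindepth 1 -mmin +60 -delete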

akariv commented 6 years ago

iirc, when using dpp-runner, it allocates a temp directory and it's supposed to remove it after processing is done. I guess that this is not working... I will research it (let's open a separate issue for that)

akariv commented 6 years ago

re autoscaler - I wouldn't necessarily go there. I'm pretty sure we can add smaller nodes to the existing cluster and then delete the bigger nodes, without having to remove the cluster.
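
Roughly, that migration would look something like this (pool names, machine type and node names are placeholders):

# add a pool of smaller nodes to the existing cluster
gcloud container node-pools create small-pool \
    --cluster datahub-testing --machine-type n1-standard-1 --num-nodes 3

# move workloads off an old node, then drop the old pool
kubectl drain gke-datahub-testing-default-pool-xxxx --ignore-daemonsets --delete-local-data
gcloud container node-pools delete default-pool --cluster datahub-testing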

zelima commented 6 years ago

@akariv OK so

zelima commented 6 years ago

Looking at the node performance right now (while the big dataset is being processed), we still hit the 100% mark for CPU utilization.

Even if this dataset succeeds (which I doubt), I think the problem still remains.

So I suggest moving from 3 x 2-core nodes to 1 x 4-core node + 2 x 1-core nodes. The two small ones are pretty much enough for the rest of the services, and the big one will mainly be used by specstore.

Moving to 1 x 4-core node + 1 x 2-core node for production.

zelima commented 6 years ago

Closing this as FIXED. We pushed the dataset that was always failing due to lack of CPU and it went fine.

For followup re