Closed mhow2 closed 4 years ago
Why is this issue marked as TODO? What work should be done on it?
What's the problem with the BULKSIZE, and why does it make ES go away?
The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed. In GrimoireLab it is set to 1000 by default (https://github.com/chaoss/grimoirelab-sirmordred/blob/master/sirmordred/config.py#L147)
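As a rough illustration of what BULKSIZE controls (the helper below is hypothetical, not the actual scava2es code), bulk indexing amounts to chunking documents into `_bulk` request bodies of at most BULKSIZE documents each:

```python
import json

BULKSIZE = 1000  # GrimoireLab default; lowering it shrinks each request payload


def bulk_chunks(docs, index, chunk_size=BULKSIZE):
    """Yield NDJSON bodies for the Elasticsearch _bulk endpoint,
    one body per chunk of at most `chunk_size` documents."""
    lines = []
    for doc in docs:
        # Each document needs an action line followed by the source line
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
        if len(lines) >= 2 * chunk_size:
            yield "\n".join(lines) + "\n"
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"

# Each yielded body would be POSTed to /<index>/_bulk; a smaller
# chunk_size means smaller requests, at the cost of more round trips.
```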
Find below the settings used in the Bitergia prd envs:

```yaml
elasticsearch:
  restart: on-failure:5
  image: bitergia/elasticsearch:6.1.0-secured
  command: elasticsearch -Enetwork.bind_host=0.0.0.0 -Ehttp.max_content_length=2000mb
  environment:
    - ES_JAVA_OPTS=-Xms2g -Xmx2g
  ports:
    - 9200:9200
```
How is it possible that about 1GB of data in Mongo turns into 34GB in ES?
It is probably due to 3 main reasons:
1. If you inspect the data in Elasticsearch, you will see that the raw data retrieved from Scava is included in each item (attribute `scava`). This is kept for debugging reasons, but it can be removed.
2. Each Scava metric is "unpacked". A metric includes several measurements (e.g., one per day), and scava2es creates a document for each of these measurements. This replicates some fields (e.g., project), which can cause some overhead.
3. Elasticsearch indexes the data (beyond storing it), which probably adds a small overhead.
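To make the first two reasons above concrete, here is a hypothetical sketch (function and field names are illustrative, not the actual scava2es code) of dropping the debug payload and of the per-measurement fan-out that replicates shared fields:

```python
def slim_item(item):
    """Return a copy of an enriched item without the raw `scava`
    payload, which is kept only for debugging."""
    return {k: v for k, v in item.items() if k != "scava"}


def unpack_metric(project, metric_id, measurements):
    """One ES document per measurement, replicating the shared fields.
    This fan-out is how a compact Mongo metric can grow into many
    Elasticsearch documents."""
    return [
        {"project": project, "metric": metric_id,
         "date": m["date"], "value": m["value"]}
        for m in measurements
    ]
```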
Sorry for the TODO, I've not set it myself, I just assigned the dashboard project.
So I could tweak the length to 2000mb and raise the BULKSIZE a little.
The question is, how do I do that without losing all the data in ES? If you change the command
in the docker-compose file, it's not taken into account until you recreate the container (or am I missing something?), so I think I'm going to edit the files directly in the container.
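One common way around this (a sketch, not the setup from this thread) is to keep the index data on a named volume, so recreating the container after a command change doesn't wipe the indexes. This assumes the image stores data in the stock `/usr/share/elasticsearch/data` location:

```yaml
# Hypothetical fragment: with a named volume, `docker-compose up -d`
# can recreate the container while the index data survives.
elasticsearch:
  image: bitergia/elasticsearch:6.1.0-secured
  command: elasticsearch -Enetwork.bind_host=0.0.0.0 -Ehttp.max_content_length=2000mb
  volumes:
    - es-data:/usr/share/elasticsearch/data

volumes:
  es-data:
```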
> Sorry for the TODO, I've not set it myself, I just assigned the dashboard project.
No worries!
> The question is, how do I do that without loosing all the data in ...
What do you think about dumping and re-uploading the Elasticsearch data? I can share a script.
- Upload Elasticsearch indexes: https://gist.github.com/valeriocos/c5dd78b4e06cd73477e1873455f4f585
- Dump Elasticsearch indexes: https://gist.github.com/valeriocos/b9ac1bb70f7ff981554f3677582b0c9f
Thanks Valerio. Roughly speaking, the import rate to ES is about 51MB/min, which is not that bad... As you already know, the bigger issue is that the whole data is uploaded every time we need to update the dashboards (btw, the script is configured to run every 5 min by default :D)
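As a quick sanity check, the ~51MB/min figure is roughly consistent with the 11-hour, 34GB run described earlier:

```python
# Cross-check the reported import rate against the 11-hour run
total_mb = 34 * 1024   # ~34 GB stored in ES
minutes = 11 * 60      # ~11 hours of importing
rate = total_mb / minutes
print(round(rate, 1))  # ≈ 52.8 MB/min, close to the observed ~51 MB/min
```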
Performance fixed with https://github.com/crossminer/scava/pull/377
15 minutes to import the projects to ES. What else can I say?
Context: Recently, I noticed that the importer didn't make it to the end because ES was going away with the following error:

```
2019-09-24 16:16:18,195 Retrying (Retry(total=20, connect=21, read=7, redirect=5, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /scava-metrics/items/_bulk?refresh=true
```
I've been able to circumvent the problem by lowering the BULKSIZE value to 50. The import then ran to completion: it took about 11 hours to import all the project metrics to ES, and they represent about 34GB of data (!).
So I'm wondering about the following: 1) what's the problem with the BULKSIZE, and why does it make ES go away? 2) how is it possible that about 1GB of data in Mongo turns into 34GB in ES?
PS: I also tweaked the following options:

```
-Ehttp.max_content_length=1000mb
-Xms4g -Xmx4g
```

but I am unsure whether they have something to do with the problem; they just reduced the probability of the importer failing.