Closed mhow2 closed 4 years ago
Why is this issue marked as TODO? What work should be done on it?
What's the problem with the BULKSIZE, and why does it make ES go away?
The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed. In GrimoireLab it is set to 1000 by default (https://github.com/chaoss/grimoirelab-sirmordred/blob/master/sirmordred/config.py#L147)
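As a rough illustration of what BULKSIZE controls (the helper below is hypothetical, not the actual scava2es code), bulk indexing amounts to chunking documents into `_bulk` request bodies of at most BULKSIZE documents each:

```python
import json

BULKSIZE = 1000  # GrimoireLab default; lowering it shrinks each request payload


def bulk_chunks(docs, index, chunk_size=BULKSIZE):
    """Yield NDJSON bodies for the Elasticsearch _bulk endpoint,
    one body per chunk of at most `chunk_size` documents."""
    lines = []
    for doc in docs:
        # Each document needs an action line followed by the source line
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
        if len(lines) >= 2 * chunk_size:
            yield "\n".join(lines) + "\n"
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"

# Each yielded body would be POSTed to /<index>/_bulk; a smaller
# chunk_size means smaller requests, at the cost of more round trips.
```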
Find below the settings used in the Bitergia prd envs:

```yaml
elasticsearch:
  restart: on-failure:5
  image: bitergia/elasticsearch:6.1.0-secured
  command: elasticsearch -Enetwork.bind_host=0.0.0.0 -Ehttp.max_content_length=2000mb
  environment:
    - ES_JAVA_OPTS=-Xms2g -Xmx2g
  ports:
    - 9200:9200
```
How is it possible that about 1GB of data in Mongo turns into 34GB in ES?
It is probably due to 3 main reasons:
1. If you inspect the data in Elasticsearch, you will see that the raw data retrieved from Scava is included in each item (attribute `scava`). This is kept for debugging reasons, but it can be removed.
2. Each Scava metric is "unpacked". A metric includes several measurements (e.g., one per day), and scava2es creates a document for each of these measurements. This replicates some fields (e.g., project), which can cause some overhead.
3. Elasticsearch indexes the data (beyond storing it), which probably adds a small overhead.
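To make the first two reasons above concrete, here is a hypothetical sketch (function and field names are illustrative, not the actual scava2es code) of dropping the debug payload and of the per-measurement fan-out that replicates shared fields:

```python
def slim_item(item):
    """Return a copy of an enriched item without the raw `scava`
    payload, which is kept only for debugging."""
    return {k: v for k, v in item.items() if k != "scava"}


def unpack_metric(project, metric_id, measurements):
    """One ES document per measurement, replicating the shared fields.
    This fan-out is how a compact Mongo metric can grow into many
    Elasticsearch documents."""
    return [
        {"project": project, "metric": metric_id,
         "date": m["date"], "value": m["value"]}
        for m in measurements
    ]
```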
Sorry for the TODO, I've not set it myself, I just assigned the dashboard project.
So I could tweak the length to 2000mb and raise the BULKSIZE a little.
The question is, how do I do that without losing all the data in ES? If you change the command
in the docker-compose file, it's not taken into account until you recreate the container (or am I missing something?), so I think I'm going to edit the files directly in the container.
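One common way around this (a sketch, not the setup from this thread) is to keep the index data on a named volume, so recreating the container after a command change doesn't wipe the indexes. This assumes the image stores data in the stock `/usr/share/elasticsearch/data` location:

```yaml
# Hypothetical fragment: with a named volume, `docker-compose up -d`
# can recreate the container while the index data survives.
elasticsearch:
  image: bitergia/elasticsearch:6.1.0-secured
  command: elasticsearch -Enetwork.bind_host=0.0.0.0 -Ehttp.max_content_length=2000mb
  volumes:
    - es-data:/usr/share/elasticsearch/data

volumes:
  es-data:
```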
> Sorry for the TODO, I've not set it myself, I just assigned the dashboard project.
No worries!
> The question is, how do I do that without loosing all the data in ...
What do you think about dumping and re-uploading the Elasticsearch data? I can share a script.
- Upload Elasticsearch indexes: https://gist.github.com/valeriocos/c5dd78b4e06cd73477e1873455f4f585
- Dump Elasticsearch indexes: https://gist.github.com/valeriocos/b9ac1bb70f7ff981554f3677582b0c9f
Thanks Valerio. Roughly speaking, the import rate to ES is about 51MB/min, which is not that bad... As you already know, the bigger issue is that the whole data is uploaded every time we need to update the dashboards (btw, the script is configured to run every 5 min by default :D)
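As a quick sanity check, the ~51MB/min figure is roughly consistent with the 11-hour, 34GB run described earlier:

```python
# Cross-check the reported import rate against the 11-hour run
total_mb = 34 * 1024   # ~34 GB stored in ES
minutes = 11 * 60      # ~11 hours of importing
rate = total_mb / minutes
print(round(rate, 1))  # ≈ 52.8 MB/min, close to the observed ~51 MB/min
```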
Performance fixed with https://github.com/crossminer/scava/pull/377
15 minutes to import the projects to ES. What else can I say?
Context: Recently, I noticed that the importer didn't make it to the end because ES was going away with the following error:

```
2019-09-24 16:16:18,195 Retrying (Retry(total=20, connect=21, read=7, redirect=5, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))': /scava-metrics/items/_bulk?refresh=true
```
I've been able to circumvent the problem by lowering the BULKSIZE value to 50. The import then ran to completion: it took about 11 hours to import all the project metrics to ES, and they represent about 34GB of data (!).
So I'm wondering about the following: 1) what's the problem with the BULKSIZE, and why does it make ES go away? 2) how is it possible that about 1GB of data in Mongo turns into 34GB in ES?
PS: I also tweaked the following options:

```
-Ehttp.max_content_length=1000mb
-Xms4g -Xmx4g
```

but I am unsure whether they have something to do with the problem; they just reduced the probability of the importer failing.