fabric8-analytics / fabric8-analytics-worker

fabric8-analytics worker for gathering raw data
GNU General Public License v3.0

graph_importer worker can slow down ingestion considerably #34

Closed · msrb closed this issue 7 years ago

msrb commented 7 years ago

It can take forever (almost 1 minute in this particular case) for the graph_importer worker to finish. I suspect it is waiting for a response from data-importer, which in turn is waiting for Gremlin.

If my assumption is correct, then the worker is essentially blocked and does nothing during this time. What's worse, there can be multiple workers processing tasks from the graph_importer queue, so in theory a significant portion of the system can be idle, just waiting for graph ingestion.

One solution could be to implement the changes described in the Trello card and then deploy and scale this particular worker separately.

[2017-06-05 07:21:09,246: DEBUG/MainProcess] Task accepted: selinon.SelinonTaskEnvelope[4b9e0b07-e5a7-4a51-bc73-cb2c1cb28a2b] pid:1
[2017-06-05 07:21:09,276: INFO/MainProcess] SELINON bayesian-worker-ingestion-5-g9sdy - TASK_START : {"details": {"dispatcher_id": "62e7ad4e-2693-44dd-af33-edaffbac4d12", "flow_name": "bayesianFlow", "node_args": {"_audit": {"ended_at": "2017-06-04T08:38:54.956815", "started_at": "2017-06-04T08:38:53.805331", "version": "v1"}, "_release": "maven:com.craterdog.java-security-framework:java-certificate-generation:3.15", "document_id": 329103, "ecosystem": "maven", "force": false, "force_graph_sync": false, "name": "com.craterdog.java-security-framework:java-certificate-generation", "version": "3.15"}, "parent": {"ResultCollector": "31e966ee-58a5-4ead-9ec2-38afecd768ce"}, "queue": "prod_ingestion_GraphImporterTask_v0", "task_id": "4b9e0b07-e5a7-4a51-bc73-cb2c1cb28a2b", "task_name": "GraphImporterTask"}, "event": "TASK_START", "time": "2017-06-05 07:21:09.276095"}
[2017-06-05 07:21:09,276: INFO/MainProcess] selinon.SelinonTaskEnvelope[4b9e0b07-e5a7-4a51-bc73-cb2c1cb28a2b]: Invoke graph importer at url: http://172.30.245.248:9192/api/v1/ingest_to_graph
[2017-06-05 07:21:09,278: DEBUG/MainProcess] Starting new HTTP connection (1): 172.30.245.248
[2017-06-05 07:22:02,699: DEBUG/MainProcess] http://172.30.245.248:9192 "POST /api/v1/ingest_to_graph HTTP/1.1" 200 248
[2017-06-05 07:22:02,701: INFO/MainProcess] selinon.SelinonTaskEnvelope[4b9e0b07-e5a7-4a51-bc73-cb2c1cb28a2b]: Graph import succeeded with response: {
  "count_imported_EPVs": 1, 
  "epv": [
    {
      "ecosystem": "maven", 
      "name": "com.craterdog.java-security-framework:java-certificate-generation", 
      "version": "3.15"
    }
  ], 
  "message": "The import finished successfully!"
}
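
The log above shows roughly 53 seconds between the POST being issued (07:21:09) and the response arriving (07:22:02); for that whole time the worker process is just waiting. A minimal sketch of the blocking pattern, assuming the task simply calls requests.post against the data-importer endpoint (the service URL, payload shape, and timeout below are illustrative, not the actual implementation):

```python
import requests

# Hypothetical sketch of the synchronous call made by GraphImporterTask.
# The endpoint path comes from the log above; everything else is assumed.
def ingest_to_graph(epv_list, base_url="http://data-importer:9192"):
    # The worker blocks here until data-importer (and, transitively, Gremlin) responds.
    response = requests.post(
        "{}/api/v1/ingest_to_graph".format(base_url),
        json=epv_list,
        timeout=120,  # without a timeout, a stuck backend would block the worker indefinitely
    )
    response.raise_for_status()
    return response.json()

# Example call, matching the EPV from the log:
# ingest_to_graph([{"ecosystem": "maven",
#                   "name": "com.craterdog.java-security-framework:java-certificate-generation",
#                   "version": "3.15"}])
```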
fridex commented 7 years ago

One solution could be to implement the changes described in the Trello card and then deploy and scale this particular worker separately.

+1, CC @miteshvp

msrb commented 7 years ago

Currently there are over 500 messages in the prod_ingestion_GraphImporterTask_v0 queue and ~200 messages are "in flight" (i.e. claimed by workers). The system is almost on hold. Zabbix shows CPU usage at ~2 Kmillicores (it's ~25 Kmillicores when the system is operating normally).

miteshvp commented 7 years ago

We can scale up graph_importer and gremlin_http in such cases. It will obviously be the bottleneck, since we have 1 graph_importer vs. 60 workers :)

vpavlin commented 7 years ago

@msrb +1 to implementing the Trello card

Are EPVs the only nodes you are creating in the graph? If yes, why does it take a minute to create one?

fridex commented 7 years ago

A temporary workaround is in #35.

We can scale up graph_importer and gremlin_http in such cases. It will obviously be the bottleneck, since we have 1 graph_importer vs. 60 workers :)

Note that the 60 workers (we have 45 now, BTW) also process other tasks, and a service should be able to handle 60 connections (even if we feed it all the analyses data; compare it to PostgreSQL). Anyway, the main issue is the design approach that was taken: graph imports should be a Selinon task, and having a separate service for this (acting only as an intermediary) doesn't make sense at all. Hopefully Gremlin is not the bottleneck in our infrastructure, but that's left up to you.
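
For the sake of the argument, here is a minimal sketch of what a direct import task could look like if the worker talked to the Gremlin Server HTTP endpoint itself instead of going through the data-importer service. The Gremlin URL, query, and property names are purely illustrative and do not reflect the actual fabric8-analytics graph schema:

```python
import os

import requests
from selinon import SelinonTask


class DirectGraphImportTask(SelinonTask):
    """Hypothetical sketch: import an EPV straight from the worker,
    cutting the data-importer intermediary out of the critical path."""

    GREMLIN_URL = os.environ.get("GREMLIN_HTTP_URL", "http://gremlin-http:8182")

    def run(self, node_args):
        # Illustrative traversal and bindings; the real schema differs.
        payload = {
            "gremlin": ("g.addV('Package')"
                        ".property('ecosystem', ecosystem)"
                        ".property('name', name)"
                        ".property('version', version)"),
            "bindings": {
                "ecosystem": node_args["ecosystem"],
                "name": node_args["name"],
                "version": node_args["version"],
            },
        }
        response = requests.post(self.GREMLIN_URL, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()
```

With the import logic inside the task itself, the queue for this task can be scheduled and scaled independently, and one network hop (plus one service to keep alive) disappears from the critical path.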

miteshvp commented 7 years ago

@fridex I think I have completely lost you. I guess it is not about connections; it is more about throughput. 45 workers processing at a rate of 30 seconds per E:P:V (excluding sync_to_graph) translates to a throughput of 1.5 packages per second. Based on past tests, one data_importer can handle only 0.5 packages per second. Hence you see it as the bottleneck, because requests pile up. And for the record, we should also consider network latency: gremlin-http or data_importer may still be OK CPU-wise, but in essence the graph_sync worker waits synchronously for data_importer, which in turn waits synchronously for gremlin-http, which in turn waits synchronously for DynamoDB to sync the record. I hope that answers your query, @vpavlin. The two seconds is an end-to-end response time, including network latency. @fridex, as far as the design approach is concerned, I recommended making data_importer part of the worker processes, but given the tight timelines it was not viable. But things can be different now :)
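
Spelling out the arithmetic behind these numbers (the figures are the ones quoted above, so treat this as a back-of-the-envelope check rather than a benchmark):

```python
# Back-of-the-envelope check of the throughput numbers quoted above.
workers = 45                  # worker pods feeding the ingestion flow
seconds_per_epv = 30.0        # per-EPV processing time, excluding sync_to_graph
demand = workers / seconds_per_epv      # ~1.5 packages/second arriving at graph sync
capacity = 0.5                          # packages/second one data_importer was measured to handle
backlog_growth = demand - capacity      # ~1 package/second piling up in the queue
print(demand, capacity, backlog_growth) # 1.5 0.5 1.0
```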

vpavlin commented 7 years ago

The AZs are usually under 1 millisecond apart in terms of latency

And that's between AZs; we are talking about latency inside a single AZ, so I would not look for bottlenecks in the network.

Anyway, it does not answer the question, @miteshvp.

miteshvp commented 7 years ago

Sure, @vpavlin, let me answer it. It does not take a minute to insert into the graph; it takes 2 seconds on average to insert one record, and that is end-to-end. Where are you getting the one-minute response time from?

fridex commented 7 years ago

@fridex I think I have completely lost you. I guess it is not about connections; it is more about throughput. 45 workers processing at a rate of 30 seconds per E:P:V (excluding sync_to_graph) translates to a throughput of 1.5 packages per second. Based on past tests, one data_importer can handle only 0.5 packages per second. Hence you see it as the bottleneck, because requests pile up.

Mitesh, is the bottleneck data_importer, or other parts of the package-to-graph sync? Have you run any benchmarks? Or are you saying that gunicorn or your code cannot handle 60 requests?

And for the record, we should also consider network latency: gremlin-http or data_importer may still be OK CPU-wise, but in essence the graph_sync worker waits synchronously for data_importer, which in turn waits synchronously for gremlin-http, which in turn waits synchronously for DynamoDB to sync the record. I hope that answers your query, @vpavlin. This one minute is an end-to-end response time, including network latency.

Huh...

@fridex, as far as the design approach is concerned, I recommended making data_importer part of the worker processes, but given the tight timelines it was not viable. But things can be different now :)

There was a discussion a few months back where you decided to create an "abstraction" for syncing data. Do you want me to find it for you? The final decision regarding data_importer as a task was made in https://github.com/baytemp/common/issues/42. I don't see you there.

vpavlin commented 7 years ago

See the top comment from @msrb; there is a log, and it shows the task start and task finish times.

miteshvp commented 7 years ago

That's not the average; that's just one task, probably slow because there is already a huge pile-up, and it includes your wait time. Let's talk about average numbers for the actual insert, the wait, and the latency.

vpavlin commented 7 years ago

Sounds good, can you collect and share the average numbers?

miteshvp commented 7 years ago

@vpavlin I was hoping you could help us get these numbers in production :) if there is a way in OpenShift. @fridex I had not documented my idea of making data_importer part of the Selinon tasks anywhere; it was more of a face-to-face discussion when Slavek visited India. And while you search for that discussion around the "abstraction" for graph_sync, please check the whole history, else you may miss the "why" part :) Overall, as you mentioned, for this issue we should move graph_sync into the Selinon family :)

fridex commented 7 years ago

And while you search for that discussion around the "abstraction" for graph_sync, please check the whole history, else you may miss the "why" part :)

So why was it?

fridex commented 7 years ago

Closing this, as we have separate workers for graph ingestion now. Feel free to reopen if this issue arises again.