Closed: msrb closed this issue 7 years ago.
It can take forever (almost 1 minute in this particular case) for the graph_importer worker to finish. I guess it's waiting for a response from data-importer, which in turn is waiting for gremlin.

If my assumption is correct, then the worker is basically blocked and does nothing during this time. What's even worse, there can be multiple workers processing tasks from the graph_importer queue, so in theory a significant portion of the system can be doing nothing, just waiting for graph ingestion.

One solution could be to implement the changes described in the trello card and then deploy and scale this particular worker separately.
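To illustrate the suspected failure mode, here is a minimal sketch (not the project's actual code; the data-importer URL and payload shape are hypothetical) of why a synchronous call blocks the worker for the whole downstream round trip:

```python
# Minimal sketch of the suspected blocking pattern: the worker makes one
# synchronous HTTP call and can do nothing else until the whole
# data-importer -> gremlin -> storage chain responds.
import requests

DATA_IMPORTER_URL = "http://data-importer:9192/api/v1/ingest"  # hypothetical endpoint

def import_to_graph(epv):
    # Blocks the worker for the full end-to-end latency of the downstream chain.
    response = requests.post(DATA_IMPORTER_URL, json=epv, timeout=120)
    response.raise_for_status()
    return response.json()
```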
+1, CC @miteshvp
Currently there are over 500 messages available in the prod_ingestion_GraphImporterTask_v0 queue and ~200 messages are "in flight" (i.e. messages claimed by workers). The system is almost on hold. Zabbix shows that CPU usage is ~2 Kmillicores (it's ~25 Kmillicores when the system is operating properly).
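For reference, these two numbers can be read straight from SQS; a quick sketch using boto3 (assuming AWS credentials and region are configured, and using the queue name from the comment above):

```python
# Read the "available" and "in flight" counts for the ingestion queue.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="prod_ingestion_GraphImporterTask_v0")["QueueUrl"]
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=[
        "ApproximateNumberOfMessages",            # available (not yet claimed)
        "ApproximateNumberOfMessagesNotVisible",  # "in flight" (claimed by workers)
    ],
)["Attributes"]
print(attrs)
```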
We can scale up graph_importer and gremlin_http in such cases. It obviously will be the bottleneck since we have 1 graph_importer vs. 60 workers :)
@msrb +1 to implementing the trello card
Are E, P, V the only nodes you are creating in the graph? If yes, why does it take a minute to create them?
A temporary workaround is in #35.
> We can scale up graph_importer and gremlin_http in such cases. It obviously will be the bottleneck since we have 1 graph_importer vs. 60 workers :)
Note that the 60 workers (we actually have 45 now) also process other tasks, and a service should be able to handle 60 connections even if we fed it all the analyses data (compare it to PostgreSQL). Anyway, the main issue is the design approach that was taken: graph imports should be a Selinon task; having a separate service for this (just as an intermediate) doesn't make sense at all. Hopefully Gremlin is not a bottleneck in our infrastructure, but that's left up to you.
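For illustration only, a rough sketch of what a graph-import Selinon task could look like (the class name, the gremlin-http URL, and the query/payload are made up; the SelinonTask base class with a run(node_args) method is Selinon's documented task interface):

```python
# Sketch: graph import as a Selinon task instead of a separate intermediate service.
import os
import requests
from selinon import SelinonTask

GREMLIN_URL = os.environ.get("GREMLIN_URL", "http://gremlin-http:8182")  # hypothetical

class GraphImportTask(SelinonTask):
    def run(self, node_args):
        # node_args would carry the E:P:V (ecosystem, package, version) to ingest.
        query = {
            "gremlin": "g.addV('package').property('name', name)",
            "bindings": {"name": node_args["package"]},
        }
        response = requests.post(GREMLIN_URL, json=query, timeout=60)
        response.raise_for_status()
        return response.json()
```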
@fridex I think I have completely lost you. I guess it is not about connections, but more about throughput. 45 workers processing at a rate of 30 seconds per E:P:V (excluding sync_to_graph) translates to a throughput of 1.5 packages processed per second. We have tested that one data_importer, based on past records, can handle only 0.5 packages per second. Hence you see it as a bottleneck, because requests pile up.

And for the record, we should also consider network latency. So gremlin-http or data_importer may still be OK CPU-wise; in essence the graph_sync worker is waiting synchronously for data_importer, which in turn is waiting synchronously for gremlin-http, which in turn is waiting synchronously for DynamoDB to sync the record. I hope that answers your query, @vpavlin. These two seconds are an end-to-end response time including network latency.

@fridex as far as the design approach is concerned, I recommended that data_importer be part of the worker processes, but given the crisp timelines, it was not viable. But things can be different now :).
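To spell out the arithmetic from the comment above (a back-of-the-envelope sketch using only the numbers quoted there):

```python
# Throughput check using the numbers from the comment above.
workers = 45
seconds_per_epv = 30                       # per-E:P:V time, excluding sync_to_graph
producer_rate = workers / seconds_per_epv  # ~1.5 packages/s arriving at graph sync
importer_rate = 0.5                        # measured capacity of one data_importer (pkg/s)

backlog_growth = producer_rate - importer_rate  # ~1.0 package/s piling up in the queue
print(f"arrival: {producer_rate:.1f} pkg/s, capacity: {importer_rate} pkg/s, "
      f"backlog grows by ~{backlog_growth:.1f} pkg/s")
```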
> The AZs are usually under 1 millisecond apart in terms of latency
And that's between AZs; we are talking about latency inside one AZ, so I would not look for bottlenecks in the network.
Anyway, it does not answer the question, @miteshvp.
Sure @vpavlin, let me answer it. It does not take a minute to insert into the graph. It takes 2 seconds on average to insert one record, and that is end-to-end. Where are you getting a one-minute response time from?
> @fridex I think I have completely lost you. I guess it is not about connections, but more about throughput. 45 workers processing at a rate of 30 seconds per E:P:V (excluding sync_to_graph) translates to a throughput of 1.5 packages processed per second. We have tested that one data_importer, based on past records, can handle only 0.5 packages per second. Hence you see it as a bottleneck, because requests pile up.
Mitesh, is the bottleneck data_importer or other parts of the package-to-graph sync? Have you evaluated any benchmarks? Or are you saying that gunicorn or your code cannot handle 60 requests?
> And for the record, we should also consider network latency. So gremlin-http or data_importer may still be OK CPU-wise; in essence the graph_sync worker is waiting synchronously for data_importer, which in turn is waiting synchronously for gremlin-http, which in turn is waiting synchronously for DynamoDB to sync the record. I hope that answers your query, @vpavlin. This one minute is an end-to-end response time including network latency.
Huh...
> @fridex as far as the design approach is concerned, I recommended that data_importer be part of the worker processes, but given the crisp timelines, it was not viable. But things can be different now :).
There was a discussion a few months back where you decided to create an "abstraction" for syncing data. Do you want me to find it for you? The final decision regarding data_importer as a task was made in https://github.com/baytemp/common/issues/42. I don't see you there.
See the top comment from @msrb - there is a log; see the task start and task finish times.
That's not the average. That's just one task, probably because there is already a huge pile-up, so it includes your wait time. Let's talk about average numbers for the actual insert, the wait, and the latency.
Sounds good, can you collect and share the average numbers?
@vpavlin I was hoping you could help us get these numbers in production :) if there is a way in OpenShift. @fridex so I had not documented my theory about data_importer being part of Selinon tasks anywhere; it was more of a face-to-face discussion when Slavek visited India. And while you search for that discussion around the "abstraction" for graph_sync, please check the whole history, else you may miss the "why" part :) Overall, as you mentioned, we should move graph_sync into the Selinon family :) for this issue.
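As a starting point for collecting those averages, a hedged sketch of a measurement script (the data-importer endpoint and the payload shape are hypothetical; it simply times the synchronous call for a sample of E:P:V records):

```python
# Measure average end-to-end time of the data-importer call for a sample of records.
import statistics
import time

import requests

DATA_IMPORTER_URL = "http://data-importer:9192/api/v1/ingest"  # hypothetical endpoint

def average_sync_time(epvs):
    """Return (mean, max) duration in seconds of the synchronous data-importer call."""
    durations = []
    for epv in epvs:
        start = time.monotonic()
        response = requests.post(DATA_IMPORTER_URL, json=epv, timeout=120)
        response.raise_for_status()
        durations.append(time.monotonic() - start)
    return statistics.mean(durations), max(durations)
```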
> And while you search for that discussion around the "abstraction" for graph_sync, please check the whole history, else you may miss the "why" part :)
So why was it?
Closing this as we have separate workers for graph ingestion now. Feel free to reopen if this issue arises again.