jkarpen opened this issue 2 months ago
The problem Ian stated is valid: the data relay was not designed to be "multi-tasking"; more specifically, it can only execute its planned tasks sequentially. Therefore, if the plan changes, such as in a backfill situation where we need to go back in history and redo the relay, the current data relay must either cancel the existing plan entirely and focus on the backfill, or continue executing the existing plan all the way to completion before any other task can run.
The root cause is the single queue design.
To mitigate the situation where the last task waits too long, I changed the design to use multiple task queues, as sketched below:
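A minimal sketch of the idea (hypothetical names, not the actual data relay code): each pipeline gets its own queue, and a single puller services the queues in round-robin order, so a long backfill can no longer starve the 30-second relay.

```python
from collections import deque

# Hypothetical per-pipeline queues; the real system uses Kafka topics.
queues = {
    "raw_30sec": deque(),
    "daily_config": deque(),
    "backfill": deque(),
}

def submit(queue_name, task):
    """Enqueue a task onto its pipeline-specific queue."""
    queues[queue_name].append(task)

def run_one_cycle():
    """Service each queue once per cycle so every pipeline makes progress."""
    for name, q in queues.items():
        if q:
            task = q.popleft()
            task()  # the pull-and-relay work would happen here

# Example: a multi-week backfill no longer blocks the 30-second relay,
# because each cycle still pops one task from every non-empty queue.
submit("backfill", lambda: print("backfill chunk"))
submit("raw_30sec", lambda: print("30-second relay batch"))
run_one_cycle()
```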
Thanks for the thoughtful discussion. Documenting the Q&A here:
Is this the only solution to the challenge? Answer: the other solution would be to increase the number of machines (or processes) that pull from the database, so the task queue completes faster. The problem is that the design would still be single-threaded, and increasing the data pulls would also put pressure on the production database.
Would this address the "logic separation" concern? Answer: Yes. The tasks from the 30-second relay, the daily config relay, and the backfill will each be treated fairly, because they are no longer assigned to a single queue, and each queue is guaranteed handling by the data puller.
What are the major risks or difficulties in executing this plan? Answer: beyond the basic coding, the major challenge is operational: the single queue has to be migrated to the multiple queues through a sequence of delicate Kafka operations, and this has to be done during a service-down period to eliminate any impact on production. From the dashboard, I can locate such an opportunity window to perform the operations without production impact.
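For illustration only, here is a sketch of what the topic-creation step of that migration could look like using kafka-python; the topic names, broker address, partition counts, and replication factors are assumptions, not the production values.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumed broker address for the sketch.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# One topic per pipeline, replacing the single shared queue.
new_topics = [
    NewTopic(name="relay-raw-30sec", num_partitions=1, replication_factor=1),
    NewTopic(name="relay-daily-config", num_partitions=1, replication_factor=1),
    NewTopic(name="relay-backfill", num_partitions=1, replication_factor=1),
]
admin.create_topics(new_topics=new_topics)

# During the service-down window, any tasks remaining on the old single
# topic would be drained and re-published to the appropriate new topic
# before the consumers are pointed at the new topics.
```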
The next step is to execute the plan.
Pingping successfully implemented the changes outlined above. As a next step before closing this issue, Pingping would like to implement a dashboard to track the performance of the different queues. Pingping will meet with @ian-r-rose when he returns next week for input on the KPIs to use.
Next step on this task is to document the code. Pingping will create a separate issue for creating a dashboard to track performance.
Per @pingpingxiu-DOT-ca-gov, there is no need to add a new dashboard; the existing dashboards should capture any issues. The next step on this issue is for Pingping and @ian-r-rose to meet and review the code.
Per @pingpingxiu-DOT-ca-gov this is waiting on the virtual environments PR to be completed. Then Pingping will submit a PR for this to be further reviewed before completion.
At one point the 30-second raw data pipeline was tightly coupled with the config table uploads. This meant that an incident in one pipeline could affect the others. As an example, in June there was an incident where the data relay server was down for a couple of weeks. It took almost a week of data crawling to recover, and the config table uploads were scheduled behind the 30-second data uploads. Because of the tight coupling, it took a long time to update the config tables (the scripts for which can run in under a minute), even though they are logically separate.
Going forward, the data relay server should be able to schedule the different parts of the pipeline independently so that incidents (or incident recovery) in one of them do not affect the others.
Note: there has been some refactoring of the upload scripts since the above incident, so the coupling may not be the same now.