Performance of table syncer scheduled task - tables_syncer.py

dosiennik commented 2 years ago

Hi Data.All Contributors :),

Got a question about a performance of the table syncer task which by default is scheduled to be run every 15 mins.

We have got a public data set which has about 12 000 tables. Seems like processing takes really a lot of time. As far as we saw in the code this is a sequential process which iterates through the datasets and its tables and executes bunch of SQL queries for removing all the columns for the table and recreating them. In the meantime every 15 mins another task is started while the other one is still in progress. At the end we ended up with bunch of tasks which consumed all of available IP addresses in the subnets and significantly increased the load on the RDS database.

Moving to the questions:

Are you aware of this sync process performance bottleneck? Have anyone already reported it? Seems like the task management design requires significant improvement.
Are there any plans/ideas for optimisation? It is a blocker for scaling.

Please share your thoughts.

Thanks in advance! :)

dlpzx commented 2 years ago

Hi @dosiennik, thanks for your question!

The first consideration is, why having such a database? It is an anomaly to have a use-case with such an amount of tables in a database. So before re-arching the application we can consider having a second look at the customer database design.

In any case, yes, we are aware of the limitations of the concurrent ECS tasks running and the range of IPs available. For this there are remediation or prevention alternatives that we can consider, such as modifying schedulers, ensuring a sufficient IP range...

It is definitely a topic in which there is room for improvement, so don't hesitate in providing ideas :)

twiechert commented 2 years ago

Hi @dlpzx ,

I am not quite sure I am following. As per my understanding the single task is iterating over all databases and tables (correct me if I'm wrong). Our database size is moderate and still we see happening.

In fact, we applied some of the outlined remediation strategies but in the mid-run we would instead seek a design where:

(I) a worker pool pulls sync task request from a queue (II) the queue is partitioned by dataset/database (III) (optional) have a configurable sync interval w_0 per dataset (what is the intuition of 15 mins by default?) (IV) the tasks have a realistic timeout (V) (optional) failed sync task requests are pushed to a DLQ (VI) sync task request retention is set to w_0

Suggested Services

SQS FIFO queue with messgage group id set to dataset name/id
Multiple EventBridge event generators writing to SQS
Autoscaled ECS task pool consuming from SQS

Do you have any thoughts about this? While this is something we could drive/contribute, we'd like to understand your perspective first.

dosiennik commented 1 year ago

Would be great to provide some more information for the community why it is not planned and how the maintainers see and advocate for the current design of the table syncer.

anmolsgandhi commented 1 year ago

Hi @dosiennik, Apologies for accidental closure of this issue, we appreciate your attention to it. We meant to label it for future-decision. Thank you for pointing it out, i have reopened it for discussion. Currently, its not prioritized but its on our roadmap for future discussion.

data-dot-all / dataall

Performance of table syncer scheduled task - tables_syncer.py #118