data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Limit amount of tables dataset can import #1491

Open zsaltys opened 1 month ago

zsaltys commented 1 month ago

One of our users imported a dataset with a few Glue tables, and then someone on their team accidentally ran a misconfigured Glue crawler that created 100,000+ Glue tables in their DB.

This caused a LOT of issues:

a) RDS CPU spiked to the maximum, which caused all sorts of problems, such as the database not being able to scale and nightly updates failing.

b) The table synchronizer for that dataset could never finish. Each run took so long that more and more syncer task instances piled up. Even after we deleted the tables in the Glue DB it still could not cope, because it kept trying to sync and was calling Lake Formation (LF) to fix permissions for 100,000 tables that no longer existed, and getting throttling errors.

We then tried to remove this dataset. Removing the shares was very difficult: with the new UI it's very hard to remove just the active S3 share because you can't find it among 100,000 tables, so we had to resort to the CLI to remove the share items and then delete the shares.

Once we had deleted the shares, we still couldn't delete the dataset. Eventually we had to manually delete the table records in RDS. Even that was hard because syncer tasks were locking the records, so we had to stop those first. We then had to run a custom script to clean ES, because the reindexer does not remove invalid / dead records.

Overall it's an absolute nightmare to solve this issue when something like this happens.

My proposal: let's have a configurable limit on how many tables a dataset can have, and let's default it to 100. This should be small enough for the syncer to finish running. We could also try to make the syncer more resilient, but IMO it's still bad to pollute a catalog with 100,000 tables.
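
To make the proposal concrete, here is a rough sketch of what such a guard could look like. It is not taken from the data.all codebase: `plan_table_sync`, `TableLimitExceeded`, `SyncPlan` and `DEFAULT_MAX_TABLES_PER_DATASET` are hypothetical names, and the input is just any iterable of Glue table names. The idea is to stop as soon as the limit is crossed, so the syncer never has to enumerate all 100,000 tables before refusing.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical default; in practice this would come from configuration.
DEFAULT_MAX_TABLES_PER_DATASET = 100


class TableLimitExceeded(Exception):
    """Raised when a dataset's Glue database holds more tables than allowed."""


@dataclass(frozen=True)
class SyncPlan:
    """Names of the tables the syncer is allowed to process."""
    tables: List[str]


def plan_table_sync(
    glue_table_names: Iterable[str],
    max_tables: int = DEFAULT_MAX_TABLES_PER_DATASET,
) -> SyncPlan:
    """Collect table names, failing fast once the limit is exceeded.

    The iterable is consumed lazily, so a paginated listing can be
    abandoned after max_tables + 1 names instead of being read in full.
    """
    names: List[str] = []
    for name in glue_table_names:
        names.append(name)
        if len(names) > max_tables:
            raise TableLimitExceeded(
                f"dataset has more than {max_tables} tables; refusing to sync"
            )
    return SyncPlan(tables=sorted(names))
```

With something like this in place, the syncer (and the import path) could catch `TableLimitExceeded`, mark the dataset as over the limit, and skip the RDS writes and LF permission calls entirely rather than retrying.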

zsaltys commented 1 month ago

@anmolsgandhi fyi