TonicAI / condenser

Condenser is a database subsetting tool
https://www.tonic.ai
MIT License

Enable better handling of Large databases with limits and pre-filtering #29

Closed mseverini closed 2 years ago

mseverini commented 2 years ago

Thank you so much for open-sourcing this project! This is an incredibly useful tool!

I am trying to use this on a relatively simple database that has a lot of data in it, and subsetting was taking a long time. Two small changes made the process a lot faster with few trade-offs:

1) Pre-filtering: The subsetter used to copy all of the data to a temporary table before applying any specified filters. Copying only the data that will actually be used is faster. In my case, I have some long time series data and, at least for testing, I really only need the most recent few percent. By applying the filters before copying to the temporary table, my copies got much faster. As a side benefit, if your target database has some corrupted data, you can filter it out without the subsetter needing to worry about it.

2) Limits: This is a bit of a hammer, but I think it is useful for a few reasons. First, while you are developing or debugging your config file, you can iterate faster. Second, it enables using the tool on databases that have one or two very large tables (logs, for example).
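
To illustrate the two ideas above, here is a minimal sketch that uses Python's built-in sqlite3 module rather than the actual code in this PR. The helper `copy_subset`, the `readings` table, and the filter `day >= 950` are all hypothetical names chosen for the example. The filter and the limit are put into the SELECT that populates the temporary table, so unused rows are never copied.

```python
import sqlite3


def copy_subset(conn, source, dest, where=None, limit=None):
    """Copy rows from `source` into a new temp table `dest` (hypothetical helper).

    The filter and limit are part of the SELECT itself, so rows that
    will not be used are never copied. Table names and the WHERE text
    are assumed to come from a trusted config file, not user input.
    """
    sql = f"CREATE TEMP TABLE {dest} AS SELECT * FROM {source}"
    if where:
        sql += f" WHERE {where}"
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    conn.execute(sql)
    return conn.execute(f"SELECT COUNT(*) FROM {dest}").fetchone()[0]


# Hypothetical time series table with 1000 rows (days 0..999).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (id INTEGER PRIMARY KEY, day INTEGER, value REAL)"
)
conn.executemany(
    "INSERT INTO readings (day, value) VALUES (?, ?)",
    [(day, day * 0.5) for day in range(1000)],
)

# Pre-filtering: only the most recent 5% of rows reach the temp table.
filtered = copy_subset(conn, "readings", "recent", where="day >= 950")

# Limit: cap the copy at 10 rows, on top of the same filter.
capped = copy_subset(
    conn, "readings", "recent_capped", where="day >= 950", limit=10
)

print(filtered, capped)  # prints: 50 10
```

Note that a LIMIT without an ORDER BY returns an arbitrary set of rows, which is usually acceptable for iterating on a config but may not be for a subset that needs to be reproducible.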

Thanks again. Happy to discuss or support in any way I can.

theaeolianmachine commented 2 years ago

Thanks again for submitting @mseverini!