alan-turing-institute / sqlsynthgen

Synthetic data for SQL databases
MIT License

Two-stage src-stats #123

Closed mhauru closed 1 year ago

mhauru commented 1 year ago

We converged on the following solution for doing DP queries despite the limitations of snsql:

  1. Run one src-stats query on the SQL server, get the results to the local machine.
  2. If no DP is needed, these results can be used as the output directly. If DP is desired, run a second query on the results of the first, using snsql. This utilises snsql's ability to run queries on pandas DataFrames.
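The two steps above can be sketched roughly as follows. This is only an illustration of the flow, not sqlsynthgen code: to keep it runnable with the standard library alone, an in-memory SQLite database stands in for the SQL server, a list of rows stands in for the pandas DataFrame, and a hand-written Laplace-noise count stands in for the snsql query. All function, table and column names here are hypothetical.

```python
import random
import sqlite3

# Hypothetical stage-1 query: any SQL the server supports, here with a join.
DEMO_QUERY = """
    SELECT DISTINCT p.id, p.ward
    FROM person p JOIN visit v ON v.person_id = p.id
"""


def make_demo_db():
    """Build a tiny in-memory database standing in for the real SQL server."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, ward TEXT)")
    conn.execute("CREATE TABLE visit (person_id INTEGER, day INTEGER)")
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)",
        [(1, "A"), (2, "A"), (3, "B"), (4, "B")],
    )
    conn.executemany(
        "INSERT INTO visit VALUES (?, ?)", [(1, 1), (1, 2), (2, 1), (3, 5)]
    )
    return conn


def run_src_stats(conn, query):
    """Stage 1: run an ordinary SQL query and pull the rows to local memory."""
    return conn.execute(query).fetchall()


def dp_count_by_group(rows, epsilon, rng):
    """Stage 2 stand-in: Laplace-noised count per group over in-memory rows.

    Assumes each row is (person_id, group) with at most one row per person,
    so adding or removing one person changes one count by 1 (sensitivity 1).
    """
    counts = {}
    for _person_id, group in rows:
        counts[group] = counts.get(group, 0) + 1
    # The difference of two exponentials with rate epsilon is Laplace
    # distributed with scale 1 / epsilon.
    return {
        group: n + rng.expovariate(epsilon) - rng.expovariate(epsilon)
        for group, n in counts.items()
    }


def two_stage(conn, query, epsilon=None, rng=None):
    """Return the raw stage-1 rows if no DP is requested, else noisy counts."""
    rows = run_src_stats(conn, query)
    if epsilon is None:
        return rows
    return dp_count_by_group(rows, epsilon, rng or random.Random())
```

Calling `two_stage(make_demo_db(), DEMO_QUERY)` returns the plain stage-1 rows, while passing `epsilon=1.0` returns a dictionary of noisy per-ward counts. In the approach described above, the second stage would be an snsql query over a DataFrame rather than this hand-rolled mechanism.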

This is like the intermediate-tables idea we've discussed for a while now, except that the intermediate results never exist as tables in any SQL database, only as a DataFrame in memory.

In some scenarios this might result in a lot of data being pulled into the local machine's memory just to run the DP query on it. This is unfortunate, but we can't see a way of avoiding it: we haven't found a way to run DP queries directly on the SQL server when the query involves SQL features that snsql doesn't support.

The main alternative investigated was the pipeline-dp package, which would interface with Spark and/or Beam, but Iain pointed out that there really isn't any advantage in that over using snsql on pandas DataFrames. The other DP packages I looked at were either missing important features or much more low level.

mhauru commented 1 year ago

Done