gda-score / code

Tools for generating General Data Anonymity Scores (www.gda-score.org)
MIT License
7 stars 1 forks source link

Build dynamic diff priv system using https://github.com/uber/sql-differential-privacy #6

Closed yoid2000 closed 5 years ago

yoid2000 commented 6 years ago

There is an open source differential privacy tool at:

https://github.com/uber/sql-differential-privacy

It is associated with an Uber/Berkeley research project, and supposedly is the basis for Uber's internal implementation of differential privacy.

I would like to be able to query this system over the RESTful interface built by Rohan.

My intern Ali Reza (alimalik113@gmail.com) has played with this system in the past, so you can talk to him about any issues he had and how to interface the tool with the database. Or you can figure it out yourself, whatever you prefer.

Basic setup should be:

client <--> server <--> uber_tool <--> database

The databases are on db001.gda-score.org. There are 12 such databases (all starting with raw_). I don't know if the uber tool will connect to those or if you'll have to for instance copy them to some other format.

What I'd like you to do is get to the point where a client can send a query to the server and receive an answer back from the server.

The server <--> uber_tool interface is entirely up to you.

Please put the code in code/anon-methods/uber

Let me know if you have questions.

rbh-93 commented 6 years ago

I think I will have to add some code to the uber_tool so that it can handle the sql queries from the simpleServer. Should I make changes in sql-differential-privacy/src/main/scala/examples/ElasticSensitivityExample.scala and
sql-differential-privacy/src/main/scala/examples/QueryRewritingExample.scala only?

There are other query analyses (in the Analysis Framework) like com.uber.engsec.dp.sql.AbstractAnalysis.analyzeQuery. Do I need to integrate the Analysis Framework also?

yoid2000 commented 6 years ago

If possible I would like to avoid modifying the uber repository.

Why is it necessary to change the uber repo to recognize sql from simpleServer, versus writing some glue code that gives uber the query syntax it expects?

rbh-93 commented 6 years ago

For example, the ElasticSensitivityExample.scala and the QueryRewritingExample.scala at the moment have a static query defined. Now if I send some query from the simpleServer, there is nothing defined in the uber_tool to handle this incoming query. I had one idea of sending the query from simpleServer as a GET request and creating an endpoint in uber_tool to handle this request. But this would mean changing the tool and involves a lot of rework in the uber_tool code as one file may have multiple dependencies. Is there any other way you can suggest?

yoid2000 commented 6 years ago

How does one currently input a query to the uber thing?

On Mon, Oct 22, 2018 at 7:37 PM Rohan notifications@github.com wrote:

For example, the ElasticSensitivityExample.scala and the QueryRewritingExample.scala at the moment have a static query defined. Now if I send some query from the simpleServer, there is nothing defined in the uber_tool to handle this incoming query. I had one idea of sending the query from simpleServer as a GET request and creating an endpoint in uber_tool to handle this request. But this would mean changing the tool and involves a lot of rework in the uber_tool code as one file may have multiple dependencies. Is there any other way you can suggest?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-431908751, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qR6Iqir_QJX8fmoKQzdBkBzdbQCeks5ungItgaJpZM4XxCGj .

rbh-93 commented 6 years ago

At the moment both QueryRewriting and ElasticSensitivity scripts (these are responsible for adding the noise) have only static queries written in the source code. There isn't any implementation of handling queries dynamically. As for the Analyses Framework (which analyzes queries based on pre-defined analysis techniques), the method we have to invoke is analyzeQuery. This analyzeQuery method extends a base class AbstractAnalysis and takes a string containing a SQL statement as an argument.

yoid2000 commented 6 years ago

How many queries can be placed in the source code?

Does it make sense to just modify the source code file that contains the query with each new query? Would that be the simplest thing to do?

rbh-93 commented 6 years ago

I took a look at the source code for adding noise and query rewriting. It can run one query at a time and technically you can place any number of queries in the source code and keep them commented. While executing, we can use only the query which is required. This is the quickest and simplest solution at the moment if you don't want to change anything in the tool but not the most optimal one.

yoid2000 commented 6 years ago

Let's do the simple solution. I guess for each query write it into the file and execute. I suppose the result is put in some other file? If so then read this, format, and return.

If it is too painfully slow we can reconsider...

On Wed, Oct 24, 2018 at 3:35 PM Rohan notifications@github.com wrote:

I took a look at the source code for adding noise and query rewriting. It can run one query at a time and technically you can place any number of queries in the source code and keep them commented. While executing, we can use only the query which is required. This is the quickest and simplest solution at the moment if you don't want to change anything in the tool but not the most optimal one.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-432658234, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qYur8FHSuQWOhrlLPk8wwEy8IPwHks5uoGyVgaJpZM4XxCGj .

rbh-93 commented 6 years ago

Which database should I work on and what kind of queries to run? "then read this, format, and return." - can you please clarify a bit more? The results are just displayed via a print statement, not put in any file.

yoid2000 commented 6 years ago

The uber tool should be configured to work with 8 databases, those starting with raw_, excluding those ending in _pub.

The queries will be submitted by the client (json), passed to the server which puts it in the file uber expects (sql I presume), executes the uber code to run the query, and somehow returns the answer back to the client.

I presume there are some limitations on what SQL can be used, but I don't know what this is ... you'll have to tell me.

It would be convenient if the answer returned to the client has the same format as the queries that we currently make on the postgres odbc interface, which is a list of lists.

For now, could you show me several examples of queries and the corresponding answers as the uber tool produces them? Then I can tell you how they should be formatted.

On Wed, Oct 24, 2018 at 4:02 PM Rohan notifications@github.com wrote:

Which database should I work on and what kind of queries to run? "then read this, format, and return." - can you please clarify a bit more?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-432669907, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qaYgCkGhZhZwIBTsE9XGFFPsMP3bks5uoHLXgaJpZM4XxCGj .

rbh-93 commented 6 years ago

Thank you for the clarification. I will start working on it.

yoid2000 commented 6 years ago

ok, good luck. Let me know if you run into any doubts.

On Wed, Oct 24, 2018 at 4:44 PM Rohan notifications@github.com wrote:

Thank you for the clarification. I will start working on it.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-432686788, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qcrVnMVha7qP69oV-HHsmb4xLmf0ks5uoHy4gaJpZM4XxCGj .

rbh-93 commented 6 years ago

UPDATE:

Tasks completed till now:

Tasks remaining:

yoid2000 commented 6 years ago

Is it possible for you to come meet me this afternoon or tomorrow afternoon? I'd like to discuss this a bit.

PF

On Mon, Oct 29, 2018 at 12:27 AM Rohan notifications@github.com wrote:

UPDATE:

Tasks completed till now:

  • Made some changes to the simpleClient and simpleServer scripts. Will push that to Git.
  • Uber Tool configured to read the JSON file generated by simpleServer.
  • The Uber Tool is connected to a database on my local machine at the moment and is running queries. Results are getting generated. There are some queries which can't be run like you had mentioned.

Tasks remaining:

  • Need to define all the raw_ DB schemas in a file inside the Uber Tool. I will complete this today.
  • At the moment I am stuck on a change I need to make in Uber Tool so that it can extract the SQL query from the JSON file generated by simpleServer. There seems to be a lot of external libraries to do it like Play or Spark but I'm trying to figure out a way such that least amount of changes have to be made to the tool source code.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-433752478, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qVel3D5qUqKGYCGhzNzllShtHwrwks5upj1TgaJpZM4XxCGj .

rbh-93 commented 6 years ago

I can go tomorrow afternoon.

yoid2000 commented 6 years ago

ok see you then

On Mon, Oct 29, 2018 at 12:20 PM Rohan notifications@github.com wrote:

I can go tomorrow afternoon.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-433874811, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qQ7bJNZizlnUTFGDXCov3DXTZ6H2ks5upuSKgaJpZM4XxCGj .

yoid2000 commented 6 years ago

Is it possible to execute the following SQL query on the Uber system?

SELECT col1, col2, count(*)
FROM table
GROUP BY 1,2

Well, I notice that the documentation says this:

Hmmmm, I wonder how it can be overridden....

yoid2000 commented 6 years ago

Or for that matter, can you do any query that shows the column value, even without group by. For instance just this:

select col from table
yoid2000 commented 6 years ago

How is it going? If this task will not be done in the next few days (meaning that I'll be able to make queries to the Uber system from my client machine), then could you install it on db001.gda-score.org so that I can at least log in to that machine and execute it there?

rbh-93 commented 6 years ago

Hi, the task is almost done. The final part I need to do is write the schema of all the 8 raw_ databases into the config file in UBER Tool. I will run the queries you mentioned and let you know asap. Apart from this there were a few things I want to discuss before pushing the final code to Git. Can we meet tomorrow morning?

yoid2000 commented 5 years ago

Yes lets meet tomorrow. I should be in by around 10AM. 11AM is a bit more reliable.

PF

On Wed, Nov 7, 2018 at 4:09 PM Rohan notifications@github.com wrote:

Hi, the task is almost done. The final part I need to do is write the schema of all the 8 raw_ databases into the config file in UBER Tool. I will run the queries you mentioned and let you know asap. Apart from this there were a few things I want to discuss before pushing the final code to Git. Can we meet tomorrow morning?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-436655123, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qc_fdpObeNh7WcS27C5MgQhnPv5gks5usvecgaJpZM4XxCGj .

rbh-93 commented 5 years ago

Is it possible to execute the following SQL query on the Uber system?

SELECT col1, col2, count(*)
FROM table
GROUP BY 1,2

Well, I notice that the documentation says this:

  • Histogram queries (using SQL's GROUP BY) must be handled carefully so as not to leak information in the bin labels.
  • The analysis throws an error to warn about this, but this behavior can overridden if you know what you're doing.

Hmmmm, I wonder how it can be overridden....

I ran the following query: "Query sent by client: SELECT account_id, count(*) FROM accounts GROUP BY account_id" Private result: 2850.0 The exception was: "This query returns a histogram bin column ('account_id') that is not safe for release. The bin labels must be made differentially private, which requires additional data model knowledge. If you know what you're doing, you can disable this message by setting canRelease=true for columns of protected tables that are safe for release, or setting flag [dp.elastic_sensitivity.check_bins_for_release] = false to disable this check."

rbh-93 commented 5 years ago

Or for that matter, can you do any query that shows the column value, even without group by. For instance just this:

select col from table

"Query sent by client: SELECT account_id FROM accounts" Private result: 11382.0 Exception: "UnsupportedQueryException: This analysis works only on counting queries"

Another small thing I noticed while trying out different queries, aliases are not supported by Uber Tool.

rbh-93 commented 5 years ago

I have made a folder called sql-differential privacy in gda-score/code and pushed the code there. Please let me know if everything is alright.

rbh-93 commented 5 years ago

architecture

This is a diagram to depict the data flow between the Client/Server and Uber Tool.

yoid2000 commented 5 years ago

shouldn't the simple server be reading the result.txt? (after all, the client is on a different machine and can't read that file)

On Sun, Nov 11, 2018 at 6:54 PM Rohan notifications@github.com wrote:

[image: architecture] https://user-images.githubusercontent.com/30060044/48316369-06d8ab80-e5e3-11e8-98f5-81945371480e.jpg

This is a diagram to depict the data flow between the Client/Server and Uber Tool.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-437690371, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qfNE1TqAzHud40toAkRvZUipHj_8ks5uuGQqgaJpZM4XxCGj .

rbh-93 commented 5 years ago

Sorry, my mistake. I corrected it. Here is the new architecture diagram. I pushed the corrected code to Git. architecturejpeg

yoid2000 commented 5 years ago

One more comment. I would expect there to be only one response from the server to the client per query. Your drawing suggests there are two...

On Mon, Nov 12, 2018 at 5:18 PM Rohan notifications@github.com wrote:

Sorry, my mistake. I corrected it. Here is the new architecture diagram. [image: architecturejpeg] https://user-images.githubusercontent.com/30060044/48359235-ce95a380-e69c-11e8-8801-269c8aa4f1f6.jpg

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-437934968, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qTVfLVgVKmxvcDgugShxdcujuuw7ks5uuZu3gaJpZM4XxCGj .

rbh-93 commented 5 years ago

The server sends the query result along with the http response code (200 for OK, 500 for Internal Server Error, etc.) for the request sent by the client. I included that so that in case of any error in connection between client and server we can know what went wrong. Should I remove that?

yoid2000 commented 5 years ago

No of course you shouldn't remove it. I just wanted to be clear that both the data and the response code are sent in the same response, not two separate responses.

On Tue, Nov 13, 2018 at 12:56 PM Rohan notifications@github.com wrote:

The server sends the query result along with the http response code (200 for OK, 500 for Internal Server Error, etc.) for the request sent by the client. I included that so that in case of any error in connection between client and server we can know what went wrong. Should I remove that?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gda-score/code/issues/6#issuecomment-438240669, or mute the thread https://github.com/notifications/unsubscribe-auth/ACD-qWcdP32UhNo9P70QjAlzItduag8pks5uurNlgaJpZM4XxCGj .

rbh-93 commented 5 years ago

Yes, the status code is displayed with the queryResult. Not as two separate responses. Please check and let me know if there is anything else you want me to change.