Closed: macumber closed this issue 4 years ago
My thinking at the moment is that, instead of making the repo private, we replace any sensitive data in the PoC with a synthetic version that shares similar properties. For example, for NMBSE, we could generate synthetic values that are distributed like the real values from the private CSV data.
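As a rough sketch of the synthetic-value idea (not the PoC's actual code): the hypothetical `synthesize` helper below draws new values by inverse-CDF sampling, interpolating linearly between the sorted real values. The function name and the sample numbers are assumptions for illustration. Note that this only mimics the shape of the distribution; it still reveals the real minimum and maximum and is not a formal privacy guarantee.

```python
import random


def synthesize(real_values, n, seed=0):
    """Draw n synthetic values distributed roughly like real_values.

    Uses inverse-CDF sampling with linear interpolation between the
    sorted real values. Hypothetical helper, for illustration only.
    """
    xs = sorted(real_values)
    if len(xs) < 2:
        raise ValueError("need at least two real values")
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        # Pick a random position along the sorted values.
        pos = rng.random() * (len(xs) - 1)
        lo = int(pos)
        frac = pos - lo
        # Interpolate between the two neighbouring real values.
        out.append(xs[lo] + frac * (xs[lo + 1] - xs[lo]))
    return out


# Made-up numbers standing in for one column of the private CSV.
real = [12.0, 15.5, 18.2, 30.1, 44.7]
synthetic = synthesize(real, n=1000, seed=42)
```

The synthetic column can then be written out in place of the real one before the data is committed to the public repo.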
The reasoning is that the PoC does not enforce a global privacy budget, so a user who issues enough queries can learn the underlying data to arbitrary precision. I'd rather not take on the burden of locking down the PoC with access control and proper security protocols at this time, not to mention that doing so adds friction to demos of the PoC. My thinking is that it's better for the PoC to be easily shared and used by those interested at this stage of the project.
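For context, enforcing a global budget would look roughly like the sketch below, which the PoC does not currently do. It assumes basic sequential composition, where the epsilon values of successive queries add up; the `PrivacyBudget` class and `BudgetExhausted` exception are hypothetical names, not part of the PoC.

```python
class BudgetExhausted(Exception):
    """Raised when a query would exceed the global privacy budget."""


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Charge one query's epsilon, refusing it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total:
            raise BudgetExhausted(
                f"requested {epsilon}, only {self.total - self.spent} left"
            )
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.4)  # first query is allowed
budget.spend(0.4)  # second query is allowed
# A third 0.4 query would raise BudgetExhausted.
```

Without a check like this, nothing stops a user from repeating a noisy query and averaging the answers.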
This, by the way, is the approach I have taken over the last year with design docs/notebooks so that we don't have to worry about accidentally sharing sensitive information.
@mcgeeyoung any thoughts on the matter?
The concern here is that the lat/lon of buildings in the data set are protected. If we want to show those on a map, that could be a problem. If we only show the country, then I guess that is not an issue. Are you suggesting that we only show the country, or that we add some sort of noise to generate new random lat/lons?
Couple of ways to approach it!
If, say, we're doing this for San Francisco, I could use some of the public location data from the Building Performance Database instead, replacing locations in the sensitive dataset with locations from the public dataset. This way, it renders nicely on the map but doesn't have any connection to the sensitive locations.
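A minimal sketch of that swap, assuming each building is a dict with `lat` and `lon` keys and the public locations are a list of `(lat, lon)` pairs. The `swap_locations` helper, the key names, and the coordinates are all hypothetical, chosen only to illustrate the idea.

```python
import random


def swap_locations(rows, public_locations, seed=0, lat_key="lat", lon_key="lon"):
    """Return copies of rows with lat/lon replaced by random public locations.

    The replacement is drawn independently of the original coordinates,
    so the output locations carry no information about the sensitive ones.
    """
    if not public_locations:
        raise ValueError("need at least one public location")
    rng = random.Random(seed)
    swapped = []
    for row in rows:
        lat, lon = rng.choice(public_locations)
        new_row = dict(row)  # copy so the sensitive input is not mutated
        new_row[lat_key] = lat
        new_row[lon_key] = lon
        swapped.append(new_row)
    return swapped


# Made-up example data: two sensitive rows, three public San Francisco points.
sensitive = [
    {"building_id": "a", "lat": 1.30, "lon": 103.80},
    {"building_id": "b", "lat": 1.35, "lon": 103.90},
]
public = [(37.7749, -122.4194), (37.8024, -122.4058), (37.7596, -122.4269)]
demo_rows = swap_locations(sensitive, public, seed=7)
```

Everything except the two coordinate fields is carried over unchanged, so the rest of the PoC would behave the same.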
Hi Marc, so that means all other results will be based on the BDGP data set, and only the "locations" on the map will be based on the San Francisco project, correct? I like the idea, but I'd have to go back and check the documents to see whether that meets the purpose.
If other locations would be helpful, I can probably find some public data to swap in. For example, the Building Performance Database has a bunch of public datasets that would be useful for this purpose: https://bpd.lbl.gov/#explore
closing.
All data should be in a private repo until we can show it to Clayton Miller and discuss it with him. The URL to the prototype should not expose any of the underlying CSV data.