Closed: yoid2000 closed this issue 1 year ago
In F# the salt was generated by hashing the input file name. We don't have a file name anymore, just a DataFrame with the data. We could ask the caller for a unique name that identifies the data or, if none is provided, we could hash the input data. If we generate a random salt each time, the results will always differ. If we generate one per system, we have to store it somewhere and it will be the same for different inputs. Not ideal in any case.
> In F# the salt was generated by hashing the input file name.
I misremembered this. I just checked, and we actually hash the data itself. How about we do the same in the Python implementation?
Hashing the data itself doesn't make sense here, because the user may give different columns from the same dataset, and we would like all of these to have the same salt. (In the F# case, by contrast, we always supplied the whole CSV file. One of the reasons we started using a per-table salt is that different users with different implementations might anonymize the same CSV file, and it is good if they all use the same salt.)
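To make the concern concrete, here is a minimal sketch (plain dicts stand in for a DataFrame, and the hashing scheme is an assumption, not the library's actual behavior): hashing only the columns the user selects yields a different salt for each selection, even though the underlying dataset is the same.

```python
import hashlib

def salt_from_columns(columns):
    """Derive a salt by hashing column names and values (illustrative only)."""
    digest = hashlib.sha256()
    for name in sorted(columns):
        digest.update(name.encode())
        for value in columns[name]:
            digest.update(repr(value).encode())
    return digest.digest()

data = {"a": [1, 2], "b": [3, 4]}
full = salt_from_columns(data)
subset = salt_from_columns({"a": data["a"]})
# Selecting a subset of columns produces a different salt than the
# full dataset, which is exactly the problem described above.
assert full != subset
```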
Since we can't do that here, I think a good solution is to have one salt per installation (similar to how we did it for Aircloak). For now, we can just store the salt in the same place we currently store it (the python file in the local distribution). This breaks when the user upgrades the library, but I think we can live with that for now. (we can consider storing the salt in some file or environment variable somewhere, but let's not worry about that for now). If a user is really concerned about this, they can just make a local copy of the salt.
Given this, maybe we should also have a couple of API calls like `import_salt(salt)` and `salt = export_salt()`, so that the user can manage the salt...
I'm not yet sure how file storage for a Python library should best be done, I'll look into it.
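A hedged sketch of what those two calls could look like. The names `import_salt` and `export_salt` come from the proposal above; the module-level variable used for storage is purely an assumption for illustration.

```python
# Module-level storage is an assumption; a real implementation would
# persist the salt somewhere (file, config folder, etc.).
_current_salt = b""

def import_salt(salt):
    """Set the salt used for subsequent anonymization runs."""
    global _current_salt
    _current_salt = salt

def export_salt():
    """Return the salt currently in use, e.g. to copy it to another system."""
    return _current_salt
```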
> Since we can't do that here
Actually, Edon's initial API proposal was requiring the entire dataset as input and separate arguments with the data and AID columns names. I changed it to the current design in order to simplify it, but we could go back to it in order to have the entire dataset available for hashing. Or just provide another API that creates the salt from the dataset.
I prefer the current design. Besides being more natural for dataframes, people will need to do things like chop up the dataframe in order to avoid memory problems with large datasets.
Unless you have a major reason why we shouldn't 1) generate an implementation-wide good salt upon first use, and 2) provide import/export routines to allow management of the salt, then let's go with that design.
> generate an implementation-wide good salt upon first use
Ok, but keep in mind that the first time the library is used with the default salt, the user will need root/admin rights in order to write to the system config folder.
> provide import/export routines to allow management of the salt
The current salt can be set when initializing the Synthesizer object and read afterwards.
For the default salt, the config file can be moved from one system to another.
Not sure if anything more will be needed here...
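A hypothetical usage sketch of the set-at-init, read-afterwards behavior described above. The `Synthesizer` class here is a stand-in, not the library's actual implementation; the constructor argument name `salt` and the fallback behavior are assumptions.

```python
import secrets

class Synthesizer:
    """Stand-in for the real Synthesizer, for illustration only."""
    def __init__(self, salt=None):
        # Fall back to a fresh random salt when the caller supplies none.
        self.salt = salt if salt is not None else secrets.token_bytes(16)

# The salt is set at initialization and readable afterwards:
syn = Synthesizer(salt=b"team-shared-salt")
assert syn.salt == b"team-shared-salt"
```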
> Ok, but keep in mind that the first time the library will be used with a default salt, the user will need to have root/admin rights in order to write to the system config folder.
Good point. Not sure how to avoid this.
> The current salt can be set when initializing the Synthesizer object and read afterwards.
This would also require that the user has the appropriate rights, no? Likewise, installing syndiffix-py at all requires the necessary permissions.
So isn't it the case that we have two scenarios: (1) a system-wide installation shared by multiple users, and (2) a per-user installation?
The second case is not a problem ... the salt will be correctly set on first use.
The first case really requires that the sysadmin set a salt upon installation, and if we care about this case (I don't think we should until proven otherwise), then we would presumably put in some kind of call that lets the sysadmin set the salt???
Plus I guess there are the usual other ways of doing this ... putting a file reachable from the users home directory that holds the salt, etc...
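The home-directory option mentioned above could be sketched like this. The file path, filename, and hex encoding are all assumptions for illustration, not anything the library actually does.

```python
import secrets
from pathlib import Path

def load_or_create_salt(path=None):
    """Read the per-user salt from a dotfile, creating one on first use.

    No root/admin rights are needed, since the file lives under the
    user's home directory.
    """
    path = path or Path.home() / ".syndiffix_salt"  # filename is a guess
    if path.exists():
        return bytes.fromhex(path.read_text().strip())
    salt = secrets.token_bytes(16)
    path.write_text(salt.hex())
    return salt
```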
> This would also require that the user has the appropriate rights, no?
No, the current salt can be set and read in memory, no special rights are needed. The default salt has to be written somewhere, so we can remember it in the future. Writing it to a global location requires root/administrator rights.
> Likewise to install syndiffix-py at all requires the necessary permissions.
The library can be installed per user, so no root/administrator rights are needed for that.
> The second case is not a problem ... the salt will be correctly set on first use.
It is a problem, because it needs to be written in a global location so that all users in the system use the same salt.
> Plus I guess there are the usual other ways of doing this ... putting a file reachable from the users home directory that holds the salt, etc...
So you actually want to store the default salt per-user? Maybe I misunderstood you...
Let's discuss at today's weekly.
> No, the current salt can be set and read in memory
I don't understand this ... how does the salt get into memory in the first place?
The pure requirement is that a given dataset always be subject to the same salt. But it would be a fair amount of work to satisfy this properly, because different people may work on the same dataset. So I'm not looking for a complete solution at this time.
I think a per-installation salt is a reasonable compromise for now...but let's discuss.
Do we have anything that configures the salt?
It seems to me that we should automatically create a good salt the first time syndiffix-py is run.
I see that the typical usage for obtaining the seed is like this:
What we could do instead is to use a `get_salt()` routine instead of `anon_params.salt`, which always checks to see if the salt is set to something other than the default, and if not it sets it to a cryptographically strong value. Here is a library for that: https://docs.python.org/3/library/secrets.html
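A minimal sketch of that `get_salt()` idea, using the `secrets` module linked above. The module-level variable standing in for "the default" is an assumption; in the real library the check would presumably be against `anon_params.salt`.

```python
import secrets

_salt = None  # None stands in for "still the default, i.e. not configured"

def get_salt():
    """Return the salt, generating a cryptographically strong one on first use."""
    global _salt
    if _salt is None:
        # 16 bytes = 128 bits from the OS CSPRNG via the secrets module.
        _salt = secrets.token_bytes(16)
    return _salt
```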