Obfuscate legi number in database

fubu commented 9 years ago

For privacy reasons, the matriculation (legi) number is not supposed to be used together with the N.ETHZ name. (See, for example, the exam results for Basis/Block exams posted in front of the department office.)

At AMIV, we have long collected the matriculation number along with the N.ETHZ name for signups (but without any validation). We never really needed it, except when scanning the barcode on the legi card (which encodes the matriculation number), e.g. in the GV tool. That lookup is always one-way: given a legi number, determine the user (not even the N.ETHZ name) behind it. Does anyone have a reasonable use case for the opposite direction?

My Proposal: Obfuscate the legi number inside our database in some way (treat it like a password).

It prevents ordinary users from seeing/spying on other users, even accidentally when browsing through the user table.
It can add an additional layer of (data) security: When an attacker gains access to the database only, he does not get the nethz<->legi mapping for free.
It does not really hurt our API: The legi->nethz lookups are not so common.

For the obfuscation mechanism: My first idea was to use off-the-shelf encryption (e.g. AES) with an application-wide password loaded from the configuration (which needs to be protected from being readable by other users on the ISG servers anyway!). There might be better approaches (maybe PBKDF2?), but first: What is your opinion?

(Please consider that we open-sourced this, and people might use this software without thinking about such things - I think it's worth it to go this extra mile for future users.)

fubu commented 8 years ago

Now that we have transparent password hashing in place, I want to bring this issue up again: Is there a reason not to one-way-hash/obfuscate the legi number in the database (treat it like a password)?

Thanks to ff15c32, the implementation is now straightforward and dead-simple.

If we do as proposed:

it is still possible to search for a given legi (like it's possible to verify a password: you'd need a user id or some other second authentication factor to make the search efficient)
it gets much more difficult to just browse the users for legi numbers
it is always possible to use the LDAP directly when a broad search is really needed
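
To illustrate the "treat it like a password" idea from the list above, here is a minimal sketch that stores a salted PBKDF2 hash of the legi number and checks a scanned number against it. It is not amivapi's actual code: the function names `hash_legi` and `verify_legi`, the iteration count, and the example legi number are assumptions made for illustration.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

# Assumed work factor for this sketch; not taken from the thread.
ITERATIONS = 100_000


def hash_legi(legi: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for a legi number; a fresh random salt is used if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", legi.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_legi(legi: str, salt: bytes, digest: bytes) -> bool:
    """Check a candidate legi number against a stored (salt, digest) pair."""
    _, candidate = hash_legi(legi, salt)
    # Constant-time comparison, as one would do for password hashes.
    return hmac.compare_digest(candidate, digest)


# Hypothetical stored record for one user.
stored_salt, stored_digest = hash_legi("09-123-456")
matches = verify_legi("09-123-456", stored_salt, stored_digest)
```

With a per-user salt, finding the owner of a scanned legi without a user id means running `verify_legi` against every user record, which is the efficiency caveat mentioned in the first point.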

I currently don't know of a (larger scale) use case which would require the plaintext legi number in the database. What am I missing @Leonidaz0r, @marcoep, @NotSpecial?

cburchert commented 8 years ago

I still don't really get how it is an advantage. If the API can decode it, there is no real use to encrypt it, as someone getting database access will probably also be able to abuse the API. He can just use his database access to make himself a root user in the API and extract whatever he wants. We use the legi number to map legi barcodes to people at ESF and at GV. If we don't want to program those tools directly against LDAP (which I would not recommend, as we care in those cases whether those people are AMIV members, meaning we would need to implement both interfaces and map between them as well) then we must store the legi number or get it on demand. In both cases the API is able to map legi numbers, therefore anyone with database access can extract users based on their legi number.

I am not sure I understand how your encryption method should work. I think you want to encrypt the legi numbers with a key, which is stored in the API configuration so a database dump alone would not be enough to extract them. Is that correct?

fubu commented 8 years ago

My main motivation (quote from above):

For privacy reasons the matriculation (legi) number is not supposed to be used/requested together with the N.ETHZ name.

and:

(Please consider that we open-sourced this, and people might use this software without thinking about such things - I think it's worth it to go this extra mile for future users.)

Although we might get the data privacy stuff right (and I doubt it when I look back at my years in the AMIV board with much harder access to this information), I'm not so sure about other organizations using our code. (If I understood Oli correctly, there are already other student associations interested in our tools.)

I think the simplest countermeasure is to just not include the legi number in the output of /users, except maybe for admins. (Oli told me that sometimes parts of the legi number might be of interest, e.g. the immatriculation year, so an obfuscated number like 09-***-*** could be enough for admins.) Searching for a specific legi number should also only be allowed for admins.
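
A minimal sketch of that countermeasure: drop the legi number from the user output for ordinary users and show admins only the masked form. The field names `legi` and `nethz`, the helper names, and the assumed format (eight digits, optionally hyphenated as YY-NNN-NNN) are illustrative assumptions, not the API's actual schema.

```python
import re

# Assumed legi format: eight digits, optionally written as YY-NNN-NNN.
LEGI_PATTERN = re.compile(r"^(\d{2})-?(\d{3})-?(\d{3})$")


def mask_legi(legi: str) -> str:
    """Keep only the leading two digits (the year part) and mask the rest."""
    match = LEGI_PATTERN.match(legi)
    if match is None:
        raise ValueError("unexpected legi format")
    return f"{match.group(1)}-***-***"


def user_view(user: dict, is_admin: bool) -> dict:
    """Return the user record as it would be shown to the requesting party."""
    # Never pass the raw legi number through.
    view = {key: value for key, value in user.items() if key != "legi"}
    if is_admin and user.get("legi"):
        view["legi"] = mask_legi(user["legi"])
    return view


# Hypothetical user record.
record = {"nethz": "example", "legi": "09-123-456"}
public_view = user_view(record, is_admin=False)
admin_view = user_view(record, is_admin=True)
```

Here `public_view` contains no `legi` key at all, while `admin_view` carries only the masked value.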

Please note that these measures do not even defend against an adversary, but simply against innocent "data snooping" by members of the organization (it happens, I saw it).
The next step would be to protect against database dumps revealing this information to a malicious attacker. This can be done by only having encrypted legi numbers in the database, decrypted on demand by the application. I'm thinking about one-way-hashes with an application-wide salt/secret.
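
One way to read "one-way hashes with an application-wide secret" is a keyed hash (HMAC) of the legi number. Because the result is deterministic, it can be stored and indexed, so a scanned barcode can still be mapped to a user without the plaintext being in the database. The sketch below is an assumption about how this could look, not an existing implementation; `legi_token`, `find_user`, the placeholder secret, and the in-memory index are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Optional


def legi_token(legi: str, secret: bytes) -> str:
    """Return a keyed one-way hash of the legi number, hex-encoded."""
    # Normalize so "09-123-456" and "09123456" give the same token.
    normalized = legi.replace("-", "")
    return hmac.new(secret, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def find_user(scanned_legi: str, secret: bytes, index: Dict[str, str]) -> Optional[str]:
    """Look up the user for a scanned legi number via its token."""
    return index.get(legi_token(scanned_legi, secret))


# Placeholder only; in practice the secret would come from the application configuration.
SECRET = b"application-wide-secret-from-config"

# Hypothetical stand-in for a database column holding tokens instead of legi numbers.
users_by_token = {legi_token("09-123-456", SECRET): "example_user"}

found = find_user("09123456", SECRET, users_by_token)
```

The protection rests entirely on the secret: legi numbers come from a small number space, so anyone holding both the database and the secret could recompute the tokens for every possible number.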

The second part is only icing on the cake. My primary goal is to restrict access to the information; the hashing/encryption could be added later/transparently without breaking backwards compatibility.

(If an attacker gains access to both the source and the database, then the only solution would be to not store the information in the first place. This could be achieved by transparently doing an LDAP search behind the scenes and mapping the result back to a user in the database. But since the attacker would also have the LDAP password, he has access to all the data anyway.)

amiv-eth / amivapi

Obfuscate legi number in database #73