NYUCCL / psiTurk

An open platform for science on Amazon Mechanical Turk.
https://psiturk.org
MIT License

psiTurk and IRBs #175

Open agardony opened 9 years ago

agardony commented 9 years ago

Institutional Review Boards (IRBs) may consider Turk workerIDs personally identifiable information (PII). It was recently reported that in many cases googling a workerID leads to that worker’s public Amazon profile page, which can include personal information such as Amazon wish lists.

See this article http://www.theverge.com/2013/3/7/4075810/amazon-mechanical-turk-users-study-finds-half-have-public-profiles and the linked blog post for discussion of these issues. http://webcache.googleusercontent.com/search?q=cache:crowdresearch.org/blog/%3Fp%3D5177&hl=en&strip=1

Here is an example of one of these public profiles: http://www.amazon.com/gp/pdp/profile/A3IZSXSSGW80FN

If workerIDs can be considered PII, there may be consequences for getting psiTurk-based experiments approved by an IRB.

IRBs often have restrictions about pairing PII directly to data for confidentiality reasons. My IRB, for example, requires that data be kept in a separate file from PII and that the two can only be linked via an intermediate code. Currently psiTurk saves workerIDs in the same database as the data, which does not satisfy this requirement.

Perhaps psiTurk’s data saving procedure should be modified to comply with this common IRB requirement. I suggest the following.

AssignmentIDs are a good candidate for an intermediate code since they are randomly generated by Amazon and represent the one-off pairing of a specific worker to a specific HIT. psiTurk could save workerIDs (PII) and assignmentIDs to a separate database. Ideally this would be a secure password-protected database, such as a remote MySQL db. This database would be queried by psiTurk in order to keep track of participants and disallow repeat participation from the same worker. It would also be used for various psiTurk functions such as paying all workers who completed the HIT, bonusing workers, etc.

psiTurk would then save data paired to assignmentIDs in a separate database. This database need not be as secure as the PII database (it could be a locally stored SQLite one, for example). The download_data_files command would parse and download data from this database so that all data that experimenters or research assistants review would be paired only to assignmentIDs and not workerIDs (PII). In this way confidentiality could be preserved while still maintaining psiTurk’s core functionality.

ghost commented 9 years ago

:+1: This is a super important consideration, especially for research institutions.

@RocChi tried something really interesting in this space: https://github.com/RocHCI/mturk-consent

gureckis commented 9 years ago

Thanks for bringing up this issue. I've been thinking a bit about this. However, I'm not sure how you can block the same worker from doing an experiment twice without recording the workerID with the data. To make this clear, I'll give an example from how psiTurk actually works...

So, if a worker accepts a task on psiTurk they might read through the instructions. At this point they could return the hit and decide not to do the experiment. Later on they might change their mind and accept another hit for the same experiment (as part of the same HIT or a different one). psiTurk would check this workerID against the experiment's database and see that while this worker exists and consented to the study already, they did not hit the "manipulation" part of the task and so they are ok to continue. This ability to "check" if a worker got far enough into an experiment to be "exposed" to the manipulation requires some way to link the datafile to the individual worker at the time of recruitment.

I agree that after the experiment is over all one needs to know is that the worker already did the task, and you don't need to see their datafile to judge that. However, it isn't clear to me in a general way how psiTurk could separate these two things. For example, putting the workerID in a separate database table from the data itself means that it only takes a simple SQL "Join" to connect them back up again. At that point they might as well be in the same database table.

The only way to do this in a convincing way would be if the workerId was somehow hidden from the experimenter by psiTurk itself (e.g., by the psiTurk cloud). However, I'm not sure how an experimenter would be able to communicate to the cloud across lots of different experiments which subsets should block people, etc... right now that is sort of handled by the experimenter themselves.

It's also worth pointing out that psiTurk isn't unique in associating workerIDs with the data. For example, many people store the data from their task directly in a field on the AMT website associated with each worker. Thus, there is often some link between the data and the workerID.

If you use psiTurk on a MySQL database the data is protected with a password. The database itself links workerIDs to the data fields. However, if the researcher wants they can download a text or CSV file of this database with the unique identifier removed for analysis and data sharing. Maybe your proposal could be implemented easily by just changing the download_data_files command such that the workerID is not output in this text file by default. Thus, the SQL database that is running within psiTurk is your password-protected "lookup" book that can connect workers to datafiles but in general daily practice people can avoid passing that data around widely by using the "cleaned" version.

Also, my understanding is that it is not against most IRBs to collect PII. Indeed, we often collect PII for tax purposes when paying people. The problem is that many researchers using AMT claim in their proposals that the workerID is not PII, and that may not be true. The fact that it can in some instances be linked back to Amazon profiles shouldn't be a problem as long as the researcher notes the steps taken to ensure the protection of the workerID information (for example, the procedure I just described of separating it prior to analysis and protecting the linked data with a password). The benefit here is that the association of these two data points ensures that the same worker cannot complete the task twice after being exposed to the manipulation, which is scientifically important for many designs.

agardony commented 9 years ago

I think modifying the download_data_files command would be the easiest way to address this issue. My lab was able to get IRB approval for our psiTurk-based study after I wrote a script that downloaded the data paired to assignment ID instead of worker ID.

Since this data file is the one that would be passed around the lab for analysis, it is good that it doesn't have any PII (worker IDs). Worker IDs are still paired to data in our MySQL database, but because it is password protected the IRB didn't care.
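For anyone who wants to try the same thing, here is a minimal sketch of that kind of export step. It is not the actual script mentioned above. It assumes that the first column of each exported row is a uniqueid of the form workerId:assignmentId (check your own data files before relying on this), and the function name strip_worker_ids is hypothetical.

```python
import csv
import io


def strip_worker_ids(src, dst):
    """Copy CSV rows from src to dst, keeping only the assignment ID.

    Assumption (not confirmed in this thread): the first column of each
    row is 'workerId:assignmentId'. Rows whose first column has no colon
    (for example a header row) are copied unchanged.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        if row and ":" in row[0]:
            # Drop everything up to and including the first colon.
            row[0] = row[0].split(":", 1)[1]
        writer.writerow(row)


if __name__ == "__main__":
    # Made-up IDs, for illustration only.
    raw = io.StringIO("AEXAMPLEWORKER1:3EXAMPLEASSIGNMENT,0,1613761860,ok\n")
    cleaned = io.StringIO()
    strip_worker_ids(raw, cleaned)
    print(cleaned.getvalue(), end="")
```

The password-protected database stays the only place where worker IDs and data are linked; the cleaned file is what gets shared.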

pfeyz commented 9 years ago

I'm not sure how you can block the same worker from doing an experiment twice without recording the workerID with the data.

A possible technical solution to hiding the workerId from the psiTurk db while allowing it to check for duplicate participants is to store salted hashes of the workerID instead of the workerID itself.

If anybody's not familiar with hashing, a good hash function is in theory a one-way function, so there's no easy way to go from the hashed string, f(s), back to the string it was generated from, s, besides running every possible legal workerId through the hash function and looking for f(s) in the output.
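A rough sketch of the idea in Python, under some assumptions. It swaps in a keyed hash (HMAC-SHA256 from the standard library) with one secret key in place of a per-record salt, because the duplicate check needs the same worker to produce the same digest every time. The function names and the example IDs are hypothetical, and none of this is psiTurk code.

```python
import hashlib
import hmac


def hash_worker_id(worker_id, secret_key):
    """Return the hex HMAC-SHA256 digest of worker_id under secret_key (bytes)."""
    return hmac.new(secret_key, worker_id.encode("utf-8"), hashlib.sha256).hexdigest()


def has_participated(worker_id, seen_hashes, secret_key):
    """Check a worker against stored digests without storing raw IDs."""
    candidate = hash_worker_id(worker_id, secret_key)
    # compare_digest avoids leaking information through comparison timing.
    return any(hmac.compare_digest(candidate, h) for h in seen_hashes)


if __name__ == "__main__":
    # Assumption: the key is kept somewhere other than the experiment database.
    key = b"replace-with-a-secret-kept-outside-the-database"
    seen = {hash_worker_id("AEXAMPLEWORKER1", key)}
    print(has_participated("AEXAMPLEWORKER1", seen, key))
    print(has_participated("AEXAMPLEWORKER2", seen, key))
```

Anyone holding both the key and the digests can still test candidate worker IDs one by one, so the key needs the same protection the raw IDs would have needed.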

deargle commented 8 years ago

+1 @pfeyz

decodyng commented 3 years ago

Resurrecting this, since it's currently a live issue for our project that uses psiTurk:

I understand that the maintainers may not have time to make this change, but I was wondering if someone more familiar with the library has thoughts on where one might be able to implement a hashed-ID solution (as suggested by @pfeyz ) and not interfere with the workers being paid properly (for which I'd assume their worker ID has to be present in its original, unhashed form).

deargle commented 3 years ago

"approving a worker" actually only requires the assignment id, see approve_assignment, so the raw worker id wouldn't be needed there -- but the worker_id and the assignment_id are both required to send bonuses, see boto send_bonus

It's not about not having time (well, it sort of is), but more about, as pfeyz noted, the theoretical impossibility of going back from a hash to the worker id.

If you only need to deidentify the data, you can hash the unique_id in whatever csv you share with colleagues or whatever. Then you can always get that hash again by referencing your raw unique_id / worker_id / assignment_id
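To make the approve-versus-bonus difference above concrete, here is a small sketch. It assumes the boto3 MTurk client, whose approve_assignment call takes AssignmentId and whose send_bonus call takes WorkerId, AssignmentId, BonusAmount and Reason. The wrapper functions and the RecordingClient stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
class RecordingClient:
    """Hypothetical stand-in for boto3.client("mturk"); it only records calls."""

    def __init__(self):
        self.calls = []

    def approve_assignment(self, **kwargs):
        self.calls.append(("approve_assignment", kwargs))
        return {}

    def send_bonus(self, **kwargs):
        self.calls.append(("send_bonus", kwargs))
        return {}


def approve(client, assignment_id):
    """Approve an assignment; only the assignment ID is needed."""
    return client.approve_assignment(AssignmentId=assignment_id)


def pay_bonus(client, worker_id, assignment_id, amount_usd, reason):
    """Send a bonus; the raw worker ID is required here."""
    return client.send_bonus(
        WorkerId=worker_id,
        AssignmentId=assignment_id,
        BonusAmount=f"{amount_usd:.2f}",  # the API takes the amount as a string
        Reason=reason,
    )


if __name__ == "__main__":
    client = RecordingClient()  # swap in boto3.client("mturk") for real use
    approve(client, "3EXAMPLEASSIGNMENT")
    pay_bonus(client, "AEXAMPLEWORKER1", "3EXAMPLEASSIGNMENT", 0.5, "Task bonus")
    for name, kwargs in client.calls:
        print(name, sorted(kwargs))
```

So a store that keeps only hashed worker IDs could still approve work, but it could not pay bonuses.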

decodyng commented 3 years ago

If you only need to deidentify the data, you can hash the unique_id in whatever csv you share with colleagues or whatever. Then you can always get that hash again by referencing your raw unique_id / worker_id / assignment_id

I think the issue is that it'd be nice to be able to say that we never save worker_id to disk; saying instead that "we save it to disk but then de-identify it before passing it around/doing further analysis" still leaves us with more complicated IRB explaining than we'd ideally like =/

deargle commented 3 years ago

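Right, that doesn't work with the core underlying AMT api, assuming you ever want to pay a bonus. You could encrypt the worker_id instead of hashing it. Perhaps for your IRB purpose though, you could say that:

  • your database is encrypted (is it?)
  • access to your database is protected by a strong username / password that is stored in ____ secure way (is it?)
  • while the worker_id is passed around in the URLs, those urls are encrypted assuming you're using https -- they're encrypted by a key negotiated between the client (browser) and server (psiturk server, not the ad server) using state-of-the-art "military grade" (eyeroll, military uses the same security tech as everyone else) technology... except that's not true if you're using the psiturk ad server and redirecting back to your experiment server which listens only on http. This is one reason I like hosting my own stuff on heroku -- the whole thing is https encrypted. threat model moves to someone at heroku snooping, but that's a lesser threat than a coffee-shop snooper or something.

Warning, I teach information security management. But point is that there are mitigations to specific threat models that your IRB would probably be fine with. Everyone who does AMT research has the same issues -- not just psiturk users. So point is, rather than say you never store the worker_id, say that you store it securely. And if you share your data, say that you deidentify it. It's IRB-reasonable, in my experience.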

jacob-lee commented 3 years ago

If you write your own export code you can replace the worker IDs with new arbitrary IDs when data is distributed for analysis or to fulfill data sharing agreements. E.g., maintain a table with worker IDs and UUIDs, and let your export code first do a lookup for the worker ID, and if it doesn't exist, generate a new UUID.

Jacob
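A minimal sketch of the lookup-table approach described above, using SQLite and the standard uuid module. The table name id_map, its columns, and both function names are assumptions made for illustration; they are not part of psiTurk.

```python
import sqlite3
import uuid


def open_id_map(path=":memory:"):
    """Open (or create) the table that maps worker IDs to export IDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS id_map ("
        "worker_id TEXT PRIMARY KEY, export_id TEXT NOT NULL UNIQUE)"
    )
    return conn


def get_export_id(conn, worker_id):
    """Return the stable random export ID for worker_id, creating one if needed."""
    row = conn.execute(
        "SELECT export_id FROM id_map WHERE worker_id = ?", (worker_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    export_id = str(uuid.uuid4())  # random, so it carries no trace of the worker ID
    conn.execute(
        "INSERT INTO id_map (worker_id, export_id) VALUES (?, ?)",
        (worker_id, export_id),
    )
    conn.commit()
    return export_id


if __name__ == "__main__":
    conn = open_id_map()
    first = get_export_id(conn, "AEXAMPLEWORKER1")
    again = get_export_id(conn, "AEXAMPLEWORKER1")
    print(first == again)
```

Because the UUIDs are random rather than derived from the worker ID, the shared data cannot be mapped back without access to the table itself.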

On Fri, Feb 19, 2021, 2:51 PM Dave Eargle notifications@github.com wrote:

Right, that doesn't work with the core underlying AMT api, assuming you ever want to pay a bonus. You could encrypt the worker_id instead of hashing it. Perhaps for your IRB purpose though, you could say that:

  • your database is encrypted (is it?)
  • access to your database is protected by a strong username / password that is stored in ____ secure way (is it?)
  • while the worker_id is passed around in the URLs, those urls are encrypted assuming you're using https -- they're encrypted by a key negotiated between the client (browser) and server (psiturk server, not the ad server) using state-of-the-art "military grade" (eyeroll, military uses the same security tech as everyone else) technology... except that's not true if you're using the psiturk ad server and redirecting back to your experiment server which listens only on http. This is one reason I like hosting my own stuff on heroku -- the whole thing is https encrypted. threat model moves to someone at heroku snooping, but that's a lesser threat than a coffee-shop snooper or something.

Warning, I teach information security management. But point is that there are mitigations to specific threat models that your IRB would probably be fine with. Everyone who does AMT research has the same issues -- not just psiturk users. So point is, rather than say you never store the worker_id, say that you store it securely. And if you share your data, say that you deidentify it. It's IRB-reasonable, in my experience.


jacob-lee commented 3 years ago

I don't know. For ordinary in-lab studies PID is regularly collected and stored somewhere. How else would consent get done?

Jacob

On Fri, Feb 19, 2021, 2:28 PM Cody Wild notifications@github.com wrote:

If you only need to deidentify the data, you can hash the unique_id in whatever csv you share with colleagues or whatever. Then you can always get that hash again by referencing your raw unique_id / worker_id / assignment_id

I think the issue is that it'd be nice to be able to say that we never save worker_id to disk; saying instead that "we save it to disk but then de-identify it before passing it around/doing further analysis" still leaves us with more complicated IRB explaining than we'd ideally like =/
