Scifabric / pybossa

PYBOSSA is the ultimate crowdsourcing framework (aka microtasking) to analyze or enrich data that can't be processed by machines alone.
http://pybossa.com
GNU Affero General Public License v3.0

Research projects may want to keep results data private for a given time #218

Closed teleyinex closed 10 years ago

teleyinex commented 12 years ago

Problem: researchers would worry about a variety of things - from people trying to actively manipulate the data based on what others have put in, to people using the data without discussing it with the researchers - if everyone had access to it from the beginning. Solution: give the owners of the application an option to hide the data from web access and the API.

@PyBossa This issue could be really interesting to discuss: this option could lead to projects where researchers never release any data, even after a given period of time, but not offering it could mean that other researchers never consider the platform because they cannot control the data at all.

I guess this is a question where we need to talk with the scientists and see how we can agree on a middle-ground solution, where the scientists have control over the data at first, for publishing papers, etc., but at the same time the contributions made by the volunteers are respected and returned to them in one way or another. For example, PyBossa could provide an option to keep the data private with a maximum deadline of one year. After that deadline is met, the data would be made public, so the user contributions would not be lost.
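As a rough illustration of the one-year cap described above, a minimal Python sketch might look like this. The names `MAX_EMBARGO`, `embargo_end` and `is_public` are hypothetical and are not part of PyBossa; the sketch only assumes that each project records a creation time and a requested embargo length in days.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cap, taken from the "maximum deadline of one year" idea.
MAX_EMBARGO = timedelta(days=365)


def embargo_end(created, requested_days):
    """Return when the data becomes public, never later than the cap."""
    requested = timedelta(days=max(requested_days, 0))
    return created + min(requested, MAX_EMBARGO)


def is_public(created, requested_days, now=None):
    """Return True once the (capped) embargo period has expired."""
    now = now or datetime.now(timezone.utc)
    return now >= embargo_end(created, requested_days)
```

An owner who asks for a longer embargo than the cap would simply get the cap, so volunteer contributions would still be released eventually.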

This is really important because, IMHO, there is no simple solution for this. We may find researchers who are OK with sharing the data from the beginning, but there will be others who are reluctant but still want to use the platform. If they do not have a bit of control over their data, they will go away and we will miss the opportunity to explain to them why it is important to share the data from the very first moment. I guess PyBossa could be used as a tool to get researchers more open to open science and open data if we provide a good set of tools for all of them.

We may also need to consider that some experiments run by the researchers will actually need to not show any data until all the tasks are completed, so we basically need to discuss this a bit more with scientists and researchers.

In summary, I would like to open a discussion about this issue and how you think we should approach it.

rufuspollock commented 12 years ago

I think there are 2 separate things:

  1. What PyBossa as software can support in terms of private features
  2. What PyBossa.com supports

My answers:

  1. I think it is worth at least speccing what the minimum private data feature is. My guess: taskruns can be set to be visible only to the app owner and the taskrun doer (+ sysadmins of course). One would do this via a checkbox on the app which says something like: keep data private.
    • Note I would argue against making the feature more complex (e.g. trying to build in stuff like time limits technologically - you can just handle these via normal discussion).
  2. I think a policy of allowing time-limited privacy is an option (you make a good case for this). I think it is something where people would need to make a good case and there would be a time limitation.
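
The minimal rule in point 1 (owner, contributor and sysadmins only, switched on by a checkbox) could be sketched roughly as follows. This is only an illustration: the `App` and `TaskRun` classes, the `keep_data_private` flag and `can_view_taskrun` are hypothetical names, not PyBossa's actual model or API.

```python
from dataclasses import dataclass


@dataclass
class App:
    owner_id: int
    keep_data_private: bool = False  # hypothetical "keep data private" checkbox


@dataclass
class TaskRun:
    app: App
    user_id: int  # the contributor who submitted this taskrun


def can_view_taskrun(taskrun, viewer_id=None, is_admin=False):
    """Private apps: only the owner, the contributor and admins may view."""
    if not taskrun.app.keep_data_private:
        return True  # public app: anyone, including anonymous visitors
    if is_admin:
        return True
    if viewer_id is None:
        return False  # anonymous visitors never see private taskruns
    return viewer_id in (taskrun.app.owner_id, taskrun.user_id)
```

The same check would presumably have to guard both the web views and the API, since the original proposal asks for the data to be hidden from both.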

teleyinex commented 12 years ago

Hi Rufus,

For a quick hack, I think what you propose in point 1 for the PyBossa framework is good and simple. I've contacted the researchers and invited them to add their own comments here before we take any action. So let's wait and see what they think, but I think that allowing them to "hide" the data in this way will be good for them.

The time policy is an idea that I wanted to introduce, not to develop right away. The good thing is that I like that you like it :-)

tfmorris commented 11 years ago

It's a good idea to separate mechanism & policy. Policy should be decided by app creators or site administrators. In my opinion, attempting to enforce policy in an open source product is a losing battle (in addition to just being a bad idea to start with).

Researchers are going to want to control "their" data as long as possible (while taking tax dollars and free volunteer labor to create "their" data in the first place), but it's not the place of a tool to attempt to change that. Policy discussions should take place with the research funders.

rossmounce commented 11 years ago

As originally mentioned there were 2 different potentially undesirable problems identified WRT data openness:

  1. people trying to actively manipulate the data based on what others have put in
  2. [other] people using the data without discussing/clearing this usage with the researchers

Problem 1 is a real problem IMO, and I can see it arising in two particular ways. A) Users / spiteful research competitors might want to sabotage the data in a particular direction. B) The researchers themselves might want the data to show 'stronger, more statistically significant effects' and want to bias the collected data themselves, based upon a sneak-peek inspection midway through the experiment. Of course they can do this a priori without having seen any of the data, but seeing how the data comes in may give them further insight into how to introduce 'realistic' fake/fraudulent user data.

Case B highlights that, for the most stringent tamper-proof experimental design, the data/results should really be closed/non-accessible for the duration of the experiment, even to the researchers who created the app (cf. "double blind trials"). Perhaps just provide them open metadata, e.g. how many different users so far, number of unique IP addresses so far, etc.
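A minimal sketch of such an "open metadata only" view is shown below. It assumes each taskrun is available as a dict with hypothetical `user_id` and `ip` keys (this is not PyBossa's actual schema), and it deliberately returns counts only, never the answers themselves.

```python
def blind_summary(taskruns):
    """Summarise taskruns without revealing any submitted answers.

    Each taskrun is assumed to be a dict with hypothetical keys
    'user_id' (None for anonymous contributors) and 'ip'.
    """
    users = {t["user_id"] for t in taskruns if t.get("user_id") is not None}
    ips = {t["ip"] for t in taskruns if t.get("ip")}
    return {
        "taskruns": len(taskruns),
        "distinct_users": len(users),
        "unique_ips": len(ips),
    }
```

During a blind experiment, the owner's dashboard could show only this summary until all tasks are completed.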

Problem 2 seems to be the classic 'scooping' problem, albeit in a data setting. I don't necessarily see this as a real problem of the software though. Scooping is a problem of ethics and of the research communities, editorial boards and societies that should police and enforce ethics & codes of conduct. If researchers set up an experiment on crowdcrafting and some other person(s) scoop / publish an analysis of the data from it without acknowledging its provenance or getting the permission of the original researchers -- that's seriously unethical behaviour and poor scholarship. Furthermore, I would think it would be easy to prove that the data was used without permission because of the openness of the project.

Therefore whilst problem 2 is theoretically possible, and some researchers would perhaps perceive this to be problematic - I don't think it's a problem of the software - it's a people problem, and one that community norms should prevent from happening in the first place, and punish if it does happen to occur. Some light 'security through obscurity' might help alleviate perceptions of danger here - leave the data open, but don't make it easy to find perhaps(?)

PS something to think about for the long term: DOIs seem to have mythical status in science. If you assign something a DOI, like an arXiv pre-print or pre-publication data on figshare, no-one will dare to use/scoop it without permission. So if you're looking for alternate ways of reassuring users that they won't be scooped, DOIs are pretty bulletproof in that respect. In reality it shouldn't need a DOI, and they aren't all that special in many respects, but it's a perception thing y'know ;)

teleyinex commented 11 years ago

@rossmounce thanks a lot for your ideas! I like them a lot!!

I think that what we can do is the following:

The double blind trials are a really good idea that we should discuss a bit more. As you have said, it will help with the quality of the analysis, but we will have the same problem discussed here: some researchers will not like the data becoming publicly available once the trials are finished. I think your idea of using a DOI is really valuable, as it should provide a "visual clue" that you cannot use the data unless you cite it. So how about this: if you choose a "double blind" type project, once the trial has finished PyBossa requests a DOI for your data set?

What do you think?

rossmounce commented 11 years ago

A choice of two types of experiments is definitely a good idea. For many the double blind may be unnecessary; it's only needed if human/user psychology/perception is being tested. If it's just transcribing PDFs or other such tasks, then that complication clearly isn't necessary and is perhaps obstructive.

DOIs have a cost though. To avoid PyBossa paying this cost, perhaps the final completed dataset could be automatically deposited on figshare on behalf of the user by PyBossa (doing so will provide a DOI via figshare). It would be nice to make this final completed dataset completely uneditable when publicly exposed. This is the 'raw' data collected, no fiddling/tampering possible. The researchers can do whatever post-processing & filtering they want on this dataset, but the original remains untamperable by them and available for reviewers / other researchers via the permanent DOI link. Figshare has an API by which I believe data can be deposited programmatically.

Alternatively, dump the end 'raw' untampered dataset on http://datahub.io/ ?
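
Wherever the raw dataset ends up, one simple way to make later tampering detectable is to publish a checksum alongside it. This is not something proposed in the thread and it does not use the figshare or datahub APIs; it is only a sketch, with a hypothetical `dataset_fingerprint` helper, of how the "untamperable original" idea could be checked by reviewers.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest of a canonical JSON dump of the records."""
    # sort_keys and fixed separators make the dump independent of key order
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Anyone who downloads the deposited dataset could recompute the digest and compare it with the one published when the experiment closed.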

teleyinex commented 10 years ago

Closing as we will integrate this type of knowledge in issue #694