DDD-Moore / early-career-hawaii

Data sharing & licensing data #12

Open strasser opened 7 years ago

strasser commented 7 years ago

Discussion/hacking on using open data and sharing data for others to use

dhimmel commented 7 years ago

I'm interested and can discuss the importance of openly licensing data.

lederman commented 7 years ago

I suggest we add to this discussion the reasons why people don't share code and data, and see for which problems we have solutions, and which are legitimate concerns.

We can also talk about the "dark pool" of failed experiments and failed processing methods. If we are going beyond the paper into the raw work that led to it, these things are as important as the successful methods, but they are even more difficult to document and share. I have published some examples of cases where I get one of my algorithms to fail, but these are the first thing I have to cut when I condense a technical report into a paper. I have seen some reviews of such things (probably the most important paper I've seen in one area), but these are done later, usually refer to other people's work, and are rare. I wonder if people have found good ways to share these "failed" experiments/pipelines. To put this in a larger context, it is related to how other disciplines use "cases" and "failures" to learn, and to some dangers in open benchmarks.

In the context of algorithms (and the data associated with them), I mentioned CodeOcean in the previous meeting; I've been participating in their closed beta. As a user (reader) of a paper/algorithm, you can see all the data and code online, change it, and run it without installing anything locally. I asked them for invitations for the participants. Would there be a good way to distribute those invitations to anyone who is interested? (This part is also related to topics #11 and #6.)

dhimmel commented 7 years ago

@lederman I read a bit about Code Ocean -- it seems more focused on code than on data. I'm also interested in the reproducibility of computation. But I'm not too keen on spending conference time learning how to use a service that's not currently publicly available.

lederman commented 7 years ago

@dhimmel Regarding conference time and CodeOcean: it will be public soon, but I agree. That's why I haven't asked to present a demo, just to share the invitation with anyone who wants to try it before it is public.

Regarding data vs. code vs. reproducibility: I agree that CodeOcean is not a generic data repository or data exploration tool. In addition to reproducibility, I think it gives you an easy way to start working with published data, because you can play with the entire pipeline used to generate the data, which is why I mention it in this context. I wouldn't mind classifying these issues under other categories.

lederman commented 7 years ago

@dhimmel It turns out that there are two CodeOceans... I'm talking about a different ocean, out of Cornell Tech; the information about them isn't public yet. Sorry, I didn't notice the link you sent, so I only realized this now.

strasser commented 7 years ago

Turning this into a licensing discussion and having Daniel lead.

strasser commented 7 years ago

Another angle: Ethics of Data Science (and data sharing)