bkatiemills / blog

http://billmills.github.io/blog/
MIT License

Comments on 'Truly Open Data' #1

Open bkatiemills opened 9 years ago

bkatiemills commented 9 years ago

Please reply to this thread with your comments on this post.

tleeuwenburg commented 9 years ago

Could you set up a BitTorrent tracker for data science data sets?

bkatiemills commented 9 years ago

@tleeuwenburg I've actually always liked that idea (doesn't require centralized hardware, no software development to do, 'versioning' through hashes etc), but it never really caught on, and I never really heard a convincing reason why not.

That said, torrents are just a distribution system - a big other issue to deal with is making that data easy to use and understand once it's sitting on your hard drive, which IMO is the real barrier to open data.
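To make the "'versioning' through hashes" idea concrete, here is a minimal sketch, assuming a dataset is a single file on disk and that its version is simply the SHA-256 digest of its bytes. The function name `dataset_version` and the chunk size are hypothetical choices for illustration; this is a whole-file digest, not BitTorrent's own piece-hashing scheme.

```python
import hashlib
from pathlib import Path


def dataset_version(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`.

    Hypothetical helper: the digest serves as a version identifier,
    so any change to the file's bytes yields a different version.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in fixed-size chunks so large datasets need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two people holding byte-identical copies would compute the same identifier, and any edit to the file would produce a new one, which is roughly what lets a hash stand in for a version number.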

tleeuwenburg commented 9 years ago

One answer to your question is "schemas and standards". Another might be peer-reviewed data. I wonder if part of the issue is that a lot of the open data is coming from the science world rather than the engineering world? While you're looking at the legibility of data, I do still find transport and storage is a problem.

bkatiemills commented 9 years ago

You're certainly right that transport and storage are still a problem; nuclear physics, astronomy and, increasingly, genomics at least are producing datasets that aren't feasible to transmit. But a great many fields don't have such unwieldy datasets; we can work at the problem from both ends.

I'd like to better understand your ideas on the distinction between scientific and engineering cultures in this regard - I would have guessed they would suffer from the same challenges, but I'd be delighted to learn more.

tleeuwenburg commented 9 years ago

The difference to my mind is most evident in the 'open source' culture, where there is a strong concept of building shareable, re-usable, multi-use content. Science is more focused on individual results rather than on making repeated use of data. Therefore, science content is harder to use, in general, than engineering content. Data needs a stronger focus on re-usability and shareability.

bkatiemills commented 9 years ago

That's really interesting, and spot on; I'd love to see a blog post / lesson / lecture notes / whatever you prefer on lessons the sciences could learn from engineering culture on making that shareable, re-usable, multi-use content; I think there could be some great insights there!

tleeuwenburg commented 9 years ago

I think that's a good idea. Do you have an example dataset which you think would be a good basis, and be of common, general interest? If you'd be interested, I'd be happy to work up an example of what I mean, and then maybe get some help from you on whether you think it's a good approach? I would just want to make sure we are using good, open data with no major IP restrictions and which is clearly of common use. Archival or realtime, either is fine.

tleeuwenburg commented 9 years ago

Hey, so I created a pretend tool called 'odit' -- the Open Data Integration Tool, and wrote documentation for what it should do. http://odit.readthedocs.org/en/latest/ ... this is my imagining of what data sharing needs.

tleeuwenburg commented 9 years ago

Also, http://myownhat.blogspot.com.au/2015/07/shareable-datasets-functional-design.html

bkatiemills commented 9 years ago

Interesting, odit is definitely on the right track on the distribution side - have you checked out the dat project? It would be interesting to dig into how well that project conforms to odit's recommendations; at first glance I bet there will be some substantial overlaps.

As for datasets, there are a couple of options out there that could be interesting. I've been playing around a little with some environmental data from the British Columbian government; also of enormous popular interest is genomics data from NCBI and others; have a look at these steps to acquire some data being used in a real and ongoing genomics project.