datatogether / dataset_registries

Tracking the design and implementation of the metadata registries we will use to track rescued datasets
Creative Commons Attribution Share Alike 4.0 International

Compare the Metadata that Different Groups have about the Datasets They have Rescued #1

Open · flyingzumwalt opened this issue 7 years ago

flyingzumwalt commented 7 years ago

In order to design a simple, practical, first-pass metadata format for tracking these datasets, we should compare the metadata that the different groups have about the datasets they've rescued.

Information we will want the registries to keep track of:

Note: a lot of datasets have been downloaded multiple times by different people. We need to represent "these are both versions of the same dataset" without losing info about where they are and who downloaded them.
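
For example (purely illustrative, not a proposed format; every field name here is a placeholder), a single registry entry might represent one dataset with several independent copies, along these lines:

{
  "dataset_id": "registry-assigned identifier",
  "name": "human-readable name of the dataset",
  "source_url": "the URL the dataset was originally published at",
  "copies": [
    {
      "downloaded_by": "person or group who captured this copy",
      "date_of_capture": "when it was captured",
      "checksum": "hash of the captured files",
      "held_at": "where this copy currently lives (institution, server, or p2p address)"
    },
    {
      "downloaded_by": "a different person or group",
      "date_of_capture": "a different capture date",
      "checksum": "hash of their (possibly different) capture",
      "held_at": "wherever that copy lives"
    }
  ]
}

The point is just that the dataset gets a single identity while each capture keeps its own downloader, date, checksum, and location, so nothing is lost when two groups turn out to have rescued the same thing.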

flyingzumwalt commented 7 years ago

Here is a sample of the download stats that @maxogden collects for his downloads https://www.irccloud.com/pastebin/RgbAui2I/

{
  "url": "http://www.dot.gov/regulations/significant-rulemaking-report-archive",
  "date": "2017-01-29T03:34:05.844Z",
  "headersTook": 4232,
  "package_id": "27949aef-ad78-4a56-8d95-eb2f3943d3bf",
  "id": "366cf35f-24f0-4b14-ba5e-78fbccf8ab6c",
  "status": 200,
  "rawHeaders": [
    "Content-Language",
    "en",
    "Content-Type",
    "text/html; charset=utf-8",
    "ETag",
    "\"1485633771-1\"",
    "Last-Modified",
    "Sat, 28 Jan 2017 20:02:51 GMT",
    "Link",
    "<https://www.transportation.gov/regulations/significant-rulemaking-report-archive>; rel=\"canonical\",<https://www.transportation.gov/node/1485>; rel=\"shortlink\"",
    "Server",
    "nginx",
    "X-Age",
    "0",
    "X-AH-Environment",
    "prod",
    "X-Drupal-Cache",
    "HIT",
    "X-Frame-Options",
    "SAMEORIGIN",
    "X-Generator",
    "Drupal 7 (http://drupal.org)",
    "X-Request-ID",
    "v-c9206ff0-e5d3-11e6-983e-22000b4183e0",
    "X-UA-Compatible",
    "IE=edge,chrome=1",
    "X-Varnish",
    "505586209",
    "Cache-Control",
    "public, max-age=3503",
    "Expires",
    "Sun, 29 Jan 2017 04:32:28 GMT",
    "Date",
    "Sun, 29 Jan 2017 03:34:05 GMT",
    "Transfer-Encoding",
    "chunked",
    "Connection",
    "keep-alive",
    "Connection",
    "Transfer-Encoding",
    "Strict-Transport-Security",
    "max-age=31622400"
  ],
  "headers": {
    "content-language": "en",
    "content-type": "text/html; charset=utf-8",
    "etag": "\"1485633771-1\"",
    "last-modified": "Sat, 28 Jan 2017 20:02:51 GMT",
    "link": "<https://www.transportation.gov/regulations/significant-rulemaking-report-archive>; rel=\"canonical\",<https://www.transportation.gov/node/1485>; rel=\"shortlink\"",
    "server": "nginx",
    "x-age": "0",
    "x-ah-environment": "prod",
    "x-drupal-cache": "HIT",
    "x-frame-options": "SAMEORIGIN",
    "x-generator": "Drupal 7 (http://drupal.org)",
    "x-request-id": "v-c9206ff0-e5d3-11e6-983e-22000b4183e0",
    "x-ua-compatible": "IE=edge,chrome=1",
    "x-varnish": "505586209",
    "cache-control": "public, max-age=3503",
    "expires": "Sun, 29 Jan 2017 04:32:28 GMT",
    "date": "Sun, 29 Jan 2017 03:34:05 GMT",
    "transfer-encoding": "chunked",
    "connection": "keep-alive, Transfer-Encoding",
    "strict-transport-security": "max-age=31622400"
  },
  "downloadTook": 4579,
  "file": "5838d1071ae7c3fee63d2c425d89d799ff4e9bee6ec3f99643952a3a9267febe"
}
mejackreed commented 7 years ago

A piece of metadata I have: https://gist.github.com/mejackreed/cee25feea0c0b1d9602e38bc9479a61d

Files downloaded from resources are also accompanied by headers from the download.

dcwalk commented 7 years ago

pinging @b5 and @danielballan RE: DataRescue metadata

dcwalk commented 7 years ago

Also! Just to note: the vetting process and posting for the DataRefuge CKAN is handled by DataRefuge, so their input could speak to the areas of metadata that are generated through the vetting workflow (cc @rlappel and @jschell42, are you the right people to ping on this?)

flyingzumwalt commented 7 years ago

@b5 we've got two of them! The challenge here is to build metadata registries about what's there, regardless of which system is used to store the datasets. Those metadata registries will certainly contain ways to find the datasets over p2p networks, so we can also do things like coordinate clusters of nodes that want to replicate a given dataset. But we need the registry in order to support basic activities like keeping an inventory of which datasets we've rescued, their provenance, what's in them, and who's holding them.

flyingzumwalt commented 7 years ago

@b5 could you post an example of the metadata you capture for datasets downloaded at a #datarescue hackathon?

titaniumbones commented 7 years ago

@mejackreed is that metadata gist idiosyncratic to you, or is it produced in accordance with the standards of a wider community (climate mirror, azimuth, etc)?

ambergman commented 7 years ago

@flyingzumwalt - With @b5, @trinberg, and others, there have been some early conversations (and I'm confident I'm not the one who should be having or reporting on them) about how to coordinate nodes in the short term, so that replicated datasets can have additional metadata added locally but still all be able to reference one another and access those additions. Your comment about being agnostic to the storage system would definitely be a part of that. The long term should, of course, look different (and I won't even try to pretend I really fully understand IPFS here :) ), but it would be great to discuss the short term at the event today.

b5 commented 7 years ago

So, we have a bit of a problem, and we're going to need to rethink our practices if we're going to be able to coordinate properly with other archiving efforts. I have a solution in mind, but it has sweeping implications for our practices to date.

Here's what our current base-schema for metadata collection looks like:

{
  "Individual source or seed URL": "http://www.eia.gov/renewable/data.cfm",
  "UUID": "E30FA3CA-C5CB-41D5-8608-0650D1B6F105",
  "id_agency": 2,
  "id_subagency": null,
  "id_org": null,
  "id_suborg": null,
  "Institution facilitating the data capture creation and packaging": "Penn Data Refuge",
  "Date of capture": "2017-01-17",
  "Federal agency data acquired from": "Department of Energy/U.S. Energy Information Administration",
  "Name of resource": "Renewable and Alternative Fuels",
  "File formats contained in package": ".pdf, .zip",
  "Type(s) of content in package": "datasets, codebooks",
  "Free text description of capture process": "Metadata was generated by viewing page and using spreadsheet descriptions where necessary, data was bulk downloaded from the page using wget -r on the seed URL and then bagged.",
  "Name of package creator": "Mallick Hossain and Ben Goldman"
}

There are additional fields added by certain automated tools. All of this information is stored and coordinated by UUIDs; it's a bit spread out, but it will be trivial to assemble thanks to good old UUIDs.
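
For instance (a made-up example, not the actual output of any of our tools), a record added by an automated step would just carry the same UUID as the hand-entered record above, and the two can be joined on it:

{
  "UUID": "E30FA3CA-C5CB-41D5-8608-0650D1B6F105",
  "Tool-added field": "value produced by an automated step",
  "Another tool-added field": "another value"
}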

There's nothing wrong with the metadata we're gathering; the problem lies in the metadata we aren't gathering. Both of the examples from other organizations record a one-to-one relationship between a URL and its content. Our approach lacks this mapping, and that will prevent us from coordinating effectively.

In our current process, the "urls" brought into our pipeline actually represent a one-to-many relationship: each given page links to many sub-urls. When a volunteer logs into the app, they are given a url as a starting point and then download all of the content that the page links to. So in our current setup, one "url" will result in many static files. Because we don't dictate strict methods for how volunteers archive data, we have no dependable way of associating the data inside an uploaded zip archive with its url of origin.
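
To make the gap concrete, here is a rough sketch (not something we currently produce; the field names are placeholders, and the seed URL is just the one from the example above) of the kind of per-seed manifest that would restore that mapping:

{
  "seed_url": "http://www.eia.gov/renewable/data.cfm",
  "files": [
    {
      "linked_url": "a sub-url that the seed page links to",
      "path_in_archive": "where the downloaded file sits inside the uploaded zip",
      "checksum": "hash of that file"
    },
    {
      "linked_url": "another linked sub-url",
      "path_in_archive": "its path inside the zip",
      "checksum": "its hash"
    }
  ]
}

With something like that attached to each upload, we'd have the same one-to-one url-to-content mapping the other groups already record.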

This has some dramatic implications for our archiving process: namely, if we are going to coordinate with other efforts, we will need to archive data programmatically instead of through volunteer-driven downloads.

I want to say that while this may seem like a bad thing, I think it is in fact a very good thing. I think this is just the nudge we need to move away from having volunteers download data (a task that is quite frankly better performed by a computer) to having volunteers at our events contextualize data (an inherently human task). Instead of asking volunteers to engage in downloading, we would hand them already-archived data, and ask them to enrich the metadata & context that is lost in the archiving process. This absolves us of many chain-of-custody issues for the data itself, gives us higher-integrity data, and allows us to engage with the broader archiving community. It would give me great joy to ask a volunteer to learn about & document an already-archived dataset instead of spending hours troubleshooting s3 credentials.

With that, I'm heading to Boston today to think this over with others and begin conceptualizing changes to our approach to match the efforts of our peer organizations. Growth can sometimes be painful, but I for one am extremely excited at the prospect of growing our process to have more hands make for lighter lifting.

mhucka commented 7 years ago

I think this is just the nudge we need to move away from having volunteers download data (a task that is quite frankly better performed by a computer) to having volunteers at our events contextualize data (an inherently human task). Instead of asking volunteers to engage in downloading, we would hand them already-archived data, and ask them to enrich the metadata & context that is lost in the archiving process.

That's a great goal. I know when I was leading people in seeding URLs during the UCLA event in January, I felt a bit like I was asking them to do really menial stuff that a computer should be doing.