Reconciling with `package.jsonld`

frictionlessdata / datapackage

Data Package is a standard consisting of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that facilitates findability, accessibility, interoperability, and reusability (FAIR) of data.

https://datapackage.org

The Unlicense

488 stars 112 forks source link

Reconciling with `package.jsonld` #110

Closed jbenet closed 8 years ago

jbenet commented 10 years ago

Hey guys!

I'm the author of datadex, and now working with @maxogden on dat. As a package manager for datasets, datadex uses a package file to describe its datasets. Choosing between data-package.json and package.jsonld is hard:

On the one hand, data-package.json has been around longer, has a well defined spec, and many packages use it.
On the other, package.jsonld takes into account jsonld (which came out recently), and plugs into schema.org's schemas, for linked-data goodness. And at first glance, it seems most of what's in data-package.json is in package.jsonld.

It's confusing for adopters to have two different specs. I think we should reconcile these two standards and push forward with one. Thoughts? What work would it entail?

To ease transition costs, I'm happy to take on the convergence work if others are too busy. Also, I can write a tool to convert between current data-package.json and package.jsonld and whatever else.

Cheers!

cc @rgrp, @maxogden, @sballesteros

sballesteros commented 10 years ago

Hi!

What we love about JSON-LD is that it can be seen as one serialization of RDF and can therefore be converted in RDFa and therefore directly inserted into HTML documents. It opens some cool possibilities, like you are reading a New York Times article for instance and you can ldpm install it and start hacking on the data. Everything your data package manager needs to know is directly embedded into the HTML! To me, being able to embed a package.json-like-thing into a webpage, respecting open web standards, is amazing. Regarding schema.org, our "dream" is to be able to leverage the web as the registry using some markup already being indexed by the major search engines (google, yahoo!, yandex and bing). Check http://datasets.schema-labs.appspot.com/ for instance.

I would encourage anyone interested in that to go read the JSON-LD spec and the RDFa Lite spec. Both are super well written. The RDFa Lite spec in particular is remarkably short.

That being said, we are still experimenting a lot with that approach and 100% agree that soon enough we should work on merging all of that (and happy to contribute to the work)...

Another thing to follow closely is: CSV-LD.

sballesteros commented 10 years ago

Forgot to mention but for datatypes and co http://www.w3.org/TR/xmlschema-2/#built-in-datatypes is here to help (and can prevent re-inventing a spec for datatypes).

rufuspollock commented 10 years ago

@jbenet great to hear from you and good questions. Obviously my recommendation here would be that we converge on datapackage.json - I should also say that @sballesteros has been a major contributor to the datapackage.json spec as it stands :-)

I note there are plans to introduce a few json-ld-isms (see #89) into datapackage.json but the basic aim is to keep this as simple as possible and pretty close to the commonjs package spec. Whilst I appreciate RDF's benefits (I've been a heavy RDF user in times past) I think we need to keep things super-simple if we are going to generate adoption - most data producers and users are closer to the Excel than the RDF end of the spectrum. (I note one can always choose to enhance a given datapackage.json to be json-ld like - I just think we don't want to require that).

That said the differences seem pretty minor here in the basic outline so with a small bit of tweaking we could have compatibility.

@sballesteros I note main differences seem, at the moment, to be:

rename of 'resources' to 'datasets'
rename of 'path' to 'distribution'
'code' element (currently 'scripts' is being used informally in datapackage.json echoing the approach in node)

If we could resolve these and perhaps define a natural enhancement path from a datapackage.json to become "json-ld" compliant we could have a common base - those who wanted full json-ld could 'enhance' the base datapackage.json in their desired ways but we'd keep the simplicity (and commonjs compatability) for non-RDF folks.

wdyt?

rufuspollock commented 10 years ago

@jbenet more broadly - great to see what you are up to. Have you seen https://github.com/okfn/dpm - the data package manager? That seems to have quite a bit in common with the data tool I see you are working on. Perhaps we could converge efforts there too?

There's also a specific issue for the registry at https://github.com/okfn/dpm/issues/5 - the current suggestion had been piggy-backing on github but I know we also have options in terms of CKAN and @sballesteros has worked on a couchdb based registry.

sballesteros commented 10 years ago

I would said that given that using the npm registry is no longer really an option, alignment with schema.org is more interesting than commonjs compatibility but I am obviously biased ;)

A counter argument to that would be dat: @maxogden, do you know how dat is going to leverage transformation modules (do you plan to use the scripts property of commonJS ?)

To me alignment with schema.org => we can generate a package.jsonld from any webpage with RDFa markup (or microdata). You can treat JSON-LD as almost JSON (just an extra @context) and in this case there is no additional complexity involved and no need to mention / know RDF at all.

JDureau commented 10 years ago

Hey all,

Another argument in favour of a spec supporting JSON-LD and aligned with schema.org is explorability. Being able to communicate unambiguously that a given dataset/resource deals with http://en.wikipedia.org/wiki/Crime and http://en.wikipedia.org/wiki/Sentence_(law) for example, goes a longer way than keywords and a description. It makes it query-ready.

"dataset": [
  {
    "name":  "mycsv",
    "about": [
      { 
        "name": “crimes”,
        "sameAs": "http://en.wikipedia.org/wiki/Crime"
      },
      { 
        "name": “sentences”,
        "sameAs": "http://en.wikipedia.org/wiki/Sentence_(law)
      }
    ],
    ...
  }
]

jbenet commented 10 years ago

@rgrp thanks for checking this out!

@rgrp said:

Whilst I appreciate RDF's benefits ... I think we need to keep things super-simple if we are going to generate adoption ... (I note one can always choose to enhance a given datapackage.json to be json-ld like - I just think we don't want to require that).

Strong +1 for simplicity and ease of use for end users. My target user is the average scientist. Friction (in terms of having to learn how to use tools, or ambiguity in the process) is deadly.

I don't think that making the format json-ld compliant will add complexity beyond that ease of use. JSON-LD was designed specifically the smallest overhead, that still provides data-linking power. I found these blog posts from Manu (the primary creator), quite informative:

If I were building any package manager today, I would aim for JSON-LD as a format, seeking the easiest (and readable-ish) integration to other tools. I think JSON is already "difficult to read" to non-developers (hence people's use of YAML and TOML, which are inadequate for other reasons), the JSON-LD @context additions don't seem to make matters significantly worse.

I think even learning the super simple NPM package.json is hard enough for most scientists -- as a prerequisite to publishing their data. I claim the solution is to build a very simple tool (both CLI and GUI) that guides users through populating whatever package format we end up converging in. My data tool already does this, though that could still be even simpler. GUIs will be important for lots of users.

@rgrp I found your dpm after I had already built mine. We should definitely discuss converging there. I'm working with @maxogden and we're building dat + datadex to be interoperable. Also, one of the use cases I care a lot about is large datasets (100GB+) in Machine Learning and Bioinformatics. I'm not sure how much you've digged into how data and datadex work, but it separates the data + the registry metadata, such that the registry can be very lightweight, and the data can be retreived directly from S3, Google drive, or peers (yes, p2p distribution). The way it works right now will change. Let's talk more off-band. I'll send you an email :)

@sballesteros what do you think of the differences @rgrp pointed out? Namely:

rename of 'resources' to 'datasets'

rename of 'path' to 'distribution'

'code' element (currently 'scripts' is being used informally in datapackage.json echoing the approach in node)

IMO:

resources is more general.
distribution seems more general. @sballesteros what else (other than contentPath) could go here?
the code in package.jsonld seems to be more descriptive than scripts, however, scripts is simple and works well already. not sure.

And, @sballesteros, do you see other differences? What else do you remember being explicitly different?

Let's try to get convergence on these :)

sballesteros commented 10 years ago

@jbenet before diving into the small differences and trying to converge somewhere, I think we should really think of why we should move away from vocabularies promoted by the W3C (like DCAT). To me, schema.org has already done a huge amount of work to try to bring as much pragmatism as possible in that space see: http://www.w3.org/wiki/WebSchemas/Datasets for instance.

Why don't we join the W3C mailing lists and take action there so that new properties are added if we need them for our different data package managers?

The way I see it is that unlike npm and software package manager, for open data, one of the key challenge is to make data more accessible to search engines (there are so many different decentralized data publishers out there...). Schema.org is a great step in that direction so in my opinion it is worth the little amount of clunkiness in the property names that it imposes. Like you said, a GUI is going to be needed for beginners anyway so the incentive to move away from W3C standard for easier to read prop. name is low.

Just wanted to make that clear but all that being said, super happy to go in convergence mode.

rufuspollock commented 10 years ago

Let's separate several concerns here:

A. MUST datapackage.json be valid JSON-LD? Note, nothing prevents someone enhancing datapackage.json to be full JSON-LD - the spec specfically allows new attributes. As such question is: should we require datapackage.json always to be valid JSON-LD?
- e.g. @JDureau nothing prevents you doing what you want already with datapackage.json :-) i.e. you can add those attributes if you want (the question is MUST everybody do it ...)
B. If so what is required for datapackage.json to be valid JSON-LD. (Could someone specific exaclty what would be needed here? Is it that there must be an @context and @id (?).)
C. DCAT "compatability". There isn't a specific serialization of DCAT to JSON so we have some flexibility here since we ultimately going to have to do a bit of translation to DCAT / schema.org whatever we do. I note here that I had significant input into the DCAT spec - including being the person originally responsible for the "distribution" terminology (which came from early CKAN where it came from python setup.py stuff). I think distribution is ultimately wrong for what we are trying to describe here and argued for that (sadly unsuccessfully ;-) ...) in later versions of DCAT. However, all that said I'm not too hung up here on the terminology - my biggest concern is breaking changes give existing adoption. My preference would be:
- resources over datasets (this does not really occur in the DCAT spec and i think datasets is confusing since what you are describing by the datapackage.json is a dataset)
- I don't mind if we converge path + url into distribution- though it means a bit of effort for parsers (e.g. to work out if something is a relative path)
- code vs scripts - not in the spec properly yet so happy either way frankly (we could even support both in the interim)

@jbenet current spec allows you to store data anywhere and tradition from CKAN continued in datapackage.json is that you have flexibility as to whether data is stored "next" to the metadata or separately (e.g. in s3, a specific DB, a dat instance etc). dpm is intended to continue to respect that setup so it sounds like we are very close here :-)

@sballesteros (aside) I'm not sure accessibility to search engines is the major concern in this - the concern is integration with tooling. We already get reasonable discovery from search engines (sure its far from perfect, but its no worse than for code). Key here for me is that data is more like code than it is like "content". As such, what we most want is better toolchains and processing pipelines for data. As such the test of our spec is now how it integrates with html page markup but how it supports use in data toolchains. As a basic test: can we do dependencies and automated installation (into a DB!).

jbenet commented 10 years ago

A. MUST datapackage.json be valid JSON-LD?

I think yes. I understand the implications of MUST semantics, and the unfortunate upgrading overhead costs it imposes. But without requiring this, applications cannot rely on a package definition being proper linked-data. They require data-package.json specific parsing. In a way, it constrains the reach of the format. (FWIW, JSON-LD is a very pragmatic format.)

To better understand the costs of converting exiting things, it would be useful to get a clear picture of the current usage of data-package.json. I see datahub.io and other CKAN instances. @rgrp Am I correct in assuming all of these use it? What's the completeness of the CKAN instances list (i.e. is that a tiny or large fraction)?

B. What is required for datapackage.json to be valid JSON-LD?

I believe @context and @id is enough for valid JSON-LD, though the spec defines more that would be useful. I'm new to it, so I'll dive in and post back here with a straight JSON-LD-fication of data-package.json. In the meantime, @sballesteros what's your answer to this? What else did you have to and get to use?

C. DCAT "compatability".

Relevant mappings between DCAT and Schema.org. I'm new to DCAT, so can't comment on its vocabulary, beyond echoing "let's try not to break compatibility unless we must." @sballesteros ?

@jbenet current spec allows you to store data anywhere and tradition from CKAN continued in datapackage.json is that you have flexibility as to whether data is stored "next" to the metadata or separately (e.g. in s3, a specific DB, a dat instance etc). dpm is intended to continue to respect that setup so it sounds like we are very close here :-)

Sounds great! I care strongly about backing up everything, in case individuals stop maintaining what they published. IMO, what npm does is exactly right: back up published versions, AND link to the github repo. Data is obviously much more complicated, given licensing, storage, and bandwidth concerns. I came up with a solution-- more on this later :).

accessibility to search engines

I don't particularly care much about this either. Search engines already do really well (and Links tend to be the problem, not format). IMO a JSON-LD format that uses either an existing vocabulary or one with good mappings will work well. @sballesteros what are your concerns here?

rufuspollock commented 10 years ago

@jbenet thanks for the responses which are very useful.

On point A - what MUST be done to support JSON-LD, the fact that json-ld support would mean a MUST for @id and @context is a concern IMO.

This is significant conceptual complexity for most people (e.g. doesn't the id need to be to a valid RDF class?). This is where I'd like to allow but not require that additional complexity.

jbenet commented 10 years ago

@rgrp

@context will probably be the same in all documents. It's important to have though, for non-data-package-specific JSON-LD enabled apps.
@id will be unique per dataset per version.

I think these could be filled in automatically by dat, data, dpm, and ldpm before submission. That way people don't have to worry about the conceptual complexity of understanding RDF.

For instance, say I have a dataset cifar that I want to publish to datadex. It's version is named 1.1-py. Based on how datadex works now, this dataset's unique id is jbenet/cifar@1.1-py (i call this a handle, this is datadex specific). dat, dpm, and ldpm also have namespace/versioning schemes that uniquely identify versions. So, can use that as an @id directly. On submission, the tool would automatically fill in:

{
  @context: "<url to our>/data-package-context.jsonld",
  @id: "http://datadex.io/jbenet/cifar-100@1.0-py",
  ... // everything else
}

rufuspollock commented 10 years ago

@jbenet I must confess I still think this is unnecessarily burdensome addition as a requirement for all users. As I said there's no reason users or even a given group cannot add these to their datapackage.json but this adds quite a bit of "cognitive complexity" for those users who are unfamiliar with RDF and linked data.

There are very few required fields at the moment in datapackage.json and anything that goes in has to be seen as showing a very strong benefit over cost (remember each time we add stuff we make it more likely people either won't use it or won't actually produce valid datapackage.json).

Whilst I acknowledge that quite a lot (perhaps most) datapackage.json will be created by tools I think some people will want to edit by hand (and want to understand the files they look at). (I'm an example of a by-hand editor ;-) ...)

jbenet commented 10 years ago

anything that goes in has to be seen as showing a very strong benefit over cost

Entirely agreed. Perhaps the benefits of ensuring every package is JSON-LD compliant aren't clear. Any program that understands JSON-LD would then be able to understand datapackage.jsonld automatically. Without the need for human intervention (writing parsers, running them, telling the program how to manipulate this format, etc). This is huge. On the same -- or greater -- of importance than having a version.

This video is aimed towards a very general audience, but still highlights the core principles: https://www.youtube.com/watch?v=vioCbTo3C-4

Many people have been harping on the benefits of linking data for over a decade, so I won't repeat all that here. The JSON-LD website and posts by @msporny highlight some of the more pragmatic (yay!) reasoning. Will note that it only works for the entire data web if the context is there (as the video explains). That's what enables programs that know nothing at all about this particular format to completely understand and be able to process the file. Think of it as a link to a machine-understandable RFC spec that teaches the program how to read the rest of the data. (without humans having to program that knowledge in manually).

I think some people will want to edit by hand (and want to understand the files they look at).

Absolutely, me too. But imagine it's your first time looking at datapackage.json. Does

{
  "name": "a-unique-human-readable-and-url-usable-identifier",
  "datapackage_version": "1.0-beta",
  "title": "A nice title",
  "description": "...",
  "version": "2.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }]
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "contributors":[ {
    "name": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    # like contributors
  }],
  "publishers": [{
    # like contributors
  }],
  "dependencies": {
    "data-package-name": ">=1.0"
  }
  "resources": [
    {
    }
  ]
}

Look much better than

{
  "@context": "http://okfn.org/datapackage-context.jsonld",
  "@id": "a-unique-human-readable-and-url-usable-identifier",
  "datapackage_version": "1.0-beta",
  "title": "A nice title",
  "description": "...",
  "version": "2.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }]
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "contributors":[ {
    "name": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    # like contributors
  }],
  "publishers": [{
    # like contributors
  }],
  "dependencies": {
    "data-package-name": ">=1.0"
  }
  "resources": [
    {
    }
  ]
}

I would imagine thinking things like:

Why are there two versions?
What is a datapackage_version ?
What is the -beta part?
Why do licenses have an object describing them? what other values could go there?
Same for contributors, maintainers, publishers.
What is the difference between contributors, maintainers, publishers anyway?
What bank of words count as keywords? anything?
What are the differences between name, title, and description? What will go where?

The latter adds:

What is that funky @context thing?

IMO, these would involve looking up the spec and understanding how the format works. I care a lot about readability (i originally had picked yaml for datadex) But i claim readability for new users is not affected significantly here. :)

msporny commented 10 years ago

@rgrp wrote:

On point A - what MUST be done to support JSON-LD, the fact that json-ld support would mean a MUST for @id and @context is a concern IMO. This is significant conceptual complexity for most people (e.g. doesn't the id need to be to a valid RDF class?). This is where I'd like to allow but not require that additional complexity.

@id is not required for a valid JSON-LD document. Also note that you can alias "@id" to something less strange looking, like "id" or "url", for instance. The ID doesn't need to be a valid RDF class. The only thing that's truly required to transform a JSON document to a JSON-LD document is one line - @context. None of your users need to be burdened w/ RDF or Linked Data concepts unless they want to be. Just my $0.02. :)

junosuarez commented 10 years ago

Weighing in briefly after being directed to this thread by @maxogden. I am not currently developing any tools, but rather looking for forward-thinking best practices around metadata for datasets I'm working with a city to help publish, so in that sense I am your end user.

From a government point of view: many governmental entities at a similar level (e.g. cities, transit agencies, school districts) are dealing with similar but very differently structured data - often in contexts that care very little about schemas or data re-use, and in strange, vendor-specific formats. In order to do any sort of comparative analysis or re-use open source tools, it is very important to be able to create ad-hoc schemas. Metadata formats which make this easier by including linked data concepts are a huge step forward.

Pertinent to this thread: given what @msporny said about being able to alias @id, and given the cognitive overhead of datapackage_version which @jbenet mentioned, would it be possible to use @context to also indicate the version number of the datapackage spec? eg:

{
  "@context": "http://okfn.org/datapackage-context.jsonld#0.1.1",
  "id": "http://dathub.org/my-dataset",
  "title": "my dataset",
  "version": "1b76fa0893628af6c72d7fa7a6c10f8e7101c31c"
}

In my example, I'm also using a hash as a datapackage version rather than a semver, since for frequently changing, properly versioned data, it is unfeasible to have a manually incremented version number.

jbenet commented 10 years ago

From a government point of view: many governmental entities at a similar level (e.g. cities, transit agencies, school districts) are dealing with similar but very differently structured data - often in contexts that care very little about schemas or data re-use, and in strange, vendor-specific formats. In order to do any sort of comparative analysis or re-use open source tools, it is very important to be able to create ad-hoc schemas. Metadata formats which make this easier by including linked data concepts are a huge step forward.

:+1: Thank you. I will quote this in the future. :)

would it be possible to use @context to also indicate the version number of the datapackage spec?

Yeah, absolutely. It's any URL, so you can embed a version number in the url and thus identify a different @context. And, great point. We should establish a culture of doing that. I don't think we should require it, as i can imagine cases where it would be more problematic than ideal (not to say how hard and annoying it would be to impose a version scheme on others).

"http://okfn.org/datapackage-context.jsonld#0.1.1"

I believe JSON-LD can extract the right @context from a #<fragment>, though not 100% sure. @msporny will know better. If not, embed it in the path:

"http://okfn.org/datapackage-context@<version>.jsonld"
"http://okfn.org/datapackage-context.<version>.jsonld"
"http://okfn.org/datapackage-context/<version>.jsonld"
"http://okfn.org/<version>/datapackage-context.jsonld"

I'm also using a hash as a datapackage version rather than a semver, since for frequently changing, properly versioned data, it is unfeasible to have a manually incremented version number.

:+1: hash versions ftw. What are you building? And i encourage you to allow tagging of versions. the rightest thing i've seen is to have hashes (content-addressing) identify versions, and allow human-readable tags/symlinks (yay git).

jbenet commented 10 years ago

@rgrp thoughts on all this? Can we move fwd with @context or do you still think it is inflexible? Happy to discuss it more/make the case through a higher throughput medium (+hangout?).

@sballesteros if we have @context here, does that satisfy your constraints? (given that your own @context could alias the naming the names you're currently using in package.jsonld to those used in the standard @context).

rufuspollock commented 10 years ago

@jden great input.

@jden @jbenet re datapackage_version I actually think we should deprecate this a bit further (its not strictly required but I think, frankyl, we remove completely). People rarely add it and I'm doubtful it would be reliably maintained in which case its value to consumers rapidly falls towards zero. (I was sort of doubtful when first added but there were strong arguments in favour by others at the time).

Re the general version field I note that semver allows using hashes a sort of extra e.g. 1.0.0-beta+5114f85...

However, I do wonder about using version field at all if you are using full version control for the data - I imagined the version field being more like version field for software packages where its increment means something substantial (but where you can individual revisions if you want from the version control system - cf the node.js package.json where dependencies can refer either to package versions or to specific revisions for git repos).

jbenet commented 10 years ago

re datapackage_version I actually think we should deprecate this a bit further (its not strictly required but I think, frankyl, we remove completely).

I agree, let's remove datapackage_version and put a version number in the url of our @context. That's the LD way of doing it. FWIW, having a version is a good thing, and this way (in the url) we get seamless enforcement without the additional complexity of an optional field). And it's pretty good: the version isn't just a number to lookup somewhere; it points directly to a different file. :)

However, I do wonder about using version field at all

Having versions in pacakage-managers/registries is really useful. Let's not remove this. The package manager websites want to show meaningful descriptions of important version changes (semver). Users can understand the difference between 1.0.3 and 2.6.4 (one's newer) and conclude, usually correctly, which one is better. Git, which is full version control, makes extensive use of tags/branches (which are both just named version pointers). Hence I recommended to @jden to allow tagging :).

semver

By the way-- i'm not sure if you came up with something similar, but I put a tiny bit of thought into making a data-semver https://github.com/jbenet/data/issues/13 which might be useful. Clearly expressing what constitutes a MAJOR.MINOR.PATCH version change in the data context will help avoid confusion for people working with data that don't understand the subtleties of code semver.

@rgrp can we go fwd with @context? (Lmk if you need more time for consideration-- just wanting to get closer to done with this as we'll be using data packages soon and would like to have things resolved before that happens :) ).

rufuspollock commented 10 years ago

On the @context question: let me reiterate that I'm a strong +1 on allowing this in (even encouraging it) but I am still concerned about making it a MUST. Strictly we only have one required field at the moment (name).

Also do be aware that @context on its own buys you rather little. If you really want to benefit from the rich schema/ontology support of RDF you are going to want to integrate a lot more stuff (e.g. into the type fields of the resources). Again i think this is great if you can do it since you get much richer info - and data package has been designed so you can do this progressive enhancement really easily (just add the @type to your resource schema) but I don't think it should be required for everyone.

rufuspollock commented 10 years ago

@jbenet to be clear i wasn't suggesting removing version - i was saying i wasn't sure about using it for sha hash of a changeset (since as @jden mentions that changes so much). I think version is super-useful and isn't going anywhere. As you say the primary use for version (IMO) is more like the tags one has in git.

Also note I wrote my previous comment before I'd read your response. My suggested approach at present is that we add @context to the spec with a clear description of use but that we don't make it a MUST. If you were able to draft a description for the @context field to include that would be great and we could then review and slot it in the appropriate place.

msporny commented 10 years ago

I believe JSON-LD can extract the right @context from a #, though not 100% sure. @msporny will know better. If not, embed it in the path:

No, JSON-LD will not extract the "right" context from a #fragment :). We considered that option and felt that it adds unnecessary complexity (when a simpler alternative would solve the same problem). Just do this if you want to version the context:

"@context": "http://okfn.org/datapackage-context/v1.jsonld"

You are probably going to want to use a URL redirecting service so that your developers don't see breaking changes if okfn.org ever goes away. For example, use https://w3id.org/ and make this your context URL:

https://w3id.org/datapackage/v1

This does three things:

1) It decouples the hosting location of the actual file from the identifier that people type out to use the context. So, if you decide to change the hosting location from okfn.org to some other hosting provider, none of the programs using the context are affected. 2) It gives people an easy-to-remember URL for the JSON-LD Context. 3) It provides a hint to clients as to the version of the vocabulary you're using.

I can add it to the site in less than a minute if you want (or you can submit a pull request). w3id.org is backed by multiple companies and is designed to be around for 50+ years. You can learn more about it by going here: https://w3id.org/

(edit: fixed UTF-8 BOM - no idea how that got in there)

paulfitz commented 10 years ago

Somehow the w3id.org homepage link at the end of https://github.com/dataprotocols/dataprotocols/issues/110#issuecomment-41914984 is broken for me due to a utf8 bom that's crept in? Source code shows it as https://w3id.org/%EF%BB%BF. Strange. https://w3id.org/ works.

junosuarez commented 10 years ago

@jbenet @rgrp Here's an example dataset I'm building: https://github.com/jden/data-bike-chattanooga-docks

Some thoughts from the experience (albeit tangential to this thread):

0) all of this metadata was created by hand, without tooling. What I've filled out is about as far as I got before moving on. I am still interested in creating schemas to describe the csv and geojson representations of the data, but haven't looked into it yet.

1) it seems silly having both package.json and package.jsonld at the root level of the directory. I'm torn between whether I'd prefer npm to be package.jsonld aware and just parse package.jsonld or to have some other place to put package.json. From a package user experience point of view, I really want someone to be able to rebuild my data from git cloning, npm installing, npm starting.

2) How would I indicate that this package has two representations of the same resource (data.csv and data.geojson), as opposed to for example two separate-but-related tables? From a REST background, my inclination would be to do something like

"resources": [
  {
    "name": "data",
    "mediatype": "text/csv",
    "path": "data.csv"
  },
  {
    "name": "data",
    "mediatype": "application/json",
    "path": "data.geojson"
  }
]

3) it wasn't clear to me how to specify an appropriate license for the code part in the scripts/ directory vs the package as a whole. I ended up describing it in plaintext in the readme, putting the code license (ISC) in the package.json, and the datapackage license (PDDL) in the package.jsonld.

rufuspollock commented 10 years ago

@jden

0) I agree - i've largely hand-created though I've started using dpm init to bootstrap the file as it auto extracts table info

1) we debated having package.json (for node) and datapackage.json for datapackage stuff and ultimately decided you do want them different - see #73 for a bit more on this

2) I think your suggestion seems good and how we've done datapackage.json resource entries so far

3) I think that is correct and how i'd go with things :-)

jbenet commented 10 years ago

@jden

Here's an example dataset I'm building

Cool!

I am still interested in creating schemas to describe the csv and geojson representations of the data, but haven't looked into it yet.

If you give me a couple of weeks, Transformer (repo) will help you do this really easily.

1) it seems silly having both package.json and package.jsonld at the root level of the directory.

This will be the case as long as different registries do not agree on their structures. If you published this to rubygems too you'd also have a .gemspec.

I'm torn between whether I'd prefer npm to be package.jsonld aware and just parse package.jsonld

We could open up a discussion about getting to this. Frankly, now that JSON-LD exists, there's no reason we can't have the same package.jsonld spec for every package manager out there, and use different @contexts (thank you, @msporny!!). Developers are free to adhere to whatever @context they want (and thus whatever key/val structure), and yet would be compatible across the board, if mappings between the properties of both contexts exists. :heart: data integration!

But, don't expect this to happen for years. :)

Actually... we might even be able to be package.json compatible... we'd need a special @context (which npm will just ignore) that gives us the mappings from the npm keys to our package.jsonld keys. Hm! (thank you @msporny !!!! in one @context swoop you fixed so much).

2) How would I indicate that this package has two representations of the same resource

In my biased world view, I'd include only one and use transformer to generate the second representation with a makefile. (or have both, but still have the transform) Something like:

data.geojson: data.csv
    cat data.csv | transform csv my-data-schema geojson > data.geojson

Note: this doesn't work yet. It will soon :)

On indicating this, your same name with different mediatype (or something like it) sgtm. And I would definitely put it in the Readme too.

3) it wasn't clear to me how to specify an appropriate license for the code part in the scripts/ directory vs the package as a whole. I ended up describing it in plaintext in the readme, putting the code license (ISC) in the package.json, and the datapackage license (PDDL) in the package.jsonld.

I think this (ISC in package.json#license and PDDL in package.jsonld#license) is precisely the right thing to do.

For code, it's common to add a LICENSE file in packages. We could establish a convention of putting the various licenses into the same file, or perhaps have two: LICENSE as usual for code, and DATA-LICENSE. (Personally, I think it's ugly to have these files and I never include them, because I think the license in package.json / Readme is enough for most modules I write. That said, if something gets really popular, and lots of people start using it to make products, it becomes more important to be legally safe than to have a clean directory. :) )

@rgrp

1) we debated having package.json (for node) and datapackage.json for datapackage stuff and ultimately decided you do want them different

does this change in light of the comments I made above, re being directly package.json compatible? I haven't given it full thought. I agree with your comments here, and also think it fine to have both package.json and datapackage.jsonld in both.

Actually, the .jsonld extension-- though nice-- may add much more trouble than it's worth yet, given that many tools know to parse .json as JSON and don't understand .jsonld (node's require, for example). Thoughts, @sballesteros @msporny? Is there a strong reason (other than indicating to humans that this is JSON-LD) to use .jsonld over .json ? JSON-LD was designed as an upgrade to (and fully compatible with) json, so maybe we should just use json? @rgrp, you probably prefer this, no?

@rgrp

Also do be aware that @context on its own buys you rather little. If you really want to benefit from the rich schema/ontology support of RDF you are going to want to integrate a lot more stuff (e.g. into the type fields of the resources).

Not necessarily? As I understand, we can remap type -> @type in the context file, and try to use the types that are currently there for richer stuff. We get this for free without having to change anything, thanks to how JSON-LD works. Though, not sure whether people's use of type is well defined or that useful.

I want to get this finished soon, so let's settle our thought on @context, so we can draft changes and move fwd. :D

I'm a strong +1 on allowing this in (even encouraging it) but I am still concerned about making it a MUST. Strictly we only have one required field at the moment (name).

I think it's really important to make the move to JSON-LD. And this IMO makes it a MUST (i actually think it's more important than name itself (i can go into why name itself could become non-MUST, but prob not helpful here / not good idea :) ).

I will definitely require it in any registries I write for data packages. Again, tools like dpm init help. AND, you can have dpm publish (or the registries themselves) insert it automatically on publishing. That way, users never have to worry about it at all! It just happens for them.

I'm happy to help in upgrading all existing packages (scripts to upgrade, + crawl CKAN and add @context to everything (would we need to email people to ask for permission? not clear on the ToS of CKAN).

If you're set on not making it a must, then I propose we keep data-packages.json as is. I can setup a fork of the Data Packages site with a Linked Data Packages, that tracks the original spec to the T, except for the MUST on the @context. Apologies, I don't mean to be uncompromising. This simply is a very important step forward to take, not just for data packages, but for the entire web itself.

If you were able to draft a description for the @context field to include that would be great and we could then review and slot it in the appropriate place.

How's this as a first draft?

@context (required) - URL of the Data Packages JSON-LD context definition. This MUST be a valid URL to a Data Packages compatible JSON-LD context. The @context SHOULD just be the URL of the most recent Data Packages context file: https://w3id.org/datapackage/v1.

The @context MAY be the URL to another context file, but it MUST be compatible with the Data Packages context: there MUST be relation between the properties outlined in Data Packages and the equivalent properties outlined in your own context.

The last line is super awkward. Does it even make sense?

@rgrp @msporny please correct any nonsense I might have spewed! :)

msporny commented 10 years ago

@jbenet Most of what you say above makes perfect sense.

re: .jsonld extension

You don't need to use a .jsonld extension, we just provided that for those that want to make the MIMEtype of the file clear using the file extension. A JSON-LD processor will process a file with a .json extension just fine. The only thing that's required is that the file contains a "@context" key.

The @context MAY be the URL to another context file, but it MUST be compatible with the Data Packages context: there MUST be relation between the properties outlined in Data Packages and the equivalent properties outlined in your own context.

Another option could be:

The @context MAY use a URL to an alternate context file. The alternate context file MUST be a superset of the Data Packages context; that is, every term mapping in the Data Packages context must exist as a term mapping in the alternate context file.

jbenet commented 10 years ago

The @context MAY use a URL to an alternate context file. The alternate context file MUST be a superset of the Data Packages context; that is, every term mapping in the Data Packages context must exist as a term mapping in the alternate context file.

Yeah, much better, thanks!

rufuspollock commented 10 years ago

@jbenet why can't you move ahead with using datapackage.json and data package spec as is and just add @context to all the data packages you create? My concern here is that this is a major addition and we've only had this thread and little exposure to wider community. As I've explained I'm not averse to the RDF stuff (and have done it heavily myself) but I'm concerned about making this a MUST without some period of evidence that this delivers benefits over costs. One thing we could do here is to take this to out a bit more. Also what about having a BoF about this at CSVConf in July in Berlin where we could meet face to face and hammer this out more with a bit of real-world experimentation under our belt ...

junosuarez commented 10 years ago

@rgrp as I see it, @context is the single required extensibility point that enables all of the other magic sauce. Open data is about reuse an interoperability in an unambiguous, machine-readable and -usable manner. Web browsers were able to get away with loose doctypes only in the face of massive installed user bases and a primarily human reader audience.

I understand your point about creating as little friction as possible for data set publishers. In practice, how many unique datasets is a typical individual author going to be producing? My guess is that the answer is "few enough to where they would be dependent on looking at documentation or using tooling every time".

Requiring @context is a breaking change, but it's a wholly automatable upgrade path, and I still contend that it's negligible overhead to dataset publishers. In terms of possible benefits, they are many, but even if they were limited to simply what has already been demonstreated by @jbenet in transformer, it would be worth it. Of course, adoption of json-ld will enable many other powerful applications, like analysis, aggregation, indexing, search, related content discovery, etc.

Put another way, package management is about more than dependency resolution and file/resource transfer. It's about making it easy to re-use existing components to create novel, more valuable combinations.

rufuspollock commented 10 years ago

@jden since I think we actually all agree on a lot here let me spell out the areas where I think things may not be clear or we don't have total agreement:

A. We're all happy for JSON-LD to be supported by Data Packages (in fact it does already) - rather, the question is whether we make it mandatory

B. What does JSON-LD support entail? It isn't just about having an @context: on its own my guess that that is fairly useless and that to really get benefits from the linked-data side of things you'll need to all include quite a bit more (e.g. @type, schema namespaces etc). This is important because it means that if you don't do that other stuff @context will just be overhead and if you don't want it just to be overhead you'll have to do a lot more.

C. Transformers. Its worth working this through in some detail because I think it highlights a potential misconception. I'm a big fan of data transformers of various kinds and we've written some (from the basic http://okfnlabs.org/transformer/ to stuff for generic tabular transformation - cf https://github.com/okfn/datapipes etc). My concern here in the discussion (and its illustrative of general discussion re linked data stuff) is how much a particular schema framework buys you. I've used RDF and json-ld quite a bit and I think they're great but I think its easy to over-estimate how much they - in themselves - buy you compared to all the work to defines schemas and get them broadly adopted and build the tooling around it. RDF-land is absolutely full of amazing ontologies and data transformation projects but they haven't led (yet!) to the amazing web of data and they often are pretty complex (and therefore hard for others to adopt). My point here isn't that I think we want to prevent those efforts but just that we should be aware that the real work in e.g. transform stuff isn't solved by adopting JSON-LD - you then need to establish a (widely-adopted) schema structure plus a bunch of tooling (that stuff needs to beats out just using python or javascript - for your transform system to get adopted it has to do better that writing my own scripts and this turns out to actually be pretty hard to do - as scripting is pretty powerful and libraries there are pretty awesome!).

This is a more general point which I can make a bit more succinctly:

Its easy to think that simply adopting a given schema-structure (e.g. JSON-LD) will automatically turn into a desired outcome (e.g. amazing query abilities, automated transform of data etc) when in fact much more is needed (much of it very hard).

Most specifically, schema stuff fundamentally depends on adoption a complex social process :-) People (including me) used to say: if only everyone would get stuff into RDF we'd get all of these incredible abilities to do X, Y, Z.

But a) most people found it tough to get stuff into RDF (because schemas etc are quite complex one you start looking at them) b) people can disagree on the right schema and its easy to make up new ones c) tooling around RDF was sometimes non-optimal and there were often some pretty awesome tools for doing things other (sometimes simpler) ways.

I mention this because you say:

Of course, adoption of json-ld will enable many other powerful applications, like analysis, aggregation, indexing, search, related content discovery, etc.

Is this really true? In some sense it is because "enable" is pretty non-committal :-) But I think you mean something stronger as in "If we do this [use JSON-LD], these other things will either follow directly or become much easier". I don't think that's true - you'll need to do a lot of other stuff - little of it explicitly dependent on JSON-LD.

Again let me reiterate I'm totally +1 on encouraging JSON-LD support but having been through a couple of rounds of linked data stuff before I'd like to be cautious in enforcing adoption on others. The reasons are simple:

with current upwards-compatability of the spec there's nothing to stop a large group of people adding JSON-LD support (and making it easy for others to add). If in 6m everyone in using full JSON-LD then we have a great argument for upgrading the spec (but going the other way is hard and painful - i.e. making it required now and then removing).
I've seen that even small addition burdens can affect people's use so keeping to simplest we can is a good thing and I also think we may underestimate the complexity of requiring linked data (its not just about using @context ...)

In summary: we want to build the absolute "minimum viable product" and then iterate based on the new evidence we have of needs and what works.

jbenet commented 10 years ago

Thanks @jden !

Hey @rgrp, we seem to be misunderstanding each other! D:

A. We're all happy for JSON-LD to be supported by Data Packages (in fact it does already) - rather, the question is whether we make it mandatory

Yep, we're all clear on this. :)

B. What does JSON-LD support entail? It isn't just about having an @context: on its own my guess that that is fairly useless and that to really get benefits from the linked-data side of things you'll need to all include quite a bit more (e.g. @type, schema namespaces etc). This is important because it means that if you don't do that other stuff @context will just be overhead and if you don't want it just to be overhead you'll have to do a lot more.

This assessment is incorrect :( -- @context will not just be overhead if it exists on its own. As @msporny described above, you do not need to include all the other stuff to already leverage the power of the schemas. That is precisely the point I'm trying to convey. Just using the @context, you can optionally enhance the semantic structure of your data in the context itself, using the exact same keys and values you are already using. You can assign @types and more directly in the context file, without modifying the package file again. If the meaning of certain key/values change, you can re-map that directly in the context file, without changing any of the package files. :D

C. .... y point here isn't that I think we want to prevent those efforts but just that we should be aware that the real work in e.g. transform stuff isn't solved by adopting JSON-LD - you then need to establish a (widely-adopted) schema structure plus a bunch of tooling (that stuff needs to beats out just using python or javascript - for your transform system to get adopted it has to do better that writing my own scripts and this turns out to actually be pretty hard to do - as scripting is pretty powerful and libraries there are pretty awesome!).

Yep! very right. we're very much on the same page re: transforms. I'm not arguing from magical semantic fantasy land.

The transformer stuff I'm working on is a library that will require lots of work to make individual conversions work. Its selling points are (a) by adding a bit of typing framework, one can do conversion resolution and simplify complex data transforms; and (b) using npm buys you a sophisticated module framework (semver, easy installing, dependency management, portability, etc), and the benefit of wrapping existing tools in npm. See http://transform.datadex.io

I've seen that even small addition burdens can affect people's use so keeping to simplest we can is a good thing

Agreed on keeping it the simplest. Making registry tools insert it on publish burden users very little, if at all. Most probably won't realize or ever have to care. This isn't an increase in complexity for end users.

More broadly, I definitely understand your frustration with the semantic web world. It tends to be hyper-idealistic (hah), and out of sync with the real world. This is why Google and Facebook created schema.org, the knowledge graph, and open graph. But JSON-LD is not the complex, prescriptive RDF world. It is a very pragmatic way to sprinkle hypermedia references around APIs as they exist today, with the ability to add semantic meaning in tiny bits. Who knows what will finally bring the Linked Data world about, but it seems that JSON-LD : Linked Data :: HTML/HTTP : Hypertext/Xanadu.

If everyone is using full JSON-LD then we have a great argument for upgrading the spec

Sure, let's go with that. :) Frankly, I've little interest left in finding resolution.

rufuspollock commented 10 years ago

@jbenet great, it sounds like we have provisional agreement for near-term in that JSON-LD @context is added to spec but not made mandatory. (Also let me emphasize that I'm pretty aware of the general excellence of JSON-LD and have used extensively in a couple of projects!).

So to summarize:

We have datapackage.json
We need a new issue (or to distill) here the adding of @context to the base spec in a "MAY" form. Would welcome your suggestions here on some text and then we can get it in ...

rufuspollock commented 10 years ago

For a thoughtful discussion that aligns with some of the points I'm making here see http://berjon.com/blog/2013/06/linked-data.html

rufuspollock commented 10 years ago

Proposal: @context will added as a MAY item in next iteration of data package spec.

pvgenuchten commented 9 years ago

hi @rgrp @Jbenet what is the current status of this "MAY" item in next spec? I also wonder if you guys are lining up with the DCAT&PROV initiatives from W3C. DCAT and PROV are managing a similar usecase. Introduce a spec for metadata to describe datasets in RDF, which can be encoded as json-ld easily.

jbenet commented 9 years ago

@pvgenuchten not sure-- @rgrp ?

rufuspollock commented 9 years ago

@jbenet @pvgenuchten no-one commented on the proposal so nothing happened :-) Generally I want to see a fair number of comments on a proposal to put a change in.

rufuspollock commented 9 years ago

@pvgenuchten @jbenet if someone could give me some sample language or submit a PR this can go in.

jbenet commented 9 years ago

@rgrp sample language for the datapackage.json?

rufuspollock commented 9 years ago

@jbenet yes plus language for the actual spec proposal

jbenet commented 9 years ago

@rgrp can you give me a precise example of what you want, say for another field, like version?

rufuspollock commented 9 years ago

@jbenet I'd be looking for relevant language to add to the spec to specify what property (or properties) to add e.g. @context and how they should be used.

jbenet commented 9 years ago

@rgrp the directions are too vague. do you want a patch to http://dataprotocols.org/data-packages/ ?

how much of a connection to @context are you willing to have? proper JSON-LD makes this required. IIRC you disagree with forcing all data-packages to be linked data. So-- do you want it under the SHOULD part of required fields? (I still think the point is lost, even if a little better).

Also, as mentioned before, @context sort of obviates specs-- or rather, makes them machine-readable. So one way to go about this is to make the @context point to a JSON-LD context file with the data-package spec, allowing users to point somewhere else if they're using a different spec. But you probably dont wan't that-- you probably want them to still strictly follow the data-packages spec (otherwise non-LD parsers would break)-- so maybe make it so any other context url needs to be derived from yours (have every field covered)?

It's also easy to treat all data-packages without an @context as if they had one default @context url-- namely yours.

Note this also needs a proper JSON-LD context file representing the machine-readable version of this spec. Hmmm, i don't have enough time to take this whole thing on right now-- do you have anyone else on the team that cares about linked data to work with me on this?

rufuspollock commented 9 years ago

@jbenet I think it would come under the "MAY" style fields. I'm not sure I understand enough here to get the complexity. No problem if you don't have time for this right now and we can wait to see if someone else volunteers to get this in.

hubgit commented 9 years ago

I note that the link to package.jsonld in the issue description now leads to a 404 page - is package.jsonld still a thing?

@jbenet Would you be able to write out the data contained in a datapackage.json file as Turtle, so we can see what the URLs for each property would be?

sballesteros commented 9 years ago

Yep still a thing. We haven't had time to give it a new home yet. We have been merging it with the work done by the CSV on the web working group: see http://www.w3.org/standards/techs/csv#w3c_all. More soon.

hubgit commented 9 years ago

We have been merging it with the work done by the CSV on the web working group

In that case, this issue should probably be closed in favour of a new "Use the W3C Metadata Vocabulary for Tabular Data" issue.

@rgrp - Is the plan to transition to the CSV-WG's JSON data package description, when it becomes a Recommendation?

rufuspollock commented 9 years ago

@hubgit no, no intention to transition to that spec as it isn't data package. Whilst directly inspired by Data Package and Tabular Data Package and i'm an editor I think it has currently diverged a lot.

So, still useful to get JSON-LD compatibility in here and this issue should stay open.