Closed jbenet closed 8 years ago
Hi!
What we love about JSON-LD is that it can be seen as one serialization of RDF and can therefore be converted into RDFa and directly inserted into HTML documents. It opens some cool possibilities: imagine you are reading a New York Times article, for instance, and you can `ldpm install` it and start hacking on the data. Everything your data package manager needs to know is directly embedded into the HTML! To me, being able to embed a package.json-like thing into a webpage, respecting open web standards, is amazing. Regarding schema.org, our "dream" is to be able to leverage the web as the registry, using markup already being indexed by the major search engines (Google, Yahoo!, Yandex and Bing). Check http://datasets.schema-labs.appspot.com/ for instance.
I would encourage anyone interested in that to go read the JSON-LD spec and the RDFa Lite spec. Both are super well written. The RDFa Lite spec in particular is remarkably short.
That being said, we are still experimenting a lot with that approach and 100% agree that soon enough we should work on merging all of that (and happy to contribute to the work)...
Another thing to follow closely is: CSV-LD.
Forgot to mention but for datatypes and co http://www.w3.org/TR/xmlschema-2/#built-in-datatypes is here to help (and can prevent re-inventing a spec for datatypes).
@jbenet great to hear from you and good questions. Obviously my recommendation here would be that we converge on datapackage.json - I should also say that @sballesteros has been a major contributor to the datapackage.json spec as it stands :-)
I note there are plans to introduce a few json-ld-isms (see #89) into datapackage.json but the basic aim is to keep this as simple as possible and pretty close to the commonjs package spec. Whilst I appreciate RDF's benefits (I've been a heavy RDF user in times past) I think we need to keep things super-simple if we are going to generate adoption - most data producers and users are closer to the Excel than the RDF end of the spectrum. (I note one can always choose to enhance a given datapackage.json to be json-ld like - I just think we don't want to require that).
That said the differences seem pretty minor here in the basic outline so with a small bit of tweaking we could have compatibility.
@sballesteros I note main differences seem, at the moment, to be:
If we could resolve these and perhaps define a natural enhancement path from a datapackage.json to become "json-ld" compliant, we could have a common base - those who wanted full json-ld could 'enhance' the base datapackage.json in their desired ways, but we'd keep the simplicity (and commonjs compatibility) for non-RDF folks.
wdyt?
@jbenet more broadly - great to see what you are up to. Have you seen https://github.com/okfn/dpm - the data package manager? That seems to have quite a bit in common with the `data` tool I see you are working on. Perhaps we could converge efforts there too?
There's also a specific issue for the registry at https://github.com/okfn/dpm/issues/5 - the current suggestion had been piggy-backing on github but I know we also have options in terms of CKAN and @sballesteros has worked on a couchdb based registry.
I would say that, given that using the npm registry is no longer really an option, alignment with schema.org is more interesting than commonjs compatibility, but I am obviously biased ;)
A counter argument to that would be dat: @maxogden, do you know how dat is going to leverage transformation modules (do you plan to use the `scripts` property of CommonJS?)
To me alignment with schema.org => we can generate a package.jsonld from any webpage with RDFa markup (or microdata). You can treat JSON-LD as almost JSON (just an extra `@context`), and in this case there is no additional complexity involved and no need to mention / know RDF at all.
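To make the "almost JSON" point concrete, here is a minimal sketch (with a hypothetical document body) showing that a package.jsonld parses with any plain JSON library; the `@context` line is the only JSON-LD-specific addition:

```python
import json

# A hypothetical package.jsonld: plain JSON plus one "@context" key.
package_jsonld = """
{
  "@context": "http://schema.org/",
  "name": "my-dataset",
  "description": "A tiny example"
}
"""

# Any ordinary JSON parser handles it; no RDF tooling required.
pkg = json.loads(package_jsonld)
print(pkg["name"])          # a JSON-LD-unaware tool just sees JSON
print("@context" in pkg)    # the linked-data hook is one extra key
```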
Hey all,
Another argument in favour of a spec supporting JSON-LD and aligned with schema.org is explorability. Being able to communicate unambiguously that a given dataset/resource deals with http://en.wikipedia.org/wiki/Crime and http://en.wikipedia.org/wiki/Sentence_(law) for example, goes a longer way than keywords and a description. It makes it query-ready.
"dataset": [
{
"name": "mycsv",
"about": [
{
"name": “crimes”,
"sameAs": "http://en.wikipedia.org/wiki/Crime"
},
{
"name": “sentences”,
"sameAs": "http://en.wikipedia.org/wiki/Sentence_(law)
}
],
...
}
]
@rgrp thanks for checking this out!
@rgrp said:
Whilst I appreciate RDF's benefits ... I think we need to keep things super-simple if we are going to generate adoption ... (I note one can always choose to enhance a given datapackage.json to be json-ld like - I just think we don't want to require that).
Strong +1 for simplicity and ease of use for end users. My target user is the average scientist. Friction (in terms of having to learn how to use tools, or ambiguity in the process) is deadly.
I don't think that making the format JSON-LD compliant will add complexity that undermines that ease of use. JSON-LD was designed specifically for the smallest overhead that still provides data-linking power. I found these blog posts from Manu (the primary creator) quite informative:
If I were building any package manager today, I would aim for JSON-LD as a format, seeking the easiest (and reasonably readable) integration with other tools. I think JSON is already "difficult to read" for non-developers (hence people's use of YAML and TOML, which are inadequate for other reasons), and the JSON-LD `@context` additions don't seem to make matters significantly worse.
I think even learning the super simple npm `package.json` is hard enough for most scientists -- as a prerequisite to publishing their data. I claim the solution is to build a very simple tool (both CLI and GUI) that guides users through populating whatever package format we end up converging on. My `data` tool already does this, though that could still be even simpler. GUIs will be important for lots of users.
@rgrp I found your dpm after I had already built mine. We should definitely discuss converging there. I'm working with @maxogden and we're building dat + datadex to be interoperable. Also, one of the use cases I care a lot about is large datasets (100GB+) in machine learning and bioinformatics. I'm not sure how much you've dug into how `data` and datadex work, but they separate the data from the registry metadata, such that the registry can be very lightweight and the data can be retrieved directly from S3, Google Drive, or peers (yes, p2p distribution). The way it works right now will change. Let's talk more off-band. I'll send you an email :)
@sballesteros what do you think of the differences @rgrp pointed out? Namely:
- rename of 'resources' to 'datasets'
- rename of 'path' to 'distribution'
- 'code' element (currently 'scripts' is being used informally in datapackage.json, echoing the approach in node)

IMO:
- `resources` is more general.
- `distribution` seems more general. @sballesteros what else (other than `contentPath`) could go here?
- `code` in package.jsonld seems to be more descriptive than `scripts`; however, `scripts` is simple and works well already. Not sure.

And, @sballesteros, do you see other differences? What else do you remember being explicitly different?
Let's try to get convergence on these :)
@jbenet before diving into the small differences and trying to converge somewhere, I think we should really think of why we should move away from vocabularies promoted by the W3C (like DCAT). To me, schema.org has already done a huge amount of work to try to bring as much pragmatism as possible in that space see: http://www.w3.org/wiki/WebSchemas/Datasets for instance.
Why don't we join the W3C mailing lists and take action there so that new properties are added if we need them for our different data package managers?
The way I see it is that, unlike npm and software package managers, for open data one of the key challenges is to make data more accessible to search engines (there are so many different decentralized data publishers out there...). Schema.org is a great step in that direction, so in my opinion it is worth the small amount of clunkiness in the property names that it imposes. Like you said, a GUI is going to be needed for beginners anyway, so the incentive to move away from W3C standards for easier-to-read property names is low.
Just wanted to make that clear but all that being said, super happy to go in convergence mode.
Let's separate several concerns here:
- `resources` over `datasets` (this does not really occur in the DCAT spec, and I think `datasets` is confusing since what you are describing with the datapackage.json is a dataset)
- `path` + `url` into `distribution` - though it means a bit of effort for parsers (e.g. to work out if something is a relative path)
- `code` vs `scripts` - not in the spec properly yet, so happy either way frankly (we could even support both in the interim)

@jbenet the current spec allows you to store data anywhere, and the tradition from CKAN continued in datapackage.json is that you have flexibility as to whether data is stored "next" to the metadata or separately (e.g. in S3, a specific DB, a dat instance, etc). `dpm` is intended to continue to respect that setup, so it sounds like we are very close here :-)
@sballesteros (aside) I'm not sure accessibility to search engines is the major concern here - the concern is integration with tooling. We already get reasonable discovery from search engines (sure, it's far from perfect, but it's no worse than for code). Key here for me is that data is more like code than it is like "content". As such, what we most want is better toolchains and processing pipelines for data, and the test of our spec is not how it integrates with HTML page markup but how it supports use in data toolchains. As a basic test: can we do dependencies and automated installation (into a DB!)?
- A. MUST datapackage.json be valid JSON-LD?
I think yes. I understand the implications of MUST semantics, and the unfortunate upgrading overhead costs it imposes. But without requiring this, applications cannot rely on a package definition being proper linked data; they require `data-package.json`-specific parsing. In a way, it constrains the reach of the format. (FWIW, JSON-LD is a very pragmatic format.)
To better understand the costs of converting existing things, it would be useful to get a clear picture of the current usage of `data-package.json`. I see datahub.io and other CKAN instances. @rgrp am I correct in assuming all of these use it? What's the completeness of the CKAN instances list (i.e. is that a tiny or large fraction)?
- B. What is required for datapackage.json to be valid JSON-LD?
I believe `@context` and `@id` are enough for valid JSON-LD, though the spec defines more that would be useful. I'm new to it, so I'll dive in and post back here with a straight JSON-LD-ification of `data-package.json`. In the meantime, @sballesteros what's your answer to this? What else did you have to use, and get to use?
- C. DCAT "compatability".
Relevant mappings between DCAT and Schema.org. I'm new to DCAT, so can't comment on its vocabulary, beyond echoing "let's try not to break compatibility unless we must." @sballesteros ?
@jbenet current spec allows you to store data anywhere and tradition from CKAN continued in datapackage.json is that you have flexibility as to whether data is stored "next" to the metadata or separately (e.g. in s3, a specific DB, a dat instance etc). dpm is intended to continue to respect that setup so it sounds like we are very close here :-)
Sounds great! I care strongly about backing up everything, in case individuals stop maintaining what they published. IMO, what npm does is exactly right: back up published versions, AND link to the github repo. Data is obviously much more complicated, given licensing, storage, and bandwidth concerns. I came up with a solution-- more on this later :).
accessibility to search engines
I don't particularly care much about this either. Search engines already do really well (and Links tend to be the problem, not format). IMO a JSON-LD format that uses either an existing vocabulary or one with good mappings will work well. @sballesteros what are your concerns here?
@jbenet thanks for the responses which are very useful.
On point A - what MUST be done to support JSON-LD, the fact that json-ld support would mean a MUST for @id and @context is a concern IMO.
This is significant conceptual complexity for most people (e.g. doesn't the id need to be to a valid RDF class?). This is where I'd like to allow but not require that additional complexity.
@rgrp `@context` will probably be the same in all documents. It's important to have, though, for non-data-package-specific JSON-LD enabled apps. `@id` will be unique per dataset per version. I think these could be filled in automatically by `dat`, `data`, `dpm`, and `ldpm` before submission. That way people don't have to worry about the conceptual complexity of understanding RDF.
For instance, say I have a dataset `cifar` that I want to publish to datadex. Its version is named `1.1-py`. Based on how datadex works now, this dataset's unique id is `jbenet/cifar@1.1-py` (I call this a handle; this is datadex specific). `dat`, `dpm`, and `ldpm` also have namespace/versioning schemes that uniquely identify versions, so we can use those as an `@id` directly. On submission, the tool would automatically fill in:
```json
{
  "@context": "<url to our>/data-package-context.jsonld",
  "@id": "http://datadex.io/jbenet/cifar-100@1.0-py",
  ... // everything else
}
```
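A sketch of how a publishing tool might do this auto-fill (the context URL and the helper function are hypothetical, not any existing tool's API):

```python
# Hypothetical auto-fill step a tool like dat/data/dpm/ldpm could run
# before publishing, so users never type @context or @id by hand.
CONTEXT_URL = "http://example.org/data-package-context.jsonld"  # assumed

def prepare_for_publish(pkg: dict, registry: str, handle: str) -> dict:
    """Return a copy of pkg with the JSON-LD fields filled in."""
    out = dict(pkg)
    out.setdefault("@context", CONTEXT_URL)
    # derive a globally unique @id from the registry's handle scheme
    out.setdefault("@id", f"http://{registry}/{handle}")
    return out

pkg = {"name": "cifar", "version": "1.1-py"}
ready = prepare_for_publish(pkg, "datadex.io", "jbenet/cifar@1.1-py")
print(ready["@id"])  # http://datadex.io/jbenet/cifar@1.1-py
```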
@jbenet I must confess I still think this is unnecessarily burdensome addition as a requirement for all users. As I said there's no reason users or even a given group cannot add these to their datapackage.json but this adds quite a bit of "cognitive complexity" for those users who are unfamiliar with RDF and linked data.
There are very few required fields at the moment in datapackage.json and anything that goes in has to be seen as showing a very strong benefit over cost (remember each time we add stuff we make it more likely people either won't use it or won't actually produce valid datapackage.json).
Whilst I acknowledge that quite a lot (perhaps most) datapackage.json will be created by tools I think some people will want to edit by hand (and want to understand the files they look at). (I'm an example of a by-hand editor ;-) ...)
anything that goes in has to be seen as showing a very strong benefit over cost
Entirely agreed. Perhaps the benefits of ensuring every package is JSON-LD compliant aren't clear. Any program that understands JSON-LD would then be able to understand `datapackage.jsonld` automatically, without the need for human intervention (writing parsers, running them, telling the program how to manipulate this format, etc). This is huge -- of the same, or greater, importance than having a version.
This video is aimed towards a very general audience, but still highlights the core principles: https://www.youtube.com/watch?v=vioCbTo3C-4
Many people have been harping on the benefits of linking data for over a decade, so I won't repeat all that here. The JSON-LD website and posts by @msporny highlight some of the more pragmatic (yay!) reasoning. Will note that it only works for the entire data web if the context is there (as the video explains). That's what enables programs that know nothing at all about this particular format to completely understand and be able to process the file. Think of it as a link to a machine-understandable RFC spec that teaches the program how to read the rest of the data. (without humans having to program that knowledge in manually).
I think some people will want to edit by hand (and want to understand the files they look at).
Absolutely, me too. But imagine it's your first time looking at `datapackage.json`. Does
```json
{
  "name": "a-unique-human-readable-and-url-usable-identifier",
  "datapackage_version": "1.0-beta",
  "title": "A nice title",
  "description": "...",
  "version": "2.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }],
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "contributors": [{
    "name": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    # like contributors
  }],
  "publishers": [{
    # like contributors
  }],
  "dependencies": {
    "data-package-name": ">=1.0"
  },
  "resources": [
    {
    }
  ]
}
```
Look much better than
```json
{
  "@context": "http://okfn.org/datapackage-context.jsonld",
  "@id": "a-unique-human-readable-and-url-usable-identifier",
  "datapackage_version": "1.0-beta",
  "title": "A nice title",
  "description": "...",
  "version": "2.0",
  "keywords": ["name", "My new keyword"],
  "licenses": [{
    "url": "http://opendatacommons.org/licenses/pddl/",
    "name": "Open Data Commons Public Domain",
    "version": "1.0",
    "id": "odc-pddl"
  }],
  "sources": [{
    "name": "World Bank and OECD",
    "web": "http://data.worldbank.org/indicator/NY.GDP.MKTP.CD"
  }],
  "contributors": [{
    "name": "Joe Bloggs",
    "email": "joe@bloggs.com",
    "web": "http://www.bloggs.com"
  }],
  "maintainers": [{
    # like contributors
  }],
  "publishers": [{
    # like contributors
  }],
  "dependencies": {
    "data-package-name": ">=1.0"
  },
  "resources": [
    {
    }
  ]
}
```
?
I would imagine thinking things like:
- versions? `datapackage_version`? What's the `-beta` part?
- `contributors`, `maintainers`, `publishers`. Why distinguish `contributors`, `maintainers`, `publishers` anyway?
- `keywords`? Anything?
- `name`, `title`, and `description`? What will go where?

The latter adds:
- What's this `@context` thing?

IMO, these would involve looking up the spec and understanding how the format works. I care a lot about readability (I originally had picked YAML for datadex), but I claim readability for new users is not affected significantly here. :)
@rgrp wrote:
On point A - what MUST be done to support JSON-LD, the fact that json-ld support would mean a MUST for @id and @context is a concern IMO. This is significant conceptual complexity for most people (e.g. doesn't the id need to be to a valid RDF class?). This is where I'd like to allow but not require that additional complexity.
@id is not required for a valid JSON-LD document. Also note that you can alias "@id" to something less strange looking, like "id" or "url", for instance. The ID doesn't need to be a valid RDF class. The only thing that's truly required to transform a JSON document to a JSON-LD document is one line - @context. None of your users need to be burdened w/ RDF or Linked Data concepts unless they want to be. Just my $0.02. :)
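For illustration, the aliasing @msporny describes could look like this (a sketch with hypothetical term mappings, not the actual datapackage context):

```python
import json

# Hypothetical context aliasing "@id" to the friendlier "id": a JSON-LD
# processor reading this context treats "id" as the @id keyword, so the
# document body needs no @-prefixed keys at all.
doc = {
    "@context": {
        "id": "@id",                       # alias the keyword
        "name": "http://schema.org/name",  # map a plain property
    },
    "id": "http://example.org/my-dataset",
    "name": "my dataset",
}

# To everything else, it's still ordinary JSON.
print(json.dumps(doc, indent=2))
```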
Weighing in briefly after being directed to this thread by @maxogden. I am not currently developing any tools, but rather looking for forward-thinking best practices around metadata for datasets I'm working with a city to help publish, so in that sense I am your end user.
From a government point of view: many governmental entities at a similar level (e.g. cities, transit agencies, school districts) are dealing with similar but very differently structured data - often in contexts that care very little about schemas or data re-use, and in strange, vendor-specific formats. In order to do any sort of comparative analysis or re-use open source tools, it is very important to be able to create ad-hoc schemas. Metadata formats which make this easier by including linked data concepts are a huge step forward.
Pertinent to this thread: given what @msporny said about being able to alias `@id`, and given the cognitive overhead of `datapackage_version` which @jbenet mentioned, would it be possible to use `@context` to also indicate the version number of the datapackage spec? E.g.:
```json
{
  "@context": "http://okfn.org/datapackage-context.jsonld#0.1.1",
  "id": "http://dathub.org/my-dataset",
  "title": "my dataset",
  "version": "1b76fa0893628af6c72d7fa7a6c10f8e7101c31c"
}
```
In my example, I'm also using a hash as a datapackage version rather than a semver, since for frequently changing, properly versioned data, it is unfeasible to have a manually incremented version number.
From a government point of view: many governmental entities at a similar level (e.g. cities, transit agencies, school districts) are dealing with similar but very differently structured data - often in contexts that care very little about schemas or data re-use, and in strange, vendor-specific formats. In order to do any sort of comparative analysis or re-use open source tools, it is very important to be able to create ad-hoc schemas. Metadata formats which make this easier by including linked data concepts are a huge step forward.
:+1: Thank you. I will quote this in the future. :)
would it be possible to use @context to also indicate the version number of the datapackage spec?
Yeah, absolutely. It's any URL, so you can embed a version number in the URL and thus identify a different `@context`. And, great point. We should establish a culture of doing that. I don't think we should require it, as I can imagine cases where it would be more problematic than ideal (not to mention how hard and annoying it would be to impose a version scheme on others).
"http://okfn.org/datapackage-context.jsonld#0.1.1"
I believe JSON-LD can extract the right `@context` from a `#<fragment>`, though I'm not 100% sure. @msporny will know better. If not, embed it in the path:

```
"http://okfn.org/datapackage-context@<version>.jsonld"
"http://okfn.org/datapackage-context.<version>.jsonld"
"http://okfn.org/datapackage-context/<version>.jsonld"
"http://okfn.org/<version>/datapackage-context.jsonld"
```
I'm also using a hash as a datapackage version rather than a semver, since for frequently changing, properly versioned data, it is unfeasible to have a manually incremented version number.
:+1: hash versions ftw. What are you building? And I encourage you to allow tagging of versions. The most right thing I've seen is to have hashes (content-addressing) identify versions, and allow human-readable tags/symlinks (yay git).
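The hashes-plus-tags idea can be sketched in a few lines (the data and tag names are illustrative; any content hash would do):

```python
import hashlib

# Content-addressed versions: the version IS the hash of the data,
# and human-readable tags are just pointers to hashes (like git).
def content_version(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

snapshot = b"lat,lon,docks\n35.04,-85.31,12\n"
v = content_version(snapshot)

tags = {"latest": v, "v1.0": v}  # movable, human-readable pointers

# Same bytes always yield the same version; no manual incrementing.
print(content_version(snapshot) == v)
```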
@rgrp thoughts on all this? Can we move fwd with `@context`, or do you still think it is inflexible? Happy to discuss it more / make the case through a higher throughput medium (+hangout?).
@sballesteros if we have `@context` here, does that satisfy your constraints? (Given that your own `@context` could alias the names you're currently using in `package.jsonld` to those used in the standard `@context`.)
@jden great input.
@jden @jbenet re `datapackage_version`: I actually think we should deprecate this a bit further (it's not strictly required but, frankly, I think we should remove it completely). People rarely add it and I'm doubtful it would be reliably maintained, in which case its value to consumers rapidly falls towards zero. (I was sort of doubtful when it was first added, but there were strong arguments in favour by others at the time.)
Re the general version field, I note that semver allows using hashes as a sort of extra, e.g. `1.0.0-beta+5114f85...`
However, I do wonder about using the version field at all if you are using full version control for the data - I imagined the version field being more like the version field for software packages, where its increment means something substantial (but where you can get individual revisions if you want from the version control system - cf the node.js package.json, where dependencies can refer either to package versions or to specific revisions for git repos).
re datapackage_version I actually think we should deprecate this a bit further (its not strictly required but I think, frankyl, we remove completely).
I agree, let's remove `datapackage_version` and put a version number in the URL of our `@context`. That's the LD way of doing it. FWIW, having a version is a good thing, and this way (in the URL) we get seamless enforcement without the additional complexity of an optional field. And it's pretty good: the version isn't just a number to look up somewhere; it points directly to a different file. :)
However, I do wonder about using version field at all
Having versions in package managers/registries is really useful. Let's not remove this. The package manager websites want to show meaningful descriptions of important version changes (semver). Users can understand the difference between `1.0.3` and `2.6.4` (one's newer) and conclude, usually correctly, which one is better. Git, which is full version control, makes extensive use of tags/branches (which are both just named version pointers). Hence I recommended to @jden to allow tagging :).
semver
By the way-- i'm not sure if you came up with something similar, but I put a tiny bit of thought into making a data-semver https://github.com/jbenet/data/issues/13 which might be useful. Clearly expressing what constitutes a MAJOR.MINOR.PATCH version change in the data context will help avoid confusion for people working with data that don't understand the subtleties of code semver.
@rgrp can we go fwd with `@context`? (Lmk if you need more time for consideration -- just wanting to get closer to done with this, as we'll be using data packages soon and would like to have things resolved before that happens :) ).
On the @context question: let me reiterate that I'm a strong +1 on allowing this in (even encouraging it), but I am still concerned about making it a MUST. Strictly, we only have one required field at the moment (`name`).
Also do be aware that `@context` on its own buys you rather little. If you really want to benefit from the rich schema/ontology support of RDF, you are going to want to integrate a lot more stuff (e.g. into the type fields of the resources). Again, I think this is great if you can do it, since you get much richer info - and data package has been designed so you can do this progressive enhancement really easily (just add the `@type` to your resource schema) - but I don't think it should be required for everyone.
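As a sketch of that progressive enhancement (the type URL is illustrative), a resource entry stays usable with or without the extra key:

```python
# A plain datapackage resource entry...
resource = {
    "name": "crimes",
    "path": "data/crimes.csv",
}

# ...progressively enhanced with a JSON-LD @type (illustrative URL);
# RDF-unaware consumers can simply ignore the extra key.
enhanced = {**resource, "@type": "http://schema.org/Dataset"}

print(sorted(enhanced))
```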
@jbenet to be clear, I wasn't suggesting removing `version` - I was saying I wasn't sure about using it for the sha hash of a changeset (since, as @jden mentions, that changes so much). I think `version` is super-useful and isn't going anywhere. As you say, the primary use for `version` (IMO) is more like the tags one has in git.
Also note I wrote my previous comment before I'd read your response. My suggested approach at present is that we add `@context` to the spec with a clear description of use, but that we don't make it a MUST. If you were able to draft a description for the `@context` field to include, that would be great, and we could then review and slot it in the appropriate place.
I believe JSON-LD can extract the right @context from a #, though not 100% sure. @msporny will know better. If not, embed it in the path:
No, JSON-LD will not extract the "right" context from a #fragment :). We considered that option and felt that it adds unnecessary complexity (when a simpler alternative would solve the same problem). Just do this if you want to version the context:
"@context": "http://okfn.org/datapackage-context/v1.jsonld"
You are probably going to want to use a URL redirecting service so that your developers don't see breaking changes if okfn.org ever goes away. For example, use https://w3id.org/ and make this your context URL:
https://w3id.org/datapackage/v1
This does three things:
1) It decouples the hosting location of the actual file from the identifier that people type out to use the context. So, if you decide to change the hosting location from okfn.org to some other hosting provider, none of the programs using the context are affected.
2) It gives people an easy-to-remember URL for the JSON-LD Context.
3) It provides a hint to clients as to the version of the vocabulary you're using.
I can add it to the site in less than a minute if you want (or you can submit a pull request). w3id.org is backed by multiple companies and is designed to be around for 50+ years. You can learn more about it by going here: https://w3id.org/
(edit: fixed UTF-8 BOM - no idea how that got in there)
Somehow the w3id.org homepage link at the end of https://github.com/dataprotocols/dataprotocols/issues/110#issuecomment-41914984 is broken for me due to a utf8 bom that's crept in? Source code shows it as https://w3id.org/%EF%BB%BF. Strange. https://w3id.org/ works.
@jbenet @rgrp Here's an example dataset I'm building: https://github.com/jden/data-bike-chattanooga-docks
Some thoughts from the experience (albeit tangential to this thread):
0) all of this metadata was created by hand, without tooling. What I've filled out is about as far as I got before moving on. I am still interested in creating schemas to describe the csv and geojson representations of the data, but haven't looked into it yet.
1) it seems silly having both package.json and package.jsonld at the root level of the directory. I'm torn between whether I'd prefer npm to be `package.jsonld`-aware and just parse `package.jsonld`, or to have some other place to put `package.json`. From a package user experience point of view, I really want someone to be able to rebuild my data by `git clone`-ing, `npm install`-ing, and `npm start`-ing.
2) How would I indicate that this package has two representations of the same resource (`data.csv` and `data.geojson`), as opposed to, for example, two separate-but-related tables? From a REST background, my inclination would be to do something like
"resources": [
{
"name": "data",
"mediatype": "text/csv",
"path": "data.csv"
},
{
"name": "data",
"mediatype": "application/json",
"path": "data.geojson"
}
]
3) it wasn't clear to me how to specify an appropriate license for the code part in the `scripts/` directory vs the package as a whole. I ended up describing it in plaintext in the readme, putting the code license (ISC) in the `package.json`, and the datapackage license (PDDL) in the `package.jsonld`.
@jden
0) I agree - I've largely hand-created these too, though I've started using `dpm init` to bootstrap the file, as it auto-extracts table info
1) we debated having package.json (for node) and datapackage.json for datapackage stuff and ultimately decided you do want them different - see #73 for a bit more on this
2) I think your suggestion seems good and is how we've done `datapackage.json` resource entries so far
3) I think that is correct and how I'd go with things :-)
@jden
Here's an example dataset I'm building
Cool!
I am still interested in creating schemas to describe the csv and geojson representations of the data, but haven't looked into it yet.
If you give me a couple of weeks, Transformer (repo) will help you do this really easily.
1) it seems silly having both package.json and package.jsonld at the root level of the directory.
This will be the case as long as different registries do not agree on their structures. If you published this to rubygems too, you'd also have a `.gemspec`.
I'm torn between whether I'd prefer npm to be `package.jsonld` aware and just parse `package.jsonld`
We could open up a discussion about getting to this. Frankly, now that JSON-LD exists, there's no reason we can't have the same package.jsonld spec for every package manager out there, and use different `@context`s (thank you, @msporny!!). Developers are free to adhere to whatever `@context` they want (and thus whatever key/val structure), and yet would be compatible across the board, if mappings between the properties of both contexts exist. :heart: data integration!
But, don't expect this to happen for years. :)
Actually... we might even be able to be `package.json` compatible... we'd need a special `@context` (which npm will just ignore) that gives us the mappings from the npm keys to our `package.jsonld` keys. Hm! (thank you @msporny !!!! in one `@context` swoop you fixed so much).
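A sketch of what such an npm-compatibility mapping might look like (all term URLs here are hypothetical, not an agreed vocabulary): npm-style consumers read the familiar keys and skip `@context`, while a JSON-LD processor uses it to map npm's field names onto shared terms.

```python
# Hypothetical @context mapping npm's package.json keys onto a shared
# vocabulary; npm itself would ignore the "@context" key entirely.
npm_context = {
    "name": "http://schema.org/name",
    "description": "http://schema.org/description",
    "version": "http://schema.org/version",
}

npm_pkg = {
    "@context": npm_context,
    "name": "my-module",
    "version": "1.0.0",
    "description": "an ordinary npm package",
}

# An npm-style consumer just reads the familiar keys and skips "@context".
plain_view = {k: v for k, v in npm_pkg.items() if not k.startswith("@")}
print(plain_view["name"])
```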
2) How would I indicate that this package has two representations of the same resource
In my biased world view, I'd include only one and use transformer to generate the second representation with a makefile. (or have both, but still have the transform) Something like:
```make
data.geojson: data.csv
	cat data.csv | transform csv my-data-schema geojson > data.geojson
```
Note: this doesn't work yet. It will soon :)
On indicating this, your same `name` with different `mediatype` (or something like it) sgtm. And I would definitely put it in the Readme too.
3) it wasn't clear to me how to specify an appropriate license for the code part in the scripts/ directory vs the package as a whole. I ended up describing it in plaintext in the readme, putting the code license (ISC) in the package.json, and the datapackage license (PDDL) in the package.jsonld.
I think this (ISC in `package.json#license` and PDDL in `package.jsonld#license`) is precisely the right thing to do.
For code, it's common to add a LICENSE file in packages. We could establish a convention of putting the various licenses into the same file, or perhaps have two: `LICENSE` as usual for code, and `DATA-LICENSE`. (Personally, I think it's ugly to have these files and I never include them, because I think the license in package.json / Readme is enough for most modules I write. That said, if something gets really popular, and lots of people start using it to make products, it becomes more important to be legally safe than to have a clean directory. :) )
@rgrp
1) we debated having package.json (for node) and datapackage.json for datapackage stuff and ultimately decided you do want them different
does this change in light of the comments I made above, re being directly `package.json` compatible? I haven't given it full thought. I agree with your comments here, and also think it fine to have both `package.json` and `datapackage.jsonld` in both.
Actually, the `.jsonld` extension-- though nice-- may add much more trouble than it's worth yet, given that many tools know to parse `.json` as JSON and don't understand `.jsonld` (node's require, for example). Thoughts, @sballesteros @msporny? Is there a strong reason (other than indicating to humans that this is JSON-LD) to use `.jsonld` over `.json`? JSON-LD was designed as an upgrade to (and fully compatible with) JSON, so maybe we should just use `.json`? @rgrp, you probably prefer this, no?
@rgrp
Also do be aware that @context on its own buys you rather little. If you really want to benefit from the rich schema/ontology support of RDF you are going to want to integrate a lot more stuff (e.g. into the type fields of the resources).
Not necessarily? As I understand, we can remap `type` -> `@type` in the context file, and try to use the types that are currently there for richer stuff. We get this for free without having to change anything, thanks to how JSON-LD works. Though, not sure whether people's use of `type` is well defined or that useful.
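To illustrate the remapping: aliasing a term to `@type` is standard JSON-LD keyword aliasing, and it lives entirely in the context, so no package file needs editing. The resource entry below is invented, and plain Python again stands in for a real JSON-LD processor:

```python
# Alias the existing "type" key to JSON-LD's "@type" keyword,
# entirely in the context -- package files stay untouched.
context = {"type": "@type"}

# Invented resource entry, roughly datapackage.json-shaped.
resource = {"type": "Dataset", "path": "data.csv"}

def apply_aliases(doc, ctx):
    """Rename keys according to the context's term mappings."""
    return {ctx.get(k, k): v for k, v in doc.items()}

print(apply_aliases(resource, context))
```

If the meaning of a key later changes, only the context file is re-mapped; every published package keeps working as-is.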
I want to get this finished soon, so let's settle our thoughts on `@context`, so we can draft changes and move fwd. :D
I'm a strong +1 on allowing this in (even encouraging it) but I am still concerned about making it a MUST. Strictly we only have one required field at the moment (name).
I think it's really important to make the move to JSON-LD. And this IMO makes it a MUST (I actually think it's more important than `name` itself; I can go into why `name` itself could become non-MUST, but prob not helpful here / not a good idea :) ).
I will definitely require it in any registries I write for data packages. Again, tools like `dpm init` help. AND, you can have `dpm publish` (or the registries themselves) insert it automatically on publishing. That way, users never have to worry about it at all! It just happens for them.
I'm happy to help in upgrading all existing packages: scripts to upgrade, plus crawl CKAN and add `@context` to everything (would we need to email people to ask for permission? not clear on the ToS of CKAN).
If you're set on not making it a MUST, then I propose we keep `data-packages.json` as is. I can set up a fork of the Data Packages site with a Linked Data Packages spec, one that tracks the original spec to the T, except for the MUST on the `@context`. Apologies, I don't mean to be uncompromising. This simply is a very important step forward to take, not just for data packages, but for the entire web itself.
If you were able to draft a description for the @context field to include that would be great and we could then review and slot it in the appropriate place.
How's this as a first draft?
`@context` (required) - URL of the Data Packages JSON-LD context definition. This MUST be a valid URL to a Data Packages compatible JSON-LD context. The `@context` SHOULD just be the URL of the most recent Data Packages context file: https://w3id.org/datapackage/v1. The `@context` MAY be the URL to another context file, but it MUST be compatible with the Data Packages context: there MUST be a relation between the properties outlined in Data Packages and the equivalent properties outlined in your own context.
The last line is super awkward. Does it even make sense?
@rgrp @msporny please correct any nonsense I might have spewed! :)
@jbenet Most of what you say above makes perfect sense.
re: .jsonld extension
You don't need to use a .jsonld extension; we just provided that for those that want to make the MIME type of the file clear using the file extension. A JSON-LD processor will process a file with a .json extension just fine. The only thing that's required is that the file contains a "@context" key.
The @context MAY be the URL to another context file, but it MUST be compatible with the Data Packages context: there MUST be relation between the properties outlined in Data Packages and the equivalent properties outlined in your own context.
Another option could be:
The @context MAY use a URL to an alternate context file. The alternate context file MUST be a superset of the Data Packages context; that is, every term mapping in the Data Packages context must exist as a term mapping in the alternate context file.
The @context MAY use a URL to an alternate context file. The alternate context file MUST be a superset of the Data Packages context; that is, every term mapping in the Data Packages context must exist as a term mapping in the alternate context file.
Yeah, much better, thanks!
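That "superset" requirement can even be checked mechanically. As a sketch (treating a context as a flat term-to-IRI dict, and ignoring JSON-LD's remote-context and array forms, so this is far from a full conformance test), a checker might look like:

```python
def is_superset_context(alternate, base):
    """True if every term mapping in `base` appears unchanged in `alternate`."""
    return all(alternate.get(term) == iri for term, iri in base.items())

# Invented example contexts for illustration.
base = {"name": "http://example.org/dp#name"}
alternate = {
    "name": "http://example.org/dp#name",    # same mapping as base
    "extra": "http://example.org/dp#extra",  # extra terms are allowed
}

print(is_superset_context(alternate, base))
print(is_superset_context({}, base))  # a context missing base terms fails
```

A registry could run such a check at publish time before accepting an alternate `@context` URL.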
@jbenet why can't you move ahead with using datapackage.json and the data package spec as is and just add @context to all the data packages you create? My concern here is that this is a major addition and we've only had this thread and little exposure to the wider community. As I've explained I'm not averse to the RDF stuff (and have done it heavily myself) but I'm concerned about making this a MUST without some period of evidence that this delivers benefits over costs. One thing we could do here is to take this out to the wider community a bit more. Also what about having a BoF about this at CSVConf in July in Berlin where we could meet face to face and hammer this out more with a bit of real-world experimentation under our belt ...
@rgrp as I see it, `@context` is the single required extensibility point that enables all of the other magic sauce. Open data is about reuse and interoperability in an unambiguous, machine-readable and -usable manner. Web browsers were able to get away with loose doctypes only in the face of massive installed user bases and a primarily human reader audience.
I understand your point about creating as little friction as possible for data set publishers. In practice, how many unique datasets is a typical individual author going to be producing? My guess is that the answer is "few enough to where they would be dependent on looking at documentation or using tooling every time".
Requiring `@context` is a breaking change, but it's a wholly automatable upgrade path, and I still contend that it's negligible overhead to dataset publishers. In terms of possible benefits, they are many, but even if they were limited to simply what has already been demonstrated by @jbenet in transformer, it would be worth it. Of course, adoption of json-ld will enable many other powerful applications, like analysis, aggregation, indexing, search, related content discovery, etc.
Put another way, package management is about more than dependency resolution and file/resource transfer. It's about making it easy to re-use existing components to create novel, more valuable combinations.
@jden since I think we actually all agree on a lot here let me spell out the areas where I think things may not be clear or we don't have total agreement:
A. We're all happy for JSON-LD to be supported by Data Packages (in fact it does already) - rather, the question is whether we make it mandatory
B. What does JSON-LD support entail? It isn't just about having an @context: on its own my guess is that that is fairly useless and that to really get benefits from the linked-data side of things you'll all need to include quite a bit more (e.g. @type, schema namespaces etc). This is important because it means that if you don't do that other stuff @context will just be overhead and if you don't want it just to be overhead you'll have to do a lot more.
C. Transformers. It's worth working this through in some detail because I think it highlights a potential misconception. I'm a big fan of data transformers of various kinds and we've written some (from the basic http://okfnlabs.org/transformer/ to stuff for generic tabular transformation - cf https://github.com/okfn/datapipes etc). My concern here in the discussion (and it's illustrative of general discussion re linked data stuff) is how much a particular schema framework buys you. I've used RDF and json-ld quite a bit and I think they're great but I think it's easy to over-estimate how much they - in themselves - buy you compared to all the work to define schemas and get them broadly adopted and build the tooling around it. RDF-land is absolutely full of amazing ontologies and data transformation projects but they haven't led (yet!) to the amazing web of data and they often are pretty complex (and therefore hard for others to adopt). My point here isn't that I think we want to prevent those efforts but just that we should be aware that the real work in e.g. transform stuff isn't solved by adopting JSON-LD - you then need to establish a (widely-adopted) schema structure plus a bunch of tooling (that stuff needs to beat out just using python or javascript - for your transform system to get adopted it has to do better than writing my own scripts and this turns out to actually be pretty hard to do - as scripting is pretty powerful and libraries there are pretty awesome!).
This is a more general point which I can make a bit more succinctly:
It's easy to think that simply adopting a given schema-structure (e.g. JSON-LD) will automatically turn into a desired outcome (e.g. amazing query abilities, automated transform of data etc) when in fact much more is needed (much of it very hard).
Most specifically, schema stuff fundamentally depends on adoption, a complex social process :-) People (including me) used to say: if only everyone would get stuff into RDF we'd get all of these incredible abilities to do X, Y, Z.
But a) most people found it tough to get stuff into RDF (because schemas etc are quite complex once you start looking at them), b) people can disagree on the right schema and it's easy to make up new ones, and c) tooling around RDF was sometimes non-optimal and there were often some pretty awesome tools for doing things other (sometimes simpler) ways.
I mention this because you say:
Of course, adoption of json-ld will enable many other powerful applications, like analysis, aggregation, indexing, search, related content discovery, etc.
Is this really true? In some sense it is because "enable" is pretty non-committal :-) But I think you mean something stronger as in "If we do this [use JSON-LD], these other things will either follow directly or become much easier". I don't think that's true - you'll need to do a lot of other stuff - little of it explicitly dependent on JSON-LD.
Again let me reiterate I'm totally +1 on encouraging JSON-LD support but having been through a couple of rounds of linked data stuff before I'd like to be cautious in enforcing adoption on others. The reasons are simple:
In summary: we want to build the absolute "minimum viable product" and then iterate based on the new evidence we have of needs and what works.
Thanks @jden !
Hey @rgrp, we seem to be misunderstanding each other! D:
A. We're all happy for JSON-LD to be supported by Data Packages (in fact it does already) - rather, the question is whether we make it mandatory
Yep, we're all clear on this. :)
B. What does JSON-LD support entail? It isn't just about having an @context: on its own my guess is that that is fairly useless and that to really get benefits from the linked-data side of things you'll all need to include quite a bit more (e.g. @type, schema namespaces etc). This is important because it means that if you don't do that other stuff @context will just be overhead and if you don't want it just to be overhead you'll have to do a lot more.
This assessment is incorrect :( -- `@context` will not just be overhead if it exists on its own. As @msporny described above, you do not need to include all the other stuff to already leverage the power of the schemas. That is precisely the point I'm trying to convey. Just using the `@context`, you can optionally enhance the semantic structure of your data in the context itself, using the exact same keys and values you are already using. You can assign `@type`s and more directly in the context file, without modifying the package file again. If the meaning of certain key/values changes, you can re-map that directly in the context file, without changing any of the package files. :D
C. ... My point here isn't that I think we want to prevent those efforts but just that we should be aware that the real work in e.g. transform stuff isn't solved by adopting JSON-LD - you then need to establish a (widely-adopted) schema structure plus a bunch of tooling (that stuff needs to beat out just using python or javascript - for your transform system to get adopted it has to do better than writing my own scripts and this turns out to actually be pretty hard to do - as scripting is pretty powerful and libraries there are pretty awesome!).
Yep! very right. we're very much on the same page re: transforms. I'm not arguing from magical semantic fantasy land.
The transformer stuff I'm working on is a library that will require lots of work to make individual conversions work. Its selling points are (a) by adding a bit of typing framework, one can do conversion resolution and simplify complex data transforms; and (b) using npm buys you a sophisticated module framework (semver, easy installing, dependency management, portability, etc), and the benefit of wrapping existing tools in npm. See http://transform.datadex.io
I've seen that even small additional burdens can affect people's use so keeping to the simplest we can is a good thing
Agreed on keeping it the simplest. Making registry tools insert it on publish burdens users very little, if at all. Most probably won't realize or ever have to care. This isn't an increase in complexity for end users.
More broadly, I definitely understand your frustration with the semantic web world. It tends to be hyper-idealistic (hah), and out of sync with the real world. This is why Google and Facebook created schema.org, the knowledge graph, and open graph. But JSON-LD is not the complex, prescriptive RDF world. It is a very pragmatic way to sprinkle hypermedia references around APIs as they exist today, with the ability to add semantic meaning in tiny bits. Who knows what will finally bring the Linked Data world about, but it seems that JSON-LD : Linked Data :: HTML/HTTP : Hypertext/Xanadu.
If everyone is using full JSON-LD then we have a great argument for upgrading the spec
Sure, let's go with that. :) Frankly, I've little interest left in finding resolution.
@jbenet great, it sounds like we have provisional agreement for near-term in that JSON-LD @context is added to spec but not made mandatory. (Also let me emphasize that I'm pretty aware of the general excellence of JSON-LD and have used extensively in a couple of projects!).
So to summarize: add `@context` to the base spec in a "MAY" form. Would welcome your suggestions here on some text and then we can get it in ...
For a thoughtful discussion that aligns with some of the points I'm making here see http://berjon.com/blog/2013/06/linked-data.html
Proposal: @context will be added as a MAY item in the next iteration of the data package spec.
hi @rgrp @jbenet what is the current status of this "MAY" item in the next spec? I also wonder if you guys are lining up with the DCAT & PROV initiatives from W3C. DCAT and PROV address a similar use case: they introduce a spec for metadata to describe datasets in RDF, which can be encoded as JSON-LD easily.
@pvgenuchten not sure-- @rgrp ?
@jbenet @pvgenuchten no-one commented on the proposal so nothing happened :-) Generally I want to see a fair number of comments on a proposal to put a change in.
@pvgenuchten @jbenet if someone could give me some sample language or submit a PR this can go in.
@rgrp sample language for the `datapackage.json`?
@jbenet yes plus language for the actual spec proposal
@rgrp can you give me a precise example of what you want, say for another field, like `version`?
@jbenet I'd be looking for relevant language to add to the spec to specify what property (or properties) to add, e.g. `@context`, and how they should be used.
@rgrp the directions are too vague. do you want a patch to http://dataprotocols.org/data-packages/ ?
how much of a connection to `@context` are you willing to have? proper JSON-LD makes this required. IIRC you disagree with forcing all data-packages to be linked data. So-- do you want it under the `SHOULD` part of required fields? (I still think the point is lost, even if a little better).
Also, as mentioned before, `@context` sort of obviates specs-- or rather, makes them machine-readable. So one way to go about this is to make the `@context` point to a JSON-LD context file with the data-package spec, allowing users to point somewhere else if they're using a different spec. But you probably don't want that-- you probably want them to still strictly follow the data-packages spec (otherwise non-LD parsers would break)-- so maybe make it so any other context url needs to be derived from yours (have every field covered)?
It's also easy to treat all data-packages without an `@context` as if they had one default `@context` url-- namely yours.
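A sketch of that backwards-compatible default (the URL reuses the draft https://w3id.org/datapackage/v1 proposed earlier in this thread; the helper name is invented):

```python
# Draft context URL proposed earlier in the thread.
DEFAULT_CONTEXT = "https://w3id.org/datapackage/v1"

def with_default_context(pkg):
    """Return a copy of pkg, adding the default @context if absent."""
    out = dict(pkg)
    out.setdefault("@context", DEFAULT_CONTEXT)
    return out

legacy = {"name": "old-package"}  # a pre-JSON-LD datapackage.json
print(with_default_context(legacy)["@context"])
```

With a rule like this in parsers and registries, every existing package becomes valid JSON-LD without any republishing.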
Note this also needs a proper JSON-LD context file representing the machine-readable version of this spec. Hmmm, I don't have enough time to take this whole thing on right now-- do you have anyone else on the team that cares about linked data to work with me on this?
@jbenet I think it would come under the "MAY" style fields. I'm not sure I understand enough here to get the complexity. No problem if you don't have time for this right now and we can wait to see if someone else volunteers to get this in.
I note that the link to package.jsonld in the issue description now leads to a 404 page - is `package.jsonld` still a thing?
@jbenet Would you be able to write out the data contained in a `datapackage.json` file as Turtle, so we can see what the URLs for each property would be?
Yep still a thing. We haven't had time to give it a new home yet. We have been merging it with the work done by the CSV on the web working group: see http://www.w3.org/standards/techs/csv#w3c_all. More soon.
We have been merging it with the work done by the CSV on the web working group
In that case, this issue should probably be closed in favour of a new "Use the W3C Metadata Vocabulary for Tabular Data" issue.
@rgrp - Is the plan to transition to the CSV-WG's JSON data package description, when it becomes a Recommendation?
@hubgit no, no intention to transition to that spec as it isn't Data Package. Whilst it was directly inspired by Data Package and Tabular Data Package, and I'm an editor, I think it has currently diverged a lot.
So, still useful to get JSON-LD compatibility in here and this issue should stay open.
Hey guys!
I'm the author of datadex, and now working with @maxogden on dat. As a package manager for datasets, datadex uses a package file to describe its datasets. Choosing between `data-package.json` and `package.jsonld` is hard: `data-package.json` has been around longer, has a well defined spec, and many packages use it. `package.jsonld` takes into account jsonld (which came out recently), and plugs into schema.org's schemas, for linked-data goodness. And at first glance, it seems most of what's in `data-package.json` is in `package.jsonld`.
It's confusing for adopters to have two different specs. I think we should reconcile these two standards and push forward with one. Thoughts? What work would it entail?
To ease transition costs, I'm happy to take on the convergence work if others are too busy. Also, I can write a tool to convert between current `data-package.json` and `package.jsonld` and whatever else.
Cheers!
cc @rgrp, @maxogden, @sballesteros