OpenRefine / OpenRefine

OpenRefine is a free, open source power tool for working with messy data and improving it
https://openrefine.org/
BSD 3-Clause "New" or "Revised" License

Tabular dataset metadata - http://www.w3.org/TR/tabular-data-model/ #1096

psychemedia opened this issue 8 years ago

psychemedia commented 8 years ago

If you're collecting enhancement requests, would it be useful to adopt the W3C Proposed Metadata Vocabulary for Tabular Data as part of import and export functionality?

(I'm not sure if things like http://dataprotocols.org/json-table-schema/ and http://okfnlabs.org/projects/goodtables/ are conformant with it?)

This could add metadata to a package file that identifies the columnar data types, as well as providing the option to export to, and import from, a CSV file packaged with a tabular metadata file.

danbri commented 8 years ago

Yes please :) /cc @rtroncy who also suggested this recently

tfmorris commented 8 years ago

Enhancement requests are always welcome.

OKFN (cc @rgrp) has suggested we support the Data Package standard (http://data.okfn.org/doc/data-package), which, as I understand it, the W3C work was based on before they diverged. They have funding from some foundation to help support this.

I know the W3C likes to make things all standardy and "improved" with their professional standards makers, which sometimes puts them in conflict with more pragmatic folks (cf. the WHATWG), but in this particular case I haven't dug into it to see whether OpenRefine should pick one to support, support both, or wait until the pissing contest is over and there's a clear winner.

I don't know the whole backstory, but when I read minutes of a meeting where the original author isn't in attendance and everyone else votes to dismiss all his suggestions, I wonder what's really going on behind the scenes. http://www.w3.org/2015/09/23-csvw-minutes.html#res

@psychemedia Have you seen anyone else consider supporting the W3C standard if it gets approved?

thadguidry commented 8 years ago

@danbri Hasn't Gregg already put lots of work into it and all the tests, to cover the use cases that sort of originally inspired it all? http://www.w3.org/2013/csvw/wiki/Use_Cases

danbri commented 8 years ago

The Data Package spec is fine enough for what it is, but it has no interest/capability for mapping from CSVs into entity/relationship graphs (RDF-style, Freebase/KG, schema.org etc etc.). My advice is that you'll find a bigger win with the W3C spec (but I was co-chair, so I have a natural preference).

Regarding "in attendance", unfortunately Rufus had many other calls on his time and did not habitually attend the CSVW WG calls throughout the life of the WG. It became clear very late in the process that many of the points of divergence from Data Package were not appealing to Rufus, and he filed a set of comments that the WG felt they largely couldn't adapt to. There isn't a lot of "behind the scenes" here from my p.o.v. - just editorial suggestions that arrived too late in the process - see https://lists.w3.org/Archives/Public/public-csv-wg/2015Aug/0003.html . That's life, getting complete agreement is hard - I hope there are no hard feelings on either side.

And yes, @gkellogg has done an amazing job with tests - see https://github.com/w3c/csvw/tree/gh-pages/tests

psychemedia commented 8 years ago

@tfmorris I'm not sure what the politicking is, but I agree that the OKF approach tends towards the pragmatic, while the W3C often leans towards supporting more formal semantic mappings, which I think are harder for tinkerers to engage with when trying to implement according to conventions rather than standards.

In the world of realpolitik, I'd guess that the W3C approach will have advocacy support from the ODI in its training sessions, whereas OKF-style training and products (CKAN) will use the Data Package approach.

A problem with both approaches is that they add complexity - CSV files are nice and convenient, and packages less so, particularly if they use a bespoke suffix rather than .zip for compressed bundles (which means many folk wouldn't know what to do with them...). For me, the packaging is more likely to be of use when it comes to putting together toolchains, so that once I fix a datatype I can get it to persist across different applications while still using CSV as the transport.

In this respect, I'd personally be looking for signs of other folk using the tools that I use (python/pandas, R, OpenRefine) adopting one or the other, and going with the one that works easiest....

I'd also be looking for APIs starting to offer support by means of conventionally annotated URLs providing access to metadata files given a CSV URL (so in the W3C example, http://example.org/tree-ops.csv mapping on to http://example.org/tree-ops.csv-metadata.json using the -metadata.json annotation).
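
For illustration, the kind of thing I'd expect to find at such a -metadata.json URL is a small CSVW descriptor along these lines (a rough sketch from memory of the spec's tree-ops example, so don't treat the details as authoritative):

    {
      "@context": "http://www.w3.org/ns/csvw",
      "url": "tree-ops.csv",
      "tableSchema": {
        "columns": [
          { "name": "GID", "titles": "GID", "datatype": "string" },
          { "name": "species", "titles": "Species", "datatype": "string" },
          { "name": "inventory_date", "titles": "Inventory Date", "datatype": "date" }
        ]
      }
    }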

Whilst this is also a 'wait and see' approach, I guess it demonstrates that I don't personally currently value either approach enough to have opted for one every time I start to play with a new dataset, aside from occasional test cases.

However, if tools at the start of my workflow - such as OpenRefine - were to start producing metadata, I may be tempted to start making more use of it.

TBH, the biggest win for me at the moment would not necessarily be for my own workflows, but more for basic teaching/education - showing more conceptually how columns can be data-typed, and why this makes sense when you start analysing data using code. (Novices wouldn't necessarily think to cast from one datatype to another...)

satra commented 8 years ago

A problem with both approaches is that they add complexity - CSV files are nice and convenient, and packages less so, particularly if they use a bespoke suffix rather than .zip for compressed bundles (which means many folk wouldn't know what to do with them...). For me, the packaging is more likely to be of use when it comes to putting together toolchains, so that once I fix a datatype I can get it to persist across different applications while still using CSV as the transport.

The problem with CSV files is precisely the missing metadata. I work in a field where our columns come from many different sources of information. Some of them are conceptually similar, some are the same but have different names, and some are different but have the same name. Whether one is sharing within a lab, project, or organization, if the data generators change it's essential that information exists to indicate what the columns refer to, and that this information is machine-accessible and not stored in a PDF somewhere.

I'll take a simple example of a column that arises in many of our CSVs: Age. Since age can be represented in many units and at many levels of quantization (days/weeks/months/years/5 years), someone somewhere has to understand what that Age column refers to. Note that the datatype is not changing; it's really about what is being represented that changes. So yes, CSVs are simple, but they require a significant amount of human understanding and agreement to operate on. By structuring information, we can potentially remove the human in the loop, and computational toolchains can benefit from structured information attached to the columns.
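
As a sketch of what I mean (purely illustrative, in the Table Schema style mentioned above; the concept URI is an example.org placeholder, not a real vocabulary term):

    {
      "fields": [
        {
          "name": "Age",
          "type": "integer",
          "description": "Age at time of scan, in whole years (illustrative annotation)",
          "rdfType": "https://example.org/terms/AgeInYears"
        }
      ]
    }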

If semantics can be attached in machine-accessible form, we will be in a much better position down the road in aggregating and recomputing data. In addition to the tabular data, there is also this:

http://www.w3.org/TR/vocab-data-cube/

And in some ways this maps quite nicely to computational tools as n-d arrays, of which a table is a subset (or a transformation).

danbri commented 8 years ago

See https://github.com/w3c/csvw/blob/gh-pages/experiments/historical-weather-observation-dataset/README.md and nearby for CSVW and Data Cube experiments from @6a6d74

rufuspollock commented 8 years ago

@psychemedia @tfmorris just letting you know I'm following; ping me if there are any questions.

danbri commented 6 years ago

Anyone still investigating this?

wetneb commented 6 years ago

@danbri not that I'm aware of, but there is some momentum around adding more project metadata in #1221. It would be great to see more progress on both of these issues!

danbri commented 6 years ago

Thanks. I was thinking that the rdf mappings part of CSVW -- i.e. https://www.w3.org/TR/2015/REC-csv2rdf-20151217/#example-events-listing -- might be interesting if people are using OpenRefine to bring data into Wikipedia, given that Wikipedia's datamodel is a variant on RDFish graphs.

wetneb commented 6 years ago

That's very interesting - I do see a lot of value for this in the context of Wikidata integration. This essentially means that we could create a Wikidata overlay model at import time, just based on the tabular metadata (if wikidata properties are provided).

danbri commented 6 years ago

Yeah, I think so. There's a horribly underdocumented js experiment here btw - it reads some CSVW metadata and the .csv itself, then injects the results into the HTML document. At Google we can even consume that. Demo uses schema.org but it could equally be Wikidata properties and types. @thadguidry and co have also been working on mappings between the two. Would be lovely to have an OpenRefine story around such things...

thadguidry commented 6 years ago

@danbri yes Dan. Don't worry, we will support CSVW metadata... just gotta get some more contributor help on things like that. But first things first... laying the groundwork for broader metadata support, and then pretty much anything can be done at importer and exporter time. It's always been in the 'plan' and I'll continue to push for this as well :) and this is the issue and story for that.

rufuspollock commented 6 years ago

@thadguidry the Data Package specs are also now v1.0 https://blog.okfn.org/2017/09/05/frictionless-data-v1-0/

There's a lot of tooling support including JS, Ruby, Python, R etc http://frictionlessdata.io/tools/.

Also goodtables (which is python) has developed a lot https://github.com/frictionlessdata/goodtables-py (and there is a partial JS implementation of this in https://github.com/frictionlessdata/tableschema-js).

/cc @pwalsh @callmealien

danbri commented 6 years ago

https://youtu.be/MsuXqf9wog0

thadguidry commented 6 years ago

Mailing list thread side discussion about whether we support CSV on the Web AND the Data Package family of specs: https://groups.google.com/d/topic/openrefine-dev/6UU_w98PcJs/discussion

Waiting to hear back from @danbri and @rufuspollock on that thread.

rufuspollock commented 6 years ago

@thadguidry i'm not on that mailing list atm so i'm commenting here:

I contributed to both specs in a significant way: in fact, the CSV on the Web work started out largely based on Table Schema and Data Packages.

Table Schema (and Data Package) are a lot simpler and support pretty much all the features i can imagine you wanting.

There is also extensive and mature tooling in a lot of languages with a lot of road testing: http://frictionlessdata.io/software/ - here are some specific examples: http://github.com/frictionlessdata/tableschema-js https://github.com/frictionlessdata/tableschema-py https://github.com/frictionlessdata/tableschema-java

We've also seen significant community adoption e.g. into pandas https://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.build_table_schema.html - you can read more at http://frictionlessdata.io/articles/

jackyq2015 commented 6 years ago

@rufuspollock Is there any online demo I can try? You mentioned you contributed to both. Just wondering what happened to the W3C one? Any comparison between the two, or with other standards? Thanks

rufuspollock commented 6 years ago

@jackyq2015 - you mean examples of the specs in action?

Some examples of tabular datasets as tabular data packages which include table schemas for the data files: http://datahub.io/docs/data-packages/tabular

You can also find many more examples at https://datahub.io/core or https://github.com/datasets (the github datasets are the raw form that get published to https://datahub.io/core)

Some discussion of comparison of the two: https://discuss.okfn.org/t/w3c-csv-for-the-web-how-does-it-relate-to-data-packages/1715/8

Let me know if you need more!

jackyq2015 commented 6 years ago

@rufuspollock I saw there is an "rdfType" in the "Table Schema spec". Just wondering if there is any further detail around this? As far as I know, CSVW covers RDF.

Another question is about the criteria for choosing between implementing the "Data Package spec" and the "Table Schema spec" for OpenRefine. Is it feasible to start with the "Table Schema spec" first and then move to the "Data Package spec"? How hard would it be?

Thanks!

wetneb commented 6 years ago

@jackyq2015 as I explained earlier I don't think there is any discussion to be had about "Data Package versus Table Schema". These two things just do not do the same thing at all, and it does not make any sense to compare them: it's like discussing "XML vs XSD" or "JSON vs JSON Schema"…

So:

  • OpenRefine's model should be updated to store metadata following the Table Schema specs
  • We need new importers and exporters following the Data Package specs

Is it any clearer?

thadguidry commented 6 years ago

I spoke to Jacky and the approach I think that makes sense is...

The 3 formats that we see in the wild currently used are Data Package, CKAN (simple JSON with 1 "meta" object and 1 "data" object), and CSVW.

  1. A new importer for metadata that can be based on our JSON importer (this can handle ANY JSON-based metadata format).
  2. A new importer/exporter for CSVW, since this supports the widest Linked Data efforts through the JSON-LD format with Schema.org semantics.
  3. (optional) A new importer/exporter for Data Package, if time allows or we feel the community is vocal about also wanting it.
  4. (optional) An XML-based metadata importer that can be based on our XML importer. On Data.gov and Data.gov.uk and others I didn't come across many remaining XML-based metadata sets... it looks like most are getting converted to JSON-based formats at a rapid pace.
  5. Storing the metadata should use the CSVW format to allow the widest range of metadata properties and Linked Data capabilities. This will benefit everyone - scientists, researchers, Wikidata, librarians, you name it - and allow OpenRefine to be a great one-shot power tool to pull in regular CSVs and annotate them. The annotation of columns (fields) can be done with a nice GUI later on, but for an immediate proof of concept we can just start with a raw input box for the JSON-LD format and use Gregg Kellogg's, Manu Sporny's, and Markus Lanthaler's work on the jsonld.js processor, which has an API.

All in favor?

wetneb commented 6 years ago

I am very confused - we don't seem to be talking about the same thing at all!

I don't think an importer for Data Packages should be based on the generic JSON importer - they don't do the same thing at all, and basically no code can be shared between the two.

rufuspollock commented 6 years ago

Another question is about the criteria for choosing between implementing the "Data Package spec" and the "Table Schema spec" for OpenRefine. Is it feasible to start with the "Table Schema spec" first and then move to the "Data Package spec"? How hard would it be?

As @wetneb says Data Package and Table Schema aren't really substitutes:

@jackyq2015 as I explained earlier I don't think there is any discussion to be had about "Data Package versus Table Schema". These two things just do not do the same thing at all, and it does not make any sense to compare them: it's like discussing "XML vs XSD" or "JSON vs JSON Schema"…

The Table Schema describes the table columns themselves.

Then, if you want it, you have a Tabular Data Resource http://frictionlessdata.io/specs/tabular-data-resource/ which describes a single tabular file and uses the Table Schema to describe the columns.

Finally, if you need it, you have the (Tabular) Data Package that describes an entire dataset made up of multiple tables.
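
To make the layering concrete, a datapackage.json nests these pieces roughly like this (a sketch with placeholder names and columns, not an official example):

    {
      "name": "tree-survey",
      "resources": [
        {
          "name": "tree-ops",
          "path": "tree-ops.csv",
          "profile": "tabular-data-resource",
          "schema": {
            "fields": [
              { "name": "GID", "type": "string" },
              { "name": "Species", "type": "string" },
              { "name": "Inventory Date", "type": "date" }
            ]
          }
        }
      ]
    }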

@wetneb puts this very well:

So:

  • OpenRefine's model should be updated to store metadata following the Table Schema specs
  • We need new importers and exporters following the Data Package specs

@rufuspollock I saw there is an "rdfType" in the "Table Schema spec". Just wondering if there is any further detail around this? As far as I know, CSVW covers RDF.

This allows you to tie an RDF type to a given column in the table schema. It's super simple to use. In my experience very few data folks use RDF on a regular basis, so I think it is relatively less important.

rufuspollock commented 6 years ago

@thadguidry

I spoke to Jacky and the approach I think that makes sense is...

As per @wetneb, I'm not sure I follow this approach. As he says:

I don't think an importer for Data Packages should be based on the generic JSON importer - they don't do the same thing at all, and basically no code can be shared between the two.


You also mention re the exporter:

new exporter for CSVW since this supports the widest Linked Data efforts through JSON-LD format with Schema.org semantics.

I note that CSVW and Data Package aren't especially compatible. So if you don't also export Table Schema / Data Package stuff, people won't be able to consume that.

Overall, the major benefit of Table Schema and Data Package is that they are simple yet relatively powerful -- and the simplicity means they are much easier to support and use (meaning more other consumers and tooling out there).

thadguidry commented 6 years ago

@wetneb For 1. I was talking about a different case that comes up concerning the use of the generic JSON importer we have... There are cases in the wild, which I have seen and demonstrated to Jacky, where metadata is found in freeform JSON records and XML (basically the datasets out there that are pre-2005). I showed him one last night. We'll want to support freeform JSON metadata mapping as well, but that can be a longer-term goal. And that freeform JSON metadata mapping can be done similarly to our JSON importer selector.

For CSVW and Data Packages support... yes, those are new importers. I have updated my comment above.

thadguidry commented 6 years ago

@rufuspollock the problem is that there's no good support for Linked Data within Data Package / Table Schema from what I see, i.e. only the rdfType.

jackyq2015 commented 6 years ago

@wetneb I am not trying to compare Table Schema and Data Package. I understand they are different things. My question is whether it makes sense to do the "table schema" first and then move to "data package", since it can be put in the "data package" later on. Also, Thad and I share the same concern about the limitation of the RDF support.

danbri commented 6 years ago

+1 for https://github.com/OpenRefine/OpenRefine/issues/1096#issuecomment-345722334 proposal from @thadguidry

rufuspollock commented 6 years ago

@thadguidry @jackyq2015 what RDF support do you need? Do your users regularly use RDF types in their work with OpenRefine, and how do they do that? That would help us get clearer on whether what is already there in Table Schema is sufficient.

I also note that Data Package / Data Resource etc are all extensible with your own properties if you want.

@jackyq2015

My question is whether it makes sense to do the "table schema" first and then move to "data package", since it can be put in the "data package" later on.

You can definitely do that, and it might make a lot of sense as an initial approach.

danbri commented 6 years ago

Note that anyone mapping into the Wikidata data model is in "RDF plus some stuff, minus some stuff" territory. It would be interesting to get more explicit requirements from the Wikidata side in terms of mapping in not just basic facts but qualifier and sourcing/provenance data too. While a simple CSVW RDF mapping like http://danbri.org/2016/PublicToilets/mapping.json can show how to turn tables into triples, I expect for using OpenRefine to source graph data for Wikidata we'll want to pass along per-factoid provenance too somehow. @thadguidry - have you any thoughts on this?

thadguidry commented 6 years ago

@danbri Per-factoid (individual provenance on an OpenRefine cell) seems way too heavy? OpenRefine works on mass edits against columnar operations. I think our team was hoping to only have to deal with provenance at the column level, as your PublicToilets example JSON shows (a column of cells that share the same metadata/provenance). For the qualifiers, yes, those could be stored also: they could be treated as Parent/Child column structures (grouping columns), which our data model doesn't support yet, but many datagrid/storage models support grouping columns and operations across them. I have worked through previous designs and have thoughts about that, I'm just not sure on the best approach yet. Columnar grouping definitely handles that case, but so does our Record mode in a way. There are even row grouping options, such as this example: http://mleibman.github.io/SlickGrid/examples/example-grouping.html

Antonin @wetneb had more thoughts on qualifiers, but my hunch is that columnar groups (sub-data for a column) might also help there. And then having a nice presentation for that in our UI - column folding, layered slide-out panels, etc. - all come to mind.

danbri commented 6 years ago

@thadguidry yes, in many cases provenance would be at a higher granularity. But I understand Wikidata at least in theory encourages per-fact sourcing references, e.g. https://www.wikidata.org/wiki/Q243 has two references for the Eiffel Tower's height. I didn't mean to suggest that every cell in the original table need have different provenance to pass along.

thadguidry commented 6 years ago

@danbri Righto. Yeah, I think just the Parent/Child column structuring (sub-data for a column) will work out well in the long term for this edit and mapping visualization need at individual cell levels. Our current Edit for a cell, which is a rather plain text box, will change to be much more feature-rich, with an optional docking panel rather than the current limiting popup dialog overlay.

If you haven't heard me say it before...I want OpenRefine to be the Photoshop of data cleaning tools. We're just not there yet :)

rufuspollock commented 6 years ago

@thadguidry

@rufuspollock the problem is that there's no good support for Linked Data within Data Package / Table Schema from what I see, i.e. only the rdfType.

What exactly are your requirements? Is this an internal modelling system, or for import/export (I assumed the latter)?

And who do you anticipate using OpenRefine for these use cases? Most data wranglers or data scientists I've worked with rarely use linked data annotations on their tabular data (and if they did, I'd guess it would be around the column semantics).

thadguidry commented 6 years ago

@rufuspollock (At import/export.) Right, and that's part of the problem: data sharing takes a hit because sometimes the data is not well understood! The idea is that later, they would have the ability to pull in metadata through our Wikidata reconciling process and apply it to columns or groups of cells to enrich their dataset. @wetneb can also speak about some of this. The process is described here: https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation

wetneb commented 6 years ago

@thadguidry I think this debate between CSVW and Data Packages only makes sense if we come up with concrete integration plans.

@danbri I've been working on schema alignment for Wikidata (a prototype is available in the wikidata-extension branch of this repository). As you know RDF isn't really a first-class citizen in Wikidata so I have taken the path of not reusing RDF schemas at all. There is currently no way to push some RDF triples directly to Wikidata, and even if there were such a mechanism you don't really want to ask users to do the RDF serialization manually (they know the Wikibase UI, not the RDF serialization specs).

As far as I can tell, the sort of RDF schemas CSVW can handle have a particular shape. Quoting https://www.w3.org/TR/2015/REC-csv2rdf-20151217/:

The CSV to RDF translation is limited to providing one statement, or triple, per column in the table

Virtual columns make it possible to extend that a bit, but this is still far from the level of generality that is allowed by the (unmaintained) RDF extension for OpenRefine, or the (fledgling) Wikidata extension. So I'm not sure how we can make these things interact. Given a CSVW, we can generate an RDF or Wikibase schema, but for the reverse direction it's less clear.
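
To illustrate the shape I mean, a CSVW metadata file along these lines (a rough sketch, not taken from the spec; URLs are example.org placeholders) maps each column to a single property, so each row yields at most one triple per column:

    {
      "@context": "http://www.w3.org/ns/csvw",
      "url": "events.csv",
      "tableSchema": {
        "aboutUrl": "http://example.org/event/{event_id}",
        "columns": [
          { "name": "event_id", "titles": "Event ID", "suppressOutput": true },
          { "name": "name", "titles": "Name", "propertyUrl": "schema:name" },
          { "name": "start_date", "titles": "Start date", "propertyUrl": "schema:startDate", "datatype": "date" }
        ]
      }
    }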

In theory, the "one column = one property" mantra would work quite well for the reconciliation API, as each column can be mapped to a property of the reconciliation service. But we need to step back and think about the scenarios in which this integration could work. Intuitively, the user would import in OpenRefine a CSV with CSVW metadata. OpenRefine would store the propertyUrls in its model. When opening the reconciliation dialog (for a reconciliation service compatible with the namespaces used in the CSV), the columns would be automatically matched against the target properties. There are multiple issues with that workflow:

So, I don't really see any compelling use case that would really motivate support for CSVW metadata in OpenRefine's model. I think it would make a lot of sense for extensions like the RDF extension or the Wikidata extension to provide a best-effort import/export of their schemas to CSVW metadata files, but that does not require any change in OpenRefine itself. I don't really see the case for CSVW import: if you have a CSVW at hand with nice metadata, then most likely you don't even need OpenRefine to clean it up: it is already clean and aligned.

So, if mapping columns to RDF properties is not that useful, what about validating columns against particular data types? We need to figure out how we want to integrate these type constraints in OR. One simple way to do that would be to let the user define these types for each column (via a new action in the menu for that column, for instance), choosing from a predefined set of types (integer, string matching a regex, and so on: I assume the specs of CSVW and Data Package roughly agree on this set of types). There could also be a feature to auto-guess the types of all columns (pretty much in the same way the type of a cell can be guessed from its content), provided either as an operation or as an import-time option. Then, we could provide a new facet that would single out the rows where the cell contents don't match the type of a particular column. This would help the user fix these issues via transformations, blanking, etc. Once this is done, the project could be exported as a Data Package / CSVW with the corresponding types reflected in the tabular metadata. Of course we could provide importers for Data Packages and CSVW. I suspect many existing importers could also be adapted to expose the data types they know of in column metadata.
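
As a sketch of what the exported column metadata could look like (shown here in Table Schema flavour, purely illustrative; CSVW has equivalent datatype annotations):

    {
      "fields": [
        { "name": "id", "type": "integer", "constraints": { "required": true, "unique": true } },
        { "name": "opened", "type": "date" },
        { "name": "postcode", "type": "string", "constraints": { "pattern": "^[0-9]{5}$" } }
      ]
    }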

So, the case for column datatypes seems a lot clearer to me - I believe it should be possible to provide a streamlined and transparent experience to users, with clear added value: format validation is arguably an important first step before alignment with linked data sources, so providing better core support for that is a real gain.

I think that kind of makes @rufuspollock's point for Data Packages, because that's exactly the type of metadata that they support. But I really insist: "CSVW vs Data Packages" is not the right question to ask. The right question is: what extra data do we want to store for each column in an OpenRefine project, and how is that going to be beneficial to the user. Once this is decided, we can add new importers / exporters and update the existing ones so that they play nicely with these new metadata fields. My answer to that question is: we should clearly store data types, and I don't see any seamless user story backing up the proposal of storing RDF mappings as proposed by CSVW. If you see one, please describe it!

pwalsh commented 6 years ago

But I really insist: "CSVW vs Data Packages" is not the right question to ask. The right question is: what extra data do we want to store for each column in an OpenRefine project, and how is that going to be beneficial to the user.

Bravo.

thadguidry commented 6 years ago

@wetneb What the community has been saying is that against any particular column...there is a need for

  1. Storing Data Types (Boolean, String, Number, etc.) - your case for Data Packages, though it is unclear whether it supports 2 very well.
  2. Storing Semantic Types (CreativeWork, Person, Event, PropertyValue, VideoGame, etc.) - our case for CSVW, which supports both 1 and 2.

Does the Data Package specification support both of the things our users want? I have yet to get a good answer.

But I'm flexible: if the community has suddenly said they don't care about 2, then I need names and numbers of those folks, just for posterity's sake, for when they come back to us and say, "oh sorry, yeah, we need that also now". :)

wetneb commented 6 years ago

@thadguidry my point is that it's not enough to just say that we want to add support for "storing semantic types". We can store a lot of things, but that will only make sense if that is part of a workflow. So you need to come up with particular use cases, describing why storing this sort of metadata will ease these use cases. When are these types going to be set? When are they going to be used? How will that semantic type interact with the reconciliation type that reconciled columns have? And so on.

By the way, both CSVW and Data Package can express RDF types for columns. As far as I can tell, the main thing that CSVW has and Data Package does not is mapping columns to RDF properties. So both formats can represent both data types and semantic types.

Again, can we please move the discussion away from this OKFN vs W3C battle and focus on discussing actual changes in OpenRefine? @thadguidry can you please describe how you would like to see semantic types associated with columns? Let's get a bit more concrete!

thadguidry commented 6 years ago

@wetneb Breathe, Antonin. :) There's no battle here. Not sure if you know, but @danbri, @rufuspollock, myself, and everyone else here are actually friends on the web. All discussion is equally important and OpenRefine values all opinions. Sorry, I thought the use case was clear from this discussion, the OP, and your initial thumbs-up reaction: knowing and understanding more about a dataset is important, and so is being able to share that knowledge.

The association of a Semantic Type to a column is just a simple mapping. (The details of how are where Jacky, you, and I will need to make a choice of format, but that format I think should be JSON-based, and both CSVW and Table Schema have us covered there.)

The Semantic Type can be set manually or discovered and presented automatically. Manually, by a user through an Add Metadata column option - Jacky supported that idea and I think you did also. Automatically, a few things can be done as well. Here's one idea that I felt was useful, based on other tools (like DBpedia's) that do something similar. After reconciling, a user can be presented with a dialog that shows perhaps 2-3 of the top Semantic Types by percentage, which can be based on the Wikidata "instance of" property https://www.wikidata.org/wiki/Property:P31, allowing the user to choose and apply the appropriate or most fitting one. Say you have a dataset with a bunch of Sports Governing bodies... and the auto-metadata dialog says that most of them in your OpenRefine column are also instances of nonprofit organization, or just simply organization. The user sees that in the dialog and can then apply the Semantic Type of "Organization", a superclass of nonprofit organization. Sometimes the discovery can even be reversed, where the user thought he had a generic dataset of Organizations and, lo and behold, he finds out through that dialog that 98% of them are actually Sports Governing bodies as well!

The Semantic Type can just be the Wikidata URL for the QID - for example, International Sport Governing Body: https://www.wikidata.org/wiki/Q1346006. Applying the "mapping" portion through a property can be done in both CSVW and Table Schema:

    {
      "fields": [
        {
          "name": "Organizations",
          "type": "string",
          "rdfType": "https://www.wikidata.org/wiki/Q1346006"
        }
        ...
      ]
    }

Also, discovery of an additional Schema.org-mapped Type could also be done through the Wikidata "equivalent class" property, as shown in this example of Organization https://www.wikidata.org/wiki/Q43229, where there are mappings to DBpedia, Schema.org, and W3C.

@rufuspollock I would hope that Table Schema and the parsers out there support multi-typing like CSVW does? Do you know?

    {
      "fields": [
        {
          "name": "Organizations",
          "type": "string",
          "rdfType": ["https://www.wikidata.org/wiki/Q1346006", "http://www.schema.org/Organization"]
        }
        ...
      ]
    }

As @danbri and @rufuspollock and I share the same concern, the importance of this issue #1096 is actually more about adding metadata and exporting it, to capture knowledge about a dataset that didn't exist before. And furthermore, about nicely importing that knowledge back into Wikidata; as all of us agreed already, "that would be an awesome thing to be able to do".

@wetneb which parts above do I need to flesh out more that might still be unclear, just let me know.

wetneb commented 6 years ago

Great! Yeah we are all friends, I know! Good to finally see something more concrete.

So, currently, OpenRefine already stores the type used to reconcile a column in the column metadata (together with the reconciliation statistics). This metadata is not exposed at all in the UI (and is not editable), but I have relied on that to implement the data extension API (the property suggestion feature is based on that).

So, the question is:

  • do we need a second field for the RDF type? Or can we reuse the existing reconciliation type?
  • would it make sense to make the current reconciliation type editable? I guess we could reuse the type detection heuristics of the reconciliation dialog and let the user set the type afterwards (from one of the proposed types or to a manual type)
  • does it make sense to store RDF types on columns that are not reconciled? As far as I can tell, the use cases you mention would not really apply to unreconciled columns.

So it is more about exposing this existing info better rather than creating a new field… Not that I'm entirely happy with the way this is currently stored (this is not exposed in ReconConfig but only in StandardReconConfig; that should be adapted, I think).

wetneb commented 4 years ago

Lowering priority, since we need to come up with concrete user stories of how this integration should work and in which situations it would be beneficial for users.

tfmorris commented 4 years ago

I'm not sure I followed all of the 2017 discussion, but in answer to:

So, the question is:

  • do we need a second field for the RDF type? Or can we reuse the existing reconciliation type?

Use the existing reconciliation type. It is the semantic type (from before there was a Semantic Web).

  • would it make sense to make the current reconciliation type editable? I guess we could reuse the type detection heuristics of the reconciliation dialog and let the user set the type afterwards (from one of the proposed types or to a manual type)

Normally the way I'd expect the user to signal this is by re-reconciling against a different type. Otherwise you risk the type and the reconciled entities getting out of sync. The one edge case is where the chosen entity doesn't have the type being reconciled, due to incomplete data at Wikidata (or other reconciliation data source). In this case you actually want to update the entity to apply the type, but we don't have a good way of signalling that.

  • does it make sense to store RDF types on columns that are not reconciled? As far as I can tell, the use cases you mention would not really apply to unreconciled columns.

One case where it makes sense is if the user used "Add column based on reconciled values", in which case it would make sense to store the semantic type: e.g. if you extend your Author column, which has been reconciled, with a Works Written column, the new column takes the type "Written Work".

Of course, the only reason that OpenRefine would implement support would be if folks actually needed it and were going to use it. With the hindsight of 4 1/2 years since its promulgation, how widely implemented is it? What are the major producers? Consumers? Traffic on the mailing list doesn't seem to indicate widespread adoption.