Open eocarragain opened 5 years ago
Note: we may need a more general use-case for how to express sub-file/variable level metadata. Some concrete non-tabular examples would be good though
Discussed this on Editor's call 2019-08-08, and agreed it would be good to use the schema.org flavour if possible, e.g. Dataspice, Psych-DS
The table below compares the Frictionless Data tabular data specs with the schema.org variableMeasured property. It also shows the additional fields that the psych-ds team have added on top of their use of variableMeasured.
table_schema | schema:variableMeasured | psych-ds |
---|---|---|
dialect | | |
name | name | schema:name |
description | description | schema:name |
title | alternateName | |
type | type | |
type>rdfType | propertyId | |
format | | |
constraints>required | | |
constraints>unique | | |
constraints>minLength | minValue | schema:minValue |
constraints>maxLength | maxValue | schema:maxValue |
constraints>minimum | ~minValue | |
constraints>maximum | ~maxValue | |
constraints>pattern | | |
constraints>enum | levels | |
missingValues | na/naValues | |
primaryKey | | |
foreignKeys | | |
~type>rdfType | unitCode | schema:unitCode |
~type>rdfType | unitText | schema:unitText |
derivation | | |
imputation | | |
Notes:
This sounds a little like: https://www.w3.org/TR/tabular-data-primer/#string-restriction Why not reuse it? EDIT: Oh, I see you listed it above, but it covers all the constraints nicely...
@dgarijo agreed "csvw" is probably the most complete rdf-friendly way to do this. It also has the benefit that Google seem to be adopting it in the dataset search. However, we received quite strong feedback at Open Repositories that CSVW was 'too complicated' for most researchers & coders to pick up and use easily.
There may be ways around this in terms of how we present it in the RO-Crate spec, i.e. just provide examples of the most common cases, more or less equivalent to table-schema?
EDIT: if we did this, the psych-ds community might be a good test group as they are clearly struggling with the fact that schema.org doesn't quite do what they need
I don't think you need to adopt all of it, just the parts that cover your use cases (as you point out). In PROV we had like 3 main concepts and 8 relationships among them and people still said it was complicated...
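To make the comparison table above a little more concrete, here is a rough Python sketch that converts a Table Schema field descriptor into a schema.org-style PropertyValue. It only handles a subset of the rows that the table lists with a counterpart (name, description, title, and the approximate minimum/maximum mapping); everything else is reported as unmapped rather than silently dropped. The function name `field_to_property_value` and the sample field are made up for illustration and are not part of either specification.

```python
# Hypothetical helper: maps a Frictionless Table Schema field descriptor to a
# schema.org PropertyValue, following a subset of the comparison table rows.
DIRECT = {"name": "name", "description": "description", "title": "alternateName"}
APPROXIMATE = {"minimum": "minValue", "maximum": "maxValue"}


def field_to_property_value(field):
    """Return (property_value, unmapped) for one Table Schema field."""
    pv = {"@type": "PropertyValue"}
    unmapped = {}
    for key, value in field.items():
        if key in DIRECT:
            pv[DIRECT[key]] = value
        elif key == "constraints":
            for c_key, c_value in value.items():
                if c_key in APPROXIMATE:
                    pv[APPROXIMATE[c_key]] = c_value
                else:
                    unmapped["constraints>" + c_key] = c_value
        else:
            # No schema.org counterpart handled by this sketch.
            unmapped[key] = value
    return pv, unmapped


# Invented sample field, loosely based on the wall_height example.
example_field = {
    "name": "wall_height",
    "title": "Wall height",
    "description": "The height of the wall in centimetres",
    "type": "number",
    "constraints": {"minimum": 0, "required": True},
}
pv, unmapped = field_to_property_value(example_field)
print(pv)
print(unmapped)
```

The `unmapped` dictionary is the interesting part: it shows which bits of a table_schema description would be lost when using only variableMeasured.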
Example of what the schema.org approach would look like in an RO-Crate context:
{
  "@context": "https://w3id.org/ro/crate/0.3-DRAFT/context",
  "@graph": [
    {
      "@id": "./",
      "@type": [
        "Dataset"
      ],
      "hasPart": [
        {
          "@id": "./table.csv"
        }
      ]
    },
    {
      "@id": "./table.csv",
      "@type": ["File", "Dataset"],
      "contentSize": "383766",
      "description": "A table capturing all my data",
      "variableMeasured": [
        {
          "@type": "PropertyValue",
          "unitText": "metres",
          "name": "wall_width",
          "description": "The width of the wall in metres"
        },
        {
          "@type": "PropertyValue",
          "unitCode": "CMT",
          "name": "wall_height",
          "description": "The height of the wall in centimetres"
        },
        {
          "@type": "PropertyValue",
          "name": "datetime",
          "description": "The date and time of the measurement"
        }
      ]
    }
  ]
}
Issue: in schema.org variableMeasured is only defined as a property of schema:Dataset, i.e. it cannot be used on an RO-Crate file as this maps to schema:MediaObject
EDIT: made the file a Dataset in the example above following @dgarijo's comments below
Are they disjoint (I don't see anything about that in schema.org)? If not, I don't see the problem in using them.
Would that mean making all ro-crate "files" be both schema:MediaObject and schema:Dataset?
not all of them, just the ones you want to describe with those properties. A research object may contain many files. Some of them may be datasets. Some may be Slides, workflows, SoftwareApplications...
Ok - made that change in the example above. Fact remains that schema.org doesn't cover a lot of common use cases for describing tabular data, so should we look at providing a simplified subset of CSVW more or less corresponding to table_schema?
I have a naive question: if the tabular format is a standard one, described in some ontology (but not at this granularity level), what should we do?
@dgarijo also mentions https://www.w3.org/TR/vocab-data-cube/
@stian suggested conformsTo or schema:additionalType (or maybe schema:schemaVersion)
isatab is another example
Hello, we are really interested in using RO-Crate for a project, and this use case would also be really important for us. Is there any news on this in general, or on integrating one of the existing solutions listed above? Thanks!
Thanks, @LauraWalters, for re-awakening this discussion - I've added this to the agenda for the RO-Crate Community Call this Thursday.
It would be good to hear more about your project's requirement on this, either in this issue or in the call.
Feel free to join if you have time, see #1 or https://s.apache.org/ro-crate-minutes for call details!
Also worth looking at the GA4GH Search API specification, which includes a JSON-based table definition.
@stain @LauraWalters @jmfernandez - just want to re-awake this discussion.
Has anyone done this for RO-Crate?
I have a simple example I want to code up from here: https://github.com/JTrippas/Spoken-Conversational-Search
How should I turn their text description of the columns in a CSV into something in RO-Crate? Or should I just create a text file with the text in it and link it as an encoding format?
@stain the link to GA4GH Search API specification above is 404.
@ptsefton I have had a look, and the repo and the target file were renamed. Here is a more stable link to the example: https://github.com/ga4gh-discovery/data-connect/blob/3a9be1fab628d0278eedcb5e70bb7d55f7d0a081/SPEC.md#table-discovery-and-browsing-examples
From the spec pointed out by @stain and my point of view, a CSV/TSV can be semantically described on one hand by the needed parameters to open it in R, Python or similar (encoding, column separator, comment character, etc...), and on the other hand enumerating the name, syntactic or semantic type and logical position of the columns.
EDIT: I have just read @ptsefton answer at https://github.com/ResearchObject/ro-crate/issues/64#issuecomment-903470850 , and W3C tabular metadata spec seems to cover all these points.
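As a small illustration of the two halves described above (the parameters needed to open the file, and the name, type and position of each column), here is a Python sketch using only the standard library. The keys of the `description` dictionary and the function `read_described_table` are invented for this example and do not come from any specification; the encoding is left out because the sketch parses text that is already decoded.

```python
import csv
import io

# Hypothetical description of a tab-separated file: how to open it, plus the
# name and type of each column in logical order. The keys are illustrative
# and not taken from any particular specification.
description = {
    "delimiter": "\t",
    "comment_char": "#",
    "columns": [
        {"name": "sample", "type": str},
        {"name": "depth", "type": int},
        {"name": "score", "type": float},
    ],
}


def read_described_table(text, desc):
    """Parse delimited text into typed dict rows using the description."""
    # Drop comment and blank lines before handing the rest to csv.reader.
    lines = [
        line for line in io.StringIO(text)
        if line.strip() and not line.startswith(desc["comment_char"])
    ]
    reader = csv.reader(lines, delimiter=desc["delimiter"])
    header = next(reader)
    expected = [col["name"] for col in desc["columns"]]
    if header != expected:
        raise ValueError(f"unexpected header: {header}")
    return [
        {col["name"]: col["type"](cell) for col, cell in zip(desc["columns"], row)}
        for row in reader
    ]


# Invented sample content for the sketch.
sample_text = "# exported 2021\nsample\tdepth\tscore\nA1\t10\t0.5\nB2\t20\t0.75\n"
rows = read_described_table(sample_text, description)
print(rows)
```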
@jmfernandez
How about we use W3C tabular metadata, but with its prefix, so we don't get confused by different definitions of name, for example?
Here's an example reworked from the example 2:
{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "@graph": [
    {
      "@id": "tree-ops.csv",
      "name": "Tree Operations",
      "keyword": ["tree", "street", "maintenance"],
      "publisher": {
        ...
      },
      "license": {"@id": "http://opendefinition.org/licenses/cc-by/"},
      "dateModified": "2010-12-31",
      "csvw:tableSchema": {
        "csvw:columns": [{
          "csvw:name": "GID",
          "csvw:titles": ["GID", "Generic Identifier"],
          "description": "An identifier for the operation on a tree.",
          "csvw:datatype": "string",
          "csvw:required": true
        }, {
          "csvw:name": "on_street",
          "csvw:titles": "On Street",
          "description": "The street that the tree is on.",
          "csvw:datatype": "string"
        }, {
          "csvw:name": "species",
          "csvw:titles": "Species",
          "description": "The species of the tree.",
          "csvw:datatype": "string"
        }, {
          "csvw:name": "trim_cycle",
          "csvw:titles": "Trim Cycle",
          "description": "The operation performed on the tree.",
          "csvw:datatype": "string"
        }, {
          "csvw:name": "inventory_date",
          "csvw:titles": "Inventory Date",
          "description": "The date of the operation that was performed.",
          "csvw:datatype": {"base": "date", "format": "M/d/yyyy"}
        }],
        "csvw:primaryKey": "GID",
        "csvw:aboutUrl": "#gid-{GID}"
      }
    }
  ]
}
Yes, I agree, if the standard already exists, we should reuse it. And btw, it could be a nice example about using annotations based on third-party ontologies along with RO-Crate. We could even consider the inclusion of a list of useful standards / ontologies, depending on the use case.
@ptsefton to have a go at reworking the example with explicit @type and flattened JSON-LD. This can become a new page in the spec.
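For anyone wondering what "flattened" means in practice, here is a rough Python sketch that splits a file entity with a nested csvw:tableSchema into separate entities that reference each other by @id. The identifier scheme (`#schema-<file>` and `#<column name>`) and the function name `flatten_table_schema` are just illustrative choices, not something the CSVW or RO-Crate specs prescribe.

```python
import json


def flatten_table_schema(file_entity):
    """Split a file entity with a nested csvw:tableSchema into flat entities.

    The "#schema-..." and "#<column name>" identifiers are an illustrative
    naming choice, not something the CSVW or RO-Crate specs prescribe.
    """
    nested = file_entity["csvw:tableSchema"]
    schema_id = "#schema-" + file_entity["@id"]

    # One entity per column, with an explicit @id and @type.
    column_entities = []
    for column in nested["csvw:columns"]:
        entity = {"@id": "#" + column["csvw:name"], "@type": "csvw:Column"}
        entity.update(column)
        column_entities.append(entity)

    # The schema entity keeps its other properties and only references columns.
    schema_entity = {"@id": schema_id, "@type": "csvw:Schema"}
    schema_entity.update({k: v for k, v in nested.items() if k != "csvw:columns"})
    schema_entity["csvw:columns"] = [{"@id": c["@id"]} for c in column_entities]

    # The file entity only references the schema.
    flat_file = {k: v for k, v in file_entity.items() if k != "csvw:tableSchema"}
    flat_file["csvw:tableSchema"] = {"@id": schema_id}
    return [flat_file, schema_entity] + column_entities


# Cut-down version of the tree-ops example, for illustration only.
nested_entity = {
    "@id": "tree-ops.csv",
    "@type": "File",
    "name": "Tree Operations",
    "csvw:tableSchema": {
        "csvw:columns": [
            {"csvw:name": "GID", "csvw:datatype": "string"},
            {"csvw:name": "species", "csvw:datatype": "string"},
        ],
        "csvw:primaryKey": "GID",
    },
}
graph = flatten_table_schema(nested_entity)
print(json.dumps(graph, indent=2))
```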
Have tried this out.
A CSV file can have a schema
Here we see a column definition referencing one with a similar spelling with sameAs
"@graph": [
{
"@id": "#Action",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "The action the participant takes in that utterance, these actions are described in the code book and allow for reproduction of the results.",
"name": "Action",
"sameAs": {
"@id": "#Code"
}
},
{
"@id": "#Actor_pair",
"@type": "csvw:Column",
"csvw:datatype": "",
"description": "13 different pairs completed three tasks. This column distinguishes the different pairs for each task (1-13)",
"name": "Actor_pair"
},
{
"@id": "#Code",
"@type": "csvw:Column",
"csvw:datatype": "",
"description": "The action the participant takes in that utterance, these actions are described in Trippas et al. (2020)",
"name": "Code",
"sameAs": {
"@id": "#Action"
}
},
{
"@id": "#File.name",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "Indicating the group number (2-14) and the date of the experiment.",
"name": "File.name"
},
{
"@id": "#Notes",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "Comments such as the particular search is stopped by the user or researcher or extra notes which relate to the action of the participant regarding the search session. *not included in the \"SCSdataset.csv\"",
"name": "Notes"
},
{
"@id": "#Query",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "The reference to the information need participants are solving.",
"name": "Query"
},
{
"@id": "#Query.complexity",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "One of three levels, referencing the task complexity type (remember, understand, and analyse).",
"name": "Query.complexity",
"sameAs": {
"@id": "#Query_complexity"
}
},
{
"@id": "#Query.counter",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "A counter which keeps track of how many turns there have been between the participants in that conversation. For the initial data release only the first two turns are given. However, the first three turns are presented if the second turn is classified under the Meta-communication Theme (See CHIIR 2017 paper for further information).",
"name": "Query.counter",
"sameAs": {
"@id": "#Query_counter"
}
},
{
"@id": "#Query_complexity",
"@type": "csvw:Column",
"csvw:datatype": "",
"description": "One of three levels, referencing the task complexity type (remember, understand, and analyse).",
"name": "Query_complexity"
},
{
"@id": "#Query_counter",
"@type": "csvw:Column",
"csvw:datatype": "",
"description": "A counter which keeps track of how many turns there have been between the participants in that conversation.",
"name": "Query_counter",
"sameAs": {
"@id": "#Query.counter"
}
},
{
"@id": "#Role",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "Which of the participants is talking in that particular utterance. The roles are annotated as A_User (participant who has the information need which needs to be solved) and B_Receiver (person who has access to the computer and search engine).",
"name": "Role"
},
{
"@id": "#Start.time",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "Start time of the utterance.",
"name": "Start.time",
"sameAs": {
"@id": "#Start_time"
}
},
{
"@id": "#Start_time",
"@type": "csvw:Column",
"csvw:datatype": "",
"description": "Start time of the utterance.",
"name": "Start_time",
"sameAs": {
"@id": "#Start.time"
}
},
{
"@id": "#Stop.time",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "Stop time of the utterance.",
"name": "Stop.time",
"sameAs": {
"@id": "#Stop_time"
}
},
{
"@id": "#Stop_time",
"@type": "csvw:Column",
"csvw:datatype": "",
"description": "Stop time of the utterance.",
"name": "Stop_time",
"sameAs": {
"@id": "#Stop.time"
}
},
{
"@id": "#Sub_themes",
"@type": "csvw:Column",
"csvw:datatype": "",
"description": "Subthemes based on codes as described in Trippas et al. (2020)",
"name": "Sub_themes"
},
{
"@id": "#Transcript",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "Transcripts of the utterance of the particular user in that particular timeslot.",
"name": "Transcript",
"sameAs": {
"@id": "#Transcription"
}
},
{
"@id": "#Transcription",
"@type": "csvw:Column",
"csvw:datatype": "",
"description": "Transcripts of the utterance of the particular user in that particular timeslot.",
"name": "Transcription",
"sameAs": {
"@id": "#Transcript"
}
},
{
"@id": "#ffaa324f-bdec-4bf5-a260-62cc39580129",
"@type": "Person",
"affiliation": "https://ror.org/04ttjf776",
"name": "Paul Thomas"
},
{
"@id": "#schema-ConversationalSearchDataSet.csv",
"@type": "csvw:Schema",
"columns": [
{
"@id": "#Start.time"
},
{
"@id": "#Stop.time"
},
{
"@id": "#Query"
},
{
"@id": "#Query.complexity"
},
{
"@id": "#Role"
},
{
"@id": "#Action"
},
{
"@id": "#Transcript"
},
{
"@id": "#Notes"
},
{
"@id": "#Query.counter"
},
{
"@id": "#File.name"
}
],
"name": "Schema for ConversationalSearchDataSet.csv"
},
{
"@id": "#schema-SCSdata_v1.csv",
"@type": "csvw:Schema",
"columns": [
{
"@id": "#Start_time"
},
{
"@id": "#Stop_time"
},
{
"@id": "#Query"
},
{
"@id": "#Query_complexity"
},
{
"@id": "#Role"
},
{
"@id": "#Sub_themes"
},
{
"@id": "#Code"
},
{
"@id": "#Query_counter"
},
{
"@id": "#Transcript"
},
{
"@id": "#Actor_pair"
}
],
"name": "Schema for SCSdata_v1.csv"
},
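To show how a consumer might use the sameAs links in a graph like the one above, here is a small Python sketch that resolves a schema's column references and pairs up columns across two schemas. The graph is a shortened stand-in with made-up schema identifiers (`#schema-a`, `#schema-b`), and the helper names are hypothetical. It treats sameAs as symmetric, which is an assumption on my part since some of the links above only go one way.

```python
# Shortened stand-in for the graph above: two schemas whose columns differ
# only in spelling, linked with sameAs (sometimes in one direction only).
graph = [
    {"@id": "#Start.time", "@type": "csvw:Column", "name": "Start.time",
     "sameAs": {"@id": "#Start_time"}},
    {"@id": "#Start_time", "@type": "csvw:Column", "name": "Start_time"},
    {"@id": "#Stop.time", "@type": "csvw:Column", "name": "Stop.time"},
    {"@id": "#Stop_time", "@type": "csvw:Column", "name": "Stop_time",
     "sameAs": {"@id": "#Stop.time"}},
    {"@id": "#Query", "@type": "csvw:Column", "name": "Query"},
    {"@id": "#Notes", "@type": "csvw:Column", "name": "Notes"},
    {"@id": "#schema-a", "@type": "csvw:Schema",
     "columns": [{"@id": "#Start.time"}, {"@id": "#Stop.time"},
                 {"@id": "#Query"}, {"@id": "#Notes"}]},
    {"@id": "#schema-b", "@type": "csvw:Schema",
     "columns": [{"@id": "#Start_time"}, {"@id": "#Stop_time"}, {"@id": "#Query"}]},
]

by_id = {entity["@id"]: entity for entity in graph}


def column_names(schema_id):
    """Resolve a schema's column references to column names, in order."""
    return [by_id[ref["@id"]]["name"] for ref in by_id[schema_id]["columns"]]


def corresponding_columns(schema_a, schema_b):
    """Pair columns that are shared or linked by sameAs in either direction."""
    ids_b = [ref["@id"] for ref in by_id[schema_b]["columns"]]
    # Map "target of sameAs" -> "column in schema_b that points at it".
    reverse = {by_id[i].get("sameAs", {}).get("@id"): i for i in ids_b}
    pairs = []
    for ref in by_id[schema_a]["columns"]:
        col_id = ref["@id"]
        same = by_id[col_id].get("sameAs", {}).get("@id")
        if col_id in ids_b:
            pairs.append((col_id, col_id))
        elif same in ids_b:
            pairs.append((col_id, same))
        elif col_id in reverse:
            pairs.append((col_id, reverse[col_id]))
    return pairs


print(column_names("#schema-b"))
print(corresponding_columns("#schema-a", "#schema-b"))
```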
Revisiting this as part of our work on the Text Commons RO-Crate profile.
Here's what we have now (including some new terms that are defined in a custom context)
A CSV file references a schema using the csvw:tableSchema property:
{
"@id": "files/427/original_bad0fd7f9c918df1db8b6a5b39faec48.csv",
"@type": [
"File",
"Annotation"
],
"name": "Transcript of interview with Patricia Colless full text transcription (CSV)",
"encodingFormat": "text/csv",
"annotationType": [
{
"@id": "olac:Transcription"
},
{
"@id": "olac:TimeAligned"
}
],
"modality": {
"@id": "olac:Orthography"
},
"annotationOf": {
"@id": "files/503/original_779656ecdb38dfb06cee9440773692a7.mp3"
},
"language": {
"@id": "https://www.ethnologue.com/language/eng"
},
"csvw:tableSchema": {
"@id": "#dialog_schema"
},
"size": 54363
},
{
"@id": "#dialog_schema",
"@type": "csvw:Schema",
"name": "Table schema for dialogue transcript",
"columns": [
{
"@id": "#speaker"
},
{
"@id": "#transcript"
},
{
"@id": "#start_time"
},
{
"@id": "#notes"
}
]
},
{
"@id": "#speaker",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "Which of the participants is talking in that particular utterance. ",
"name": "speaker"
},
{
"@id": "#transcript",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "Transcription of speaker turn",
"name": "text",
"sameAs": {
"@id": "olac:Transcription"
}
},
{
"@id": "#start_time",
"@type": "csvw:Column",
"description": "Start time of the utterance.",
"name": "time",
"sameAs": {
"@id": "https://schema.org/startTime"
}
},
{
"@id": "#notes",
"@type": "csvw:Column",
"csvw:datatype": "string",
"description": "Additional information",
"name": "notes"
},
This has some advantages over the schema.org approach suggested above by @eocarragain many moons ago and used by Science on Schema.org.
On the other hand, the csvw spec is very complicated and very strict, and by bringing it into the schema.org world we're not really using it properly - the sameAs reference to schema.org terms feels like a bit of a hack.
Maybe we should aim to bring the best of csvw into schema.org? (And while we're at it we could include worksheet as a level of organization so we can deal with spreadsheets.)
Including file content definitions is an important use case for our project. We've been working with concepts from the frictionless data framework to define file types that include many permutations of manually assembled and machine generated data files. A common scenario is for several different labs to produce assay data files that contain corresponding columns that could be aggregated for analysis, but there is no way to know that from the file headers. Using some of the concepts from frictionless, we define file types containing field descriptors, which can map to an rdf type so a data consumer will know which columns across various file types may be integrated. Though frictionless is geared toward tabular files, the field descriptors could be used to describe non-tabular data file contents as well.
Now that we are moving to RO Crates to package our metadata and files, we'd like to include these file type definitions in the crate metadata. Ideally, we'd like to be able to include a context entity for each file type and link these to the data files. The file type context entity would include the frictionless field descriptors. Following is an example of what this might look like (using "FrictionlessFileType" as a placeholder.) We are pretty new to RO Crates, so any advice is appreciated.
{
"@context": "https://w3id.org/ro/crate/1.1/context",
"@graph": [
{
"@id": "./",
"@type": "Dataset",
"datePublished": "2022-05-27T18:45:24+00:00",
"hasPart": [
{
"@id": "study/Food_Intake_9.3.2020.csv"
}
]
},
{
"@id": "study/Food_Intake_9.3.2020.csv",
"@type": "File",
"contentSize": "27710",
"name": "study/Food_Intake_9.3.2020.csv",
"frictionlessFileType": {
"@id": "food_intake_phenotype"
}
},
{
"@id": "food_intake_phenotype",
"@type": "FrictionlessFileType",
"encoding": "iso8859-1",
"format": "csv",
"hashing": "md5",
"schema": {
"fields": [
{
"id": "animal_diet",
"name": "Diet",
"type": "string",
"description": "Animal diet",
"rdfType": "http://www.ebi.ac.uk/efo/EFO_0002755",
"constraints": {
"required": "true",
"enum": [ "Envigo HFHS", "10% fat + fiber", "6% fat" ]
}
},
{
"id": "animal_weight",
"name": "Weight",
"type": "number",
"description": "Animal weight on day 0",
"rdfType": "http://www.ebi.ac.uk/efo/EFO_0004338",
"constraints": {
"required": "true"
}
}
]
}
}
]
}
This is an interesting approach I think Abigail - structurally it has quite a similar topology to the csvw approach but the documentation for Frictionless data is much more approachable.
A couple of comments - for RO-Crate the graph needs to be flattened - so all the fields will have to be separate entities with a @type attribute, FrictionlessField or maybe fd:Field if we used a namespace. Also the IDs should be URIs, so either #animal_weight or an http URI if you want to re-use them.
The constraints part is also problematic as for RO-Crate that would also need to be a separate entity - but in an RO-Crate dialect that could be direct properties of the field.
It could look something like this, maybe:
{
"@id": "#animal_weight",
"@type": "fd:Field",
"name": "Weight",
"fd:type": "number",
"description": "Animal weight on day 0",
"fd:rdfType": "http://www.ebi.ac.uk/efo/EFO_0004338",
"fd:required": "true"
}
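The conversion from a nested Frictionless field descriptor to a flat entity like the one above could be sketched in Python as follows. The fd: prefix and the hoisting of constraints to direct properties are just the suggestion from this comment, not an established vocabulary, and `field_to_entity` is a hypothetical name.

```python
def field_to_entity(field):
    """Turn a nested Frictionless field descriptor into one flat entity.

    The fd: prefix and the hoisting of constraints to direct properties
    follow the suggestion above; neither is an established vocabulary.
    """
    entity = {"@id": "#" + field["id"], "@type": "fd:Field"}
    for key, value in field.items():
        if key in ("id", "constraints"):
            continue
        # name and description are kept unprefixed, as in the example above.
        target = key if key in ("name", "description") else "fd:" + key
        entity[target] = value
    # Constraints become direct properties of the field entity.
    for key, value in field.get("constraints", {}).items():
        entity["fd:" + key] = value
    return entity


# The animal_weight field from the earlier example.
weight = {
    "id": "animal_weight",
    "name": "Weight",
    "type": "number",
    "description": "Animal weight on day 0",
    "rdfType": "http://www.ebi.ac.uk/efo/EFO_0004338",
    "constraints": {"required": "true"},
}
entity = field_to_entity(weight)
print(entity)
```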
OR another approach would be to put the frictionless schema in a file or at a URL and reference it - that way we don't have to force it into JSON-LD and it should work with FD tools. I think this is probably the way to go.
At the Language Data Commons of Australia we are taking the second approach I mentioned above, and implementing frictionless table schemas included as a data entity in an RO-Crate - initial documentation is here in the draft profile for language resources.
{
  "@id": "conversation1.csv",
  "@type": ["File"],
  "encodingFormat": "text/csv",
  "name": "Transcript of conversation 1",
  "conformsTo": {"@id": "arcp://name,ausnc.art/csv_schema"}
},
{
  "@id": "arcp://name,ausnc.art/csv_schema", ← REPOSITORY-UNIQUE NAME
  "@type": "CreativeWork",
  "name": "Frictionless Table Schema for CSV transcription files in the ART corpus",
  "sameAs": "art_schema.json", ← Reference to the schema file above TODO: is this the best link?
  "conformsTo": {"@id": "https://specs.frictionlessdata.io/table-schema/"}
},
{
  "@id": "artSchema",
  "@type": ["File"],
  "encodingFormat": "text/csv",
  "name": "Frictionless Table Schema file for CSV transcription files in the ART corpus",
  "conformsTo": {"@id": "https://specs.frictionlessdata.io/table-schema/"}
}
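One benefit of keeping the schema as a plain Table Schema file is that it can be checked against the CSV without any JSON-LD processing. The Python sketch below hand-rolls a very small subset of Table Schema (only `type` and `constraints.required`) using the standard library; it does not use the frictionless tooling, and the schema contents, column names and CSV rows are invented for illustration.

```python
import csv
import io
import json

# Stand-ins for a schema file and a CSV file in the crate; the contents and
# column names here are invented for illustration.
schema_text = json.dumps({
    "fields": [
        {"name": "speaker", "type": "string", "constraints": {"required": True}},
        {"name": "start_time", "type": "number"},
        {"name": "text", "type": "string"},
    ]
})
csv_text = "speaker,start_time,text\nA,0.0,Hello\n,1.5,Hi there\nB,soon,Bye\n"

# Only a few Table Schema types are handled in this sketch.
CASTS = {"string": str, "number": float, "integer": int}


def check_rows(csv_text, schema):
    """Return a list of (row_number, field_name, problem) tuples."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for field in schema["fields"]:
            value = row.get(field["name"], "")
            required = field.get("constraints", {}).get("required", False)
            if value == "" or value is None:
                if required:
                    problems.append((row_number, field["name"], "missing"))
                continue
            try:
                CASTS.get(field.get("type", "string"), str)(value)
            except ValueError:
                problems.append((row_number, field["name"], "bad " + field["type"]))
    return problems


problems = check_rows(csv_text, json.loads(schema_text))
print(problems)
```

In a real crate the two strings would be read from the files that the conformsTo and sameAs links point at.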
Hi all - Psych-DS maintainer here, and I found this discussion following a meeting with some ROCrate people including @stain. Couple of points, and we'd be very happy to partner if there's a useful way to do so!
Psych-DS is designed largely to get researchers who probably should be storing CSV files in a well-structured directory, to actually do so. It essentially tries to provide an implementation of some standard best practices in file and directory structures (e.g. http://www2.stat.duke.edu/~rcs46/lectures_2015/01-markdown-git/slides/naming-slides/naming-slides.pdf) so that researchers can check whether they've succeeded in following those recommendations.
The minimal version of Psych-DS should be writable by hand, by a person who does not know anything about JSON files. It should be transparent to a researcher what the information inside means, if that same person opens the JSON file. Alternatively, it should be implementable by a toolmaker who is not knowledgeable about linked data/JSON-LD/etc. but wants to provide users (who also don't know about those things!) with 'well structured' data that will be usable by other, similar tools.
The following really minimal schema.org/Dataset can provide an immediate benefit to the user - it can confirm whether there exists a CSV file, stored in the appropriate place, with a reasonable file name and an expected set of columns. If this is adopted widely, I would expect something like the following to be the extent of metadata for most datasets that exist, at least at first:
{
"@context" : "http://schema.org/",
"@type" : "Dataset",
"name" : "Psych-DS Example Dataset",
"description" : "This is a minimal example of a dataset for Psych-DS",
"variableMeasured" : ["study_id", "sub_id", "age_years", "responded", "trial_id", "response"]
}
And over time, we would be nudging researchers toward something more like:
{
"@context" : "http://schema.org/",
"@type" : "Dataset",
"name" : "Psych-DS Example Dataset",
"description" : "This is a slightly bigger and Dr. Seuss themed example of a Psych-DS dataset",
"author": ["Cat Inthehat", "Theodor Geisel"],
"citation": "Inthehat, C., & Geisel, T. (2019). Article title about something, 2(1), 45–54. https://doi.org/dostring.",
"schemaVersion": "Psych-DS 1.0",
"license": "https://creativecommons.org/licenses/by/4.0/",
"usageInfo": "This dataset can be freely reused, but here are some limitations on what this data can/can't actually tell us or how it should be interpreted.",
"variableMeasured" : [{"type":"PropertyValue","name":"study_id", "description":"This is the id code for the specific experiment the data point is from"},"more_variables"]}
I agree that schema.org/Dataset only sort-of meets our needs in terms of describing CSV files more fully, but it does provide a structured format that lets us validate against an externally established pattern while really minimizing the 'extra' that the user sees - there are essentially two lines of "magic", followed by no other explicit special syntax other than for variableMeasured.
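The "expected set of columns" check mentioned above can be sketched in a few lines of Python. This is not the Psych-DS validator, just a hypothetical illustration: `variable_names` and `compare_header` are made-up helpers, and the CSV content is invented. It accepts variableMeasured entries that are either plain strings or PropertyValue objects, since both forms appear in the examples.

```python
import csv
import io


def variable_names(dataset):
    """Collect names from variableMeasured, whether strings or PropertyValues."""
    names = []
    for item in dataset.get("variableMeasured", []):
        names.append(item if isinstance(item, str) else item["name"])
    return names


def compare_header(csv_text, dataset):
    """Return (missing_from_csv, undocumented_in_metadata) as sorted lists."""
    header = next(csv.reader(io.StringIO(csv_text)))
    declared = set(variable_names(dataset))
    return sorted(declared - set(header)), sorted(set(header) - declared)


# Cut-down metadata mixing both styles of variableMeasured entry.
dataset = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "Psych-DS Example Dataset",
    "variableMeasured": [
        {"type": "PropertyValue", "name": "study_id"},
        "sub_id",
        "age_years",
    ],
}
csv_text = "study_id,sub_id,response\n1,7,yes\n"
missing, undocumented = compare_header(csv_text, dataset)
print(missing, undocumented)
```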
If there was a subtype schema.org/TabularDataset we'd be all over it, as it might allow us to be more expressive about data type/formats & constraints - but something that's come up a few times is the idea of PDS as a 'handoff' format - getting a researcher 80% of the way toward what a data curator/archivist might want it to be, with enough structure that e.g. the corresponding frictionless or ROCrate metadata can be automatically produced or inferred. One nice property of JSON-LD is the ability to store multiple objects in a single file - I wonder if all the JSON-LD metadata formats for tabular data could start supporting a common pattern along the lines of:
metadata.json
as an alternate name to the top-level metadata file
Hello @eocarragain @dgarijo @stain, RO-Crate people (and others). I am just discovering and testing the RO-Crate concept today. I find it fantastic and would love to adopt it. I am really interested in mechanisms for a finer description of the data entities within an RO-Crate. The last comments here are from ~2 years ago. Has there been any progress lately on these aspects?
@mekline
If there was a subtype schema.org/TabularDataset we'd be all over it, as it might allow us to be more expressive about data type/formats & constraints
I am mostly unfamiliar with Schema.org and RO Crate, sorry if this is off-topic but would https://schema.org/SpreadsheetDigitalDocument be a thing here ?
As a researcher working with tabular data, I want to be able to define the columns (description, data-type, valid values/ranges, etc.), so that I can provide a structured data dictionary.
Approaches elsewhere: