Closed: brockfanning closed this issue 3 years ago
There are plenty of examples. Quite customized metadata files are used in the Concepticon project; just check the folder concepticondata/conceptlists/ there, where each TSV file is accompanied by a metadata file.
@LinguList Thanks for the quick reply! I apologize that my question was unnecessarily vague. I'm actually looking for an example of programmatically (with Python code) creating metadata from scratch, using this library. To elaborate, I'm looking for either of these things, in order of preference:
The first item above is my ultimate goal, but if that is not possible with this library then I could get there with the help of the second item.
Hm. For 1, i.e. inferring metadata from CSV, you may be better off using the frictionless framework, even if you'd want to go with CSVW, which would then require a simple transformation of the JSON metadata.
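For a sense of what "inferring metadata from CSV" involves at its simplest, here is a stdlib-only sketch of a type-inference pass over a CSV file. The function names (`infer_type`, `describe_csv`) are made up for illustration; this is not the frictionless or csvw API, just the underlying idea under minimal assumptions.

```python
import csv
import io

def infer_type(values):
    """Guess a single type for a column from its non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "any"
    # Try the most specific type first, fall back to string.
    for candidate, cast in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                cast(v)
            return candidate
        except ValueError:
            continue
    return "string"

def describe_csv(text, delimiter=","):
    """Return a minimal schema dict for CSV text: one entry per column."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    # Transpose rows into columns so each column can be inspected as a whole.
    columns = list(zip(*data)) if data else [()] * len(header)
    return {
        "fields": [
            {"name": name, "type": infer_type(col)}
            for name, col in zip(header, columns)
        ]
    }

schema = describe_csv("Year,Location,Value\n2010,Urban,12\n2011,Urban,22\n")
# schema["fields"] now lists Year as integer, Location as string, Value as integer
```

Frictionless does considerably more than this (dialect sniffing, sampling, stats), but the inferred-schema shape it produces is essentially of this kind.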
@xrotwang That's a great lead, thank you! Do you happen to know of any Frictionless-to-CSVW metadata converters out there? As you say it should be simple, though I would need to ramp up on both specs.
I'd be willing to help with writing the converter :)
@xrotwang That's amazing! Should I kick things off by creating a separate repo for this, and then ping you when I get stuck? Or were you imagining this being added to an existing repo?
As far as I'm concerned, such a converter could live in this repo as well, possibly as a class method `Dataset.from_frictionless_metadata` or similar.
That works for me. I'm not sure where to start here, so any help you can provide, even if only partial progress, would be greatly appreciated!
Will try to make a start this week.
brockfanning notifications@github.com wrote on Mon, Jan 4, 2021, 19:49:

> That works for me. I'm not sure where to start here, so any help you can provide, even if only partial progress, would be greatly appreciated!
Skimming the documentation at https://frictionlessdata.io/tooling/python/describing-data/#describing-schema it seems that frictionless can infer the CSV dialect and the data types (partially), but not things like foreign key relations. But we could start out with the output of `frictionless describe PATH/TO/TABLE --json --source-type package`, i.e. something like
```json
{
  "profile": "data-package",
  "resources": [
    {
      "path": "forms.tsv",
      "stats": {
        "hash": "91c89a7d4fe4d5d55ad8a383a64ea047",
        "bytes": 192127,
        "fields": 13,
        "rows": 2249
      },
      "control": {
        "newline": ""
      },
      "encoding": "utf-8",
      "dialect": {
        "delimiter": "\t"
      },
      "schema": {
        "fields": [
          {"name": "ID", "type": "string"},
          {"name": "Language_ID", "type": "string"},
          {"name": "Parameter_ID", "type": "integer"},
          {"name": "Value", "type": "string"},
          {"name": "Comment", "type": "any"},
          {"name": "Source", "type": "string"},
          {"name": "Graphemes", "type": "string"},
          {"name": "Profile", "type": "string"}
        ]
      },
      "name": "forms",
      "profile": "tabular-data-resource",
      "scheme": "file",
      "format": "tsv",
      "hashing": "md5",
      "compression": "no",
      "compressionPath": "",
      "query": {}
    }
  ]
}
```
and convert this into a `csvw.TableGroup`, which could then be enhanced programmatically.
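The core of such a conversion is a straightforward key-for-key mapping between the two metadata vocabularies. As a sketch, here it is expressed over plain dicts; the CSVW-side key names (`tables`, `url`, `dialect`, `tableSchema`, `columns`, `datatype`) follow the CSVW spec, but the function itself is hypothetical, not part of the csvw library:

```python
# frictionless "type" values that map directly onto CSVW datatypes
# (an assumed, partial mapping for illustration)
TYPE_MAP = {"string": "string", "integer": "integer", "number": "decimal",
            "boolean": "boolean", "any": "string"}

def frictionless_to_csvw(package):
    """Map a frictionless data-package dict to CSVW table-group metadata."""
    tables = []
    for resource in package.get("resources", []):
        tables.append({
            "url": resource["path"],
            "dialect": {
                "delimiter": resource.get("dialect", {}).get("delimiter", ","),
                "encoding": resource.get("encoding", "utf-8"),
            },
            "tableSchema": {
                "columns": [
                    {"name": f["name"],
                     "datatype": TYPE_MAP.get(f["type"], "string")}
                    for f in resource.get("schema", {}).get("fields", [])
                ],
            },
        })
    return {"@context": "http://www.w3.org/ns/csvw", "tables": tables}

meta = frictionless_to_csvw({
    "resources": [{"path": "forms.tsv",
                   "dialect": {"delimiter": "\t"},
                   "schema": {"fields": [{"name": "ID", "type": "string"}]}}]
})
```

The resulting dict could then be fed to the csvw library (or serialized as a `-metadata.json` file) and refined from there.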
@brockfanning could you give an example of the kind of CSV you'd want to create CSVW for? Does it use any naming scheme to give type or foreign key hints?
@xrotwang In my case it's pretty simple: each data package has exactly one standalone CSV file. So I don't believe there is any concern about foreign key hints (at least in my case). We don't have any naming scheme related to types. Here's an example:
| Year | Location | Value |
|---|---|---|
| 2010 | 10 | |
| 2011 | 20 | |
| 2012 | 30 | |
| 2010 | Urban | 12 |
| 2011 | Urban | 22 |
| 2012 | Urban | 32 |
| 2010 | Rural | 14 |
| 2011 | Rural | 24 |
| 2012 | Rural | 34 |
Ok, so below is what `frictionless describe` makes of such data. It's enough to use `csvw` to read the data correctly, i.e. we have the info about the CSV dialect and the column datatypes.
Once we can read the data with `csvw`, we could add information like

- `required`, if there are no empty values for a column
- `unique`
- `minimum` or `maximum`

OTOH, that would make the data more difficult to edit: e.g. adding a row with a value below an inferred minimum would make the data invalid.
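Inferring such constraints from a column's raw values could look like the sketch below. `infer_constraints` is a hypothetical helper; the `datatype`/`required` keys follow the CSVW column description, while `all_unique` is just recorded for comparison, since uniqueness maps to a primary-key or `unique` constraint rather than a datatype property:

```python
def infer_constraints(values, datatype):
    """Derive CSVW-style column properties from one column's raw values.

    Hypothetical helper for illustration; not part of the csvw library.
    """
    non_empty = [v for v in values if v != ""]
    spec = {"base": datatype}
    if datatype == "integer" and non_empty:
        # Tightest bounds that still accept every existing value.
        nums = [int(v) for v in non_empty]
        spec["minimum"] = min(nums)
        spec["maximum"] = max(nums)
    return {
        "datatype": spec,
        # required: no cell in this column was empty
        "required": len(non_empty) == len(values),
        # uniqueness hint; CSVW models this via keys, not datatypes
        "all_unique": len(set(values)) == len(values),
    }

info = infer_constraints(["10", "20", "30"], "integer")
```

This also makes the maintenance trade-off concrete: the inferred `maximum` of 30 here would reject a later row with Value 40 even though that row is perfectly plausible data.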
So I guess, we could/should distinguish two use cases:
```yaml
---
metadata: test.csv
---

compression: 'no'
compressionPath: ''
control:
  newline: ''
dialect:
  delimiter: '|'
encoding: utf-8
format: csv
hashing: md5
name: test
path: test.csv
profile: tabular-data-resource
query: {}
schema:
  fields:
    - name: Year
      type: integer
    - name: Location
      type: string
    - name: Value
      type: integer
scheme: file
stats:
  bytes: 131
  fields: 3
  hash: b36e8c21563ab32645052c11510bddb7
  rows: 9
```
@xrotwang Just my 2 cents: If it helps keep things simple I would be fine with assuming that all inferring of the schema, and adjustments to the schema, will happen in the Frictionless object, before it gets to this library. For example if the schema needs constraints added, they'd be added according to the Frictionless table schema. In that case this library could focus on converting (as faithfully as possible) that metadata to meet the CSVW spec.
Side note: I agree with your distinction of use cases. I expect many providers want to publish CSVW just to make their data "interoperable", while others want the metadata for validation so that they can avoid maintenance problems in the future. My users definitely need the "infer from CSV" interoperability approach (hence this issue). To that end I'm imagining automating some of the adjustments you mention, like minimum/maximum, and turning values into an "enum". But again I am fine with doing all of that to the Frictionless object before sending it to this library.
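Turning values into an enum on the Frictionless side could be as simple as the heuristic below. `maybe_enum` and the distinct-value threshold of 10 are assumptions for illustration; the `constraints.enum` key is the one the Frictionless table schema spec actually uses:

```python
def maybe_enum(values, max_distinct=10):
    """If a column has few distinct non-empty values, treat it as an enum.

    Heuristic sketch; the threshold of 10 is an arbitrary assumption.
    Returns a frictionless-style constraints fragment, or {} if the
    column doesn't look enum-like.
    """
    distinct = sorted({v for v in values if v != ""})
    if 0 < len(distinct) <= max_distinct:
        return {"constraints": {"enum": distinct}}
    return {}

col = maybe_enum(["Urban", "Rural", "Urban", "Rural", ""])
# e.g. {"constraints": {"enum": ["Rural", "Urban"]}}
```

As with min/max, this inherits the trade-off discussed above: a new category value would fail validation until the enum is widened.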
Of course, the adjustments can also be made in the CSVW metadata: either by manually editing the serialized JSON, or programmatically on the `csvw.TableGroup` object which `csvw.TableGroup.from_frictionless_metadata` would return.
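The "editing the serialized JSON" route needs nothing beyond the stdlib. The metadata dict below is a minimal made-up example of CSVW metadata, not output of the library; the adjustment tightens one column's datatype after inspecting the data:

```python
import json

# Minimal, made-up CSVW metadata for a single table.
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "tables": [{
        "url": "test.csv",
        "tableSchema": {"columns": [
            {"name": "Year", "datatype": "integer"},
            {"name": "Value", "datatype": "integer"},
        ]},
    }],
}

# Tighten the Year column after inspecting the data: replace the bare
# datatype name with a datatype description carrying bounds.
for col in metadata["tables"][0]["tableSchema"]["columns"]:
    if col["name"] == "Year":
        col["datatype"] = {"base": "integer", "minimum": 2010, "maximum": 2012}
        col["required"] = True

print(json.dumps(metadata, indent=2))
```

Doing the same on a `TableGroup` object would follow the same shape, just through the library's attribute access instead of dict keys.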
And I agree, getting the simple case up and running would not only be the first step, but presumably useful functionality already. I have a proof-of-concept in my head :) and hope to find the time later today to push it for you to review.
Hello,
I was wondering if it would be possible to provide an example of creating the metadata from scratch. My goal is to create a metadata file by programmatically examining a CSV file to determine the schema. No worries if this is out of scope for the library.
Thank you!