Closed - xrotwang closed this pull request 3 years ago
Merging #49 (ed43c39) into master (f53756f) will decrease coverage by 0.12%. The diff coverage is 94.11%.
@@ Coverage Diff @@
## master #49 +/- ##
==========================================
- Coverage 99.91% 99.78% -0.13%
==========================================
Files 14 16 +2
Lines 2268 2319 +51
==========================================
+ Hits 2266 2314 +48
- Misses 2 5 +3
Impacted Files | Coverage Δ |
---|---|
src/csvw/frictionless.py | 90.00% <90.00%> (ø) |
tests/test_frictionless.py | 100.00% <100.00%> (ø) |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f53756f...ed43c39.
Thanks very much @xrotwang!
I am testing this out now. I'm running this on the same "Year,Location,Value" dataset you've got in the test. Here's my code:
from frictionless import describe_package
from csvw.frictionless import DataPackage

# Describe the CSV with frictionless and save the package descriptor:
fr_package = describe_package('data.csv')
fr_package.to_json('frictionless_package.json')

# Convert the frictionless package to a CSVW table group and save it:
csvw_package = DataPackage(dict(fr_package))
table_group = csvw_package.to_tablegroup()
table_group.to_file('csvw_package.json')
The output in csvw_package.json is:
{
"dc:source": "{\"profile\": \"data-package\", \"resources\": [{\"path\": \"data.csv\", \"stats\": {\"hash\": \"999c77f63bfb76ace71d6c754488c3ff\", \"bytes\": 131, \"fields\": 3, \"rows\": 9}, \"control\": {\"newline\": \"\"}, \"encoding\": \"utf-8\", \"dialect\": {}, \"schema\": {\"fields\": [{\"name\": \"Year\", \"type\": \"integer\"}, {\"name\": \"Location\", \"type\": \"string\"}, {\"name\": \"Value\", \"type\": \"integer\"}]}, \"name\": \"data\", \"profile\": \"tabular-data-resource\", \"scheme\": \"file\", \"format\": \"csv\", \"hashing\": \"md5\", \"compression\": \"no\", \"compressionPath\": \"\", \"query\": {}}]}",
"tables": [
{
"tableSchema": {
"columns": [
{
"datatype": "integer",
"name": "Year"
},
{
"datatype": "string",
"name": "Location"
},
{
"datatype": "integer",
"name": "Value"
}
]
},
"url": "data.csv"
}
]
}
The output in frictionless_package.json is:
{
"profile": "data-package",
"resources": [
{
"path": "data.csv",
"stats": {
"hash": "999c77f63bfb76ace71d6c754488c3ff",
"bytes": 131,
"fields": 3,
"rows": 9
},
"control": {
"newline": ""
},
"encoding": "utf-8",
"dialect": {},
"schema": {
"fields": [
{
"name": "Year",
"type": "integer"
},
{
"name": "Location",
"type": "string"
},
{
"name": "Value",
"type": "integer"
}
]
},
"name": "data",
"profile": "tabular-data-resource",
"scheme": "file",
"format": "csv",
"hashing": "md5",
"compression": "no",
"compressionPath": "",
"query": {}
}
]
}
It definitely seems to be converting the schema (at least from my limited understanding of CSVW). Would you expect any of the other metadata to come through?
Another comment, but probably unrelated to this PR: I'm also testing against other CSV files, and it appears to fail when it gets to a CSV file with a space in the column headers. Does CSVW disallow spaces in column headers? Here is a snippet of the error, when it gets to a column header of "Industry sector":
venv/lib/python3.8/site-packages/csvw/utils.py", line 44, in valid_re
raise ValueError(msg.format(value, attribute.name))
ValueError: Industry sector is not a valid name
@brockfanning re column names: Indeed, space is not allowed in CSVW column names, see https://github.com/cldf/csvw/blob/f53756f7d2888d621bf745791e85979d1907a445/src/csvw/metadata.py#L41-L43 - unless it's percent-encoded. So Industry%20sector would be ok.
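As a quick illustration (plain Python standard library, not part of csvw itself), the percent-encoded form can be produced like this:

```python
from urllib.parse import quote, unquote

# Percent-encode a field name containing a space so it becomes a valid
# CSVW column name; unquote reverses the mapping.
encoded = quote('Industry sector')
print(encoded)           # Industry%20sector
print(unquote(encoded))  # Industry sector
```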
What the conversion could do is use the percent-encoded header as the name property of the column, and the original header as the titles property of the column. Then, columns could still be unambiguously identified. When reading the data with csvw, the values would have to be accessed as row['Industry%20sector'], though - or a reader would have to map column titles to column names first.
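Such a reader-side mapping could be sketched as follows (plain Python; the column dicts and the helper name are made up for illustration):

```python
# Hypothetical helper: build a titles -> names mapping so row values can
# still be looked up by the original, human-readable header.
def title_to_name(columns):
    """columns: dicts like {'name': 'Industry%20sector', 'titles': 'Industry sector'}."""
    return {col['titles']: col['name'] for col in columns}

cols = [
    {'name': 'Industry%20sector', 'titles': 'Industry sector'},
    {'name': 'Value', 'titles': 'Value'},
]
mapping = title_to_name(cols)

row = {'Industry%20sector': 'Mining', 'Value': '42'}
# Access by title instead of the percent-encoded name:
value = row[mapping['Industry sector']]
print(value)  # Mining
```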
Regarding completeness of the conversion: It's still work-in-progress. The CSV dialect spec will have to be added. Regarding metadata like the things in stats, I'm not sure adding these makes a lot of sense. We'd have to add these using extra properties not specified by CSVW, and e.g. for stuff like hash sums, I'd rather have users wrap the whole package - i.e. data and metadata - in something like bagit.
> Regarding completeness of the conversion: It's still work-in-progress. The CSV dialect spec will have to be added. Regarding metadata like the things in stats, I'm not sure adding these makes a lot of sense. We'd have to add these using extra properties not specified by CSVW, and e.g. for stuff like hash sums, I'd rather have users wrap the whole package - i.e. data and metadata - in something like bagit.
Agreed - I definitely wouldn't want to suggest adding anything not specified by CSVW.
Can you help me out with a code example for if I wanted to add some arbitrary metadata to the table group? For example if I wanted to add "title" or "published", or even something not in the spec like "foo"?
CSVW's common properties are available as TableGroup.common_props. This is a dict you can just assign to:
tg.common_props['dc:title'] = '....'
tg.common_props['dc:date'] = '2021-01-06'
For properties which are not in any of the namespaces recognized by csvw, you'd have to use/make up full URIs, e.g.
tg.common_props['http://example.org/foo'] = '...'
or you just use dc:description and assign multiple values using a JSON object or array:
tg.common_props['dc:description'] = {'foo': '...'}
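To see what such assignments amount to, here is a plain-dict sketch (stdlib only, not csvw's actual serialization code) of how these common properties would sit alongside the tables in the serialized metadata:

```python
import json

# Stand-in for TableGroup.common_props: a plain dict of common properties.
# 'http://example.org/foo' is a made-up full URI for a non-namespaced property.
common_props = {
    'dc:title': 'My dataset',
    'dc:date': '2021-01-06',
    'http://example.org/foo': '...',
    'dc:description': {'foo': '...'},
}

# Common properties end up as top-level members next to "tables":
metadata = dict(common_props, tables=[{'url': 'data.csv'}])
print(json.dumps(metadata, indent=2))
```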
@brockfanning re-reading the CSVW spec, I can't find a passage about limiting column names anymore. So I assume I added this limitation because column names may be used in URI templates - e.g. in aboutUrl or rowUrl - and the URI-Template spec does indeed limit variable names.
However, it seems that this limitation is not enforced in the uritemplate package, which we use in csvw:
>>> from uritemplate import URITemplate
>>> t = URITemplate('http://example.org/{ab c}')
>>> t.expand({'ab c': 'x y'})
'http://example.org/x%20y'
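For reference, RFC 6570's grammar for variable names (varchar = ALPHA / DIGIT / "_" / pct-encoded; varname = varchar, optionally followed by further varchars separated by "."), can be transcribed into a small regex - a sketch for illustration, not something csvw itself does:

```python
import re

# RFC 6570 varname grammar as a regex.
VARCHAR = r'(?:[A-Za-z0-9_]|%[0-9A-Fa-f]{2})'
VARNAME = re.compile(r'{vc}(?:\.?{vc})*$'.format(vc=VARCHAR))

print(bool(VARNAME.match('Industry%20sector')))  # percent-encoded space is allowed
print(bool(VARNAME.match('ab c')))               # a literal space is not
```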
So I'd now lean towards lifting the artificial limitation of valid column names, and push the burden of synchronizing column names and variable names in associated URL properties to data producers.
> So I'd now lean towards lifting the artificial limitation of valid column names, and push the burden of synchronizing column names and variable names in associated URL properties to data producers.
That works for me.
And thank you for the code examples!
I will continue to test out this PR as you add commits, and I'll try to get feedback on the output from my other team members who know more about CSVW. However from my perspective it seems to be working great, and you are definitely in a better position to say when it's done. But I will make sure to leave comments if I run into any issues. Thanks for your support!
@xrotwang Along the same lines as the question about common properties, what is the best way to add properties to the tableSchema, and what about adding properties to individual columns?
For example, what is the best way to get the "propertyUrl", "valueUrl", and "aboutUrl" in this kind of snippet?
{
"tables": [
{
"url": "data.csv",
"tableSchema": {
"columns": [
{
"name": "year",
"titles": "Year",
"datatype": "string",
"propertyUrl": "http://example.com",
"valueUrl": "http://example.com"
},
{
"name": "value",
"titles": "Value",
"datatype": "integer",
"propertyUrl": "http://example.com"
}
],
"aboutUrl": "http://example.com"
}
}
]
}
Getting schema objects by position
>>> from csvw import TableGroup
>>> tg = TableGroup.from_file('data.csv-metadata.json')
>>> tg.tables[0].tableSchema.columns[0].propertyUrl
URITemplate("http://example.com")
or by name
>>> tg.tabledict['data.csv'].tableSchema.columndict['year'].propertyUrl
URITemplate("http://example.com")
And setting:
>>> from csvw.metadata import URITemplate
>>> tg.tables[0].tableSchema.columndict['year'].propertyUrl = URITemplate('http://example.com/{year}')
>>> tg.tables[0].tableSchema.columndict['year'].propertyUrl.expand(list(tg.tables[0])[0])
'http://example.com/2012'
@brockfanning the conversion is fairly complete now. What's lacking is some datatypes - notably the various date/time types, and column constraints.
@xrotwang This is working beautifully for me, thank you!
@brockfanning just released this functionality with csvw 1.9.0
@brockfanning please have a look at this PR. I think this could be your starting point.
closes #48