cldf / csvw

CSV on the web
Apache License 2.0

first stab at a frictionless converter #49

Closed xrotwang closed 3 years ago

xrotwang commented 3 years ago

@brockfanning please have a look at this PR. I think this could be your starting point.

closes #48

codecov-io commented 3 years ago

Codecov Report

Merging #49 (ed43c39) into master (f53756f) will decrease coverage by 0.12%. The diff coverage is 94.11%.


@@            Coverage Diff             @@
##           master      #49      +/-   ##
==========================================
- Coverage   99.91%   99.78%   -0.13%     
==========================================
  Files          14       16       +2     
  Lines        2268     2319      +51     
==========================================
+ Hits         2266     2314      +48     
- Misses          2        5       +3     
Impacted Files               Coverage Δ
src/csvw/frictionless.py     90.00% <90.00%> (ø)
tests/test_frictionless.py   100.00% <100.00%> (ø)


brockfanning commented 3 years ago

Thanks very much @xrotwang!

I am testing this out now. I'm running this on the same "Year,Location,Value" dataset you've got in the test. Here's my code:

from frictionless import describe_package
from csvw.frictionless import DataPackage

fr_package = describe_package('data.csv')
fr_package.to_json('frictionless_package.json')

csvw_package = DataPackage(dict(fr_package))
table_group = csvw_package.to_tablegroup()
table_group.to_file('csvw_package.json')

The output in csvw_package.json is:

{
    "dc:source": "{\"profile\": \"data-package\", \"resources\": [{\"path\": \"data.csv\", \"stats\": {\"hash\": \"999c77f63bfb76ace71d6c754488c3ff\", \"bytes\": 131, \"fields\": 3, \"rows\": 9}, \"control\": {\"newline\": \"\"}, \"encoding\": \"utf-8\", \"dialect\": {}, \"schema\": {\"fields\": [{\"name\": \"Year\", \"type\": \"integer\"}, {\"name\": \"Location\", \"type\": \"string\"}, {\"name\": \"Value\", \"type\": \"integer\"}]}, \"name\": \"data\", \"profile\": \"tabular-data-resource\", \"scheme\": \"file\", \"format\": \"csv\", \"hashing\": \"md5\", \"compression\": \"no\", \"compressionPath\": \"\", \"query\": {}}]}",
    "tables": [
        {
            "tableSchema": {
                "columns": [
                    {
                        "datatype": "integer",
                        "name": "Year"
                    },
                    {
                        "datatype": "string",
                        "name": "Location"
                    },
                    {
                        "datatype": "integer",
                        "name": "Value"
                    }
                ]
            },
            "url": "data.csv"
        }
    ]
}

The output in frictionless_package.json is:

{
  "profile": "data-package",
  "resources": [
    {
      "path": "data.csv",
      "stats": {
        "hash": "999c77f63bfb76ace71d6c754488c3ff",
        "bytes": 131,
        "fields": 3,
        "rows": 9
      },
      "control": {
        "newline": ""
      },
      "encoding": "utf-8",
      "dialect": {},
      "schema": {
        "fields": [
          {
            "name": "Year",
            "type": "integer"
          },
          {
            "name": "Location",
            "type": "string"
          },
          {
            "name": "Value",
            "type": "integer"
          }
        ]
      },
      "name": "data",
      "profile": "tabular-data-resource",
      "scheme": "file",
      "format": "csv",
      "hashing": "md5",
      "compression": "no",
      "compressionPath": "",
      "query": {}
    }
  ]
}

It definitely seems to be converting the schema (at least from my limited understanding of CSVW). Would you expect any of the other metadata to come through?

brockfanning commented 3 years ago

Another comment, but probably unrelated to this PR: I'm also testing against other CSV files, and it appears to fail when it gets to a CSV file with a space in the column headers. Does CSVW disallow spaces in column headers? Here is a snippet of the error, when it gets to a column header of "Industry sector":

venv/lib/python3.8/site-packages/csvw/utils.py", line 44, in valid_re
    raise ValueError(msg.format(value, attribute.name))
ValueError: Industry sector is not a valid name
xrotwang commented 3 years ago

@brockfanning re column names: Indeed, space is not allowed in CSVW column names, see https://github.com/cldf/csvw/blob/f53756f7d2888d621bf745791e85979d1907a445/src/csvw/metadata.py#L41-L43 - unless it's percent encoded. So Industry%20sector would be ok.

What the conversion could do is:

  1. Percent-encode invalid names, and use these as name property for the column.
  2. Add the original name to the titles property of the column.

Then, columns could still be unambiguously identified. When reading the data with csvw, the values would have to be accessed as row['Industry%20sector'], though - or a reader would have to map column titles to column names first.
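The two steps above can be sketched in pure Python. Note this is only an illustration of the proposed behavior, not the actual converter code: the `to_csvw_column` helper and the validation regex are made up for this example (csvw's real name validation lives in `src/csvw/metadata.py`).

```python
import re
from urllib.parse import quote

# Rough stand-in for csvw's column-name validation; the real pattern
# in csvw.metadata may differ.
_VALID_NAME = re.compile(r'^[A-Za-z0-9\-._%]+$')

def to_csvw_column(field_name):
    """Sketch of the proposed conversion: percent-encode invalid names
    (step 1) and keep the original string as the column's title (step 2)."""
    if _VALID_NAME.match(field_name):
        return {'name': field_name}
    return {'name': quote(field_name, safe=''), 'titles': field_name}

print(to_csvw_column('Year'))
# {'name': 'Year'}
print(to_csvw_column('Industry sector'))
# {'name': 'Industry%20sector', 'titles': 'Industry sector'}
```

A reader of the resulting CSVW metadata can then map titles back to names to recover the original headers.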

xrotwang commented 3 years ago

Regarding completeness of the conversion: It's still work-in-progress. The CSV dialect spec will have to be added. Regarding metadata like the things in stats, I'm not sure adding these makes a lot of sense. We'd have to add these using extra properties not specified by CSVW, and e.g. for stuff like hash sums, I'd rather have users wrap the whole package - i.e. data and metadata - in something like bagit.

brockfanning commented 3 years ago

Regarding completeness of the conversion: It's still work-in-progress. The CSV dialect spec will have to be added. Regarding metadata like the things in stats, I'm not sure adding these makes a lot of sense. We'd have to add these using extra properties not specified by CSVW, and e.g. for stuff like hash sums, I'd rather have users wrap the whole package - i.e. data and metadata - in something like bagit.

Agreed - I definitely wouldn't want to suggest adding anything not specified by CSVW.

Can you help me out with a code example for if I wanted to add some arbitrary metadata to the table group? For example if I wanted to add "title" or "published", or even something not in the spec like "foo"?

xrotwang commented 3 years ago

CSVW's common properties are available as TableGroup.common_props. This is a dict you can just assign to:

tg.common_props['dc:title'] = '....'
tg.common_props['dc:date'] = '2021-01-06'

For properties which are not in any of the namespaces recognized by csvw, you'd have to use (or make up) full URIs, e.g.

tg.common_props['http://example.org/foo'] = '...'

or you just use dc:description and assign multiple values using a JSON object or array:

tg.common_props['dc:description'] = {'foo': '...'}
xrotwang commented 3 years ago

@brockfanning re-reading the CSVW spec, I can't find a passage limiting column names anymore. So I assume I added this limitation because column names may be used in URI templates - e.g. in aboutUrl or valueUrl - and the URI Template spec does indeed limit variable names. However, it seems that this limitation is not enforced in the uritemplate package, which we use in csvw:

>>> from uritemplate import URITemplate
>>> t = URITemplate('http://example.org/{ab c}')
>>> t.expand({'ab c': 'x y'})
'http://example.org/x%20y'

So I'd now lean towards lifting the artificial limitation of valid column names, and push the burden of synchronizing column names and variable names in associated URL properties to data producers.

brockfanning commented 3 years ago

So I'd now lean towards lifting the artificial limitation of valid column names, and push the burden of synchronizing column names and variable names in associated URL properties to data producers.

That works for me.

And thank you for the code examples!

I will continue to test out this PR as you add commits, and I'll try to get feedback on the output from my other team members who know more about CSVW. However from my perspective it seems to be working great, and you are definitely in a better position to say when it's done. But I will make sure to leave comments if I run into any issues. Thanks for your support!

brockfanning commented 3 years ago

@xrotwang Along the same lines as the question about common properties, what is the best way to add properties to the tableSchema, and what about adding properties to individual columns?

For example, what is the best way to get the "propertyUrl", "valueUrl", and "aboutUrl" in this kind of snippet?

{
    "tables": [
        {
            "url": "data.csv",
            "tableSchema": {
                "columns": [
                    {
                        "name": "year",
                        "titles": "Year",
                        "datatype": "string",
                        "propertyUrl": "http://example.com",
                        "valueUrl": "http://example.com"
                    },
                    {
                        "name": "value",
                        "titles": "Value",
                        "datatype": "integer",
                        "propertyUrl": "http://example.com"
                    }
                ],
                "aboutUrl": "http://example.com"
            }
        }
    ]
}
xrotwang commented 3 years ago

Getting schema objects by position

>>> from csvw import TableGroup
>>> tg = TableGroup.from_file('data.csv-metadata.json')
>>> tg.tables[0].tableSchema.columns[0].propertyUrl
URITemplate("http://example.com")

or by name

>>> tg.tabledict['data.csv'].tableSchema.columndict['year'].propertyUrl
URITemplate("http://example.com")

And setting:

>>> from csvw.metadata import URITemplate
>>> tg.tables[0].tableSchema.columndict['year'].propertyUrl = URITemplate('http://example.com/{year}')
>>> tg.tables[0].tableSchema.columndict['year'].propertyUrl.expand(list(tg.tables[0])[0])
'http://example.com/2012'
xrotwang commented 3 years ago

@brockfanning the conversion is fairly complete now. Still lacking are some datatypes - notably the various date/time types - and column constraints.
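The remaining datatype work amounts to a mapping from Frictionless field types to CSVW datatypes. The sketch below is illustrative only - the mapping table and the `convert_field` helper are assumptions for this example, not the actual code in `csvw.frictionless`, and the date/time entries are exactly the part flagged above as still missing.

```python
# Hypothetical mapping from Frictionless field types to CSVW datatype
# base names; the real converter may use different names or structures.
FRICTIONLESS_TO_CSVW = {
    'string': 'string',
    'integer': 'integer',
    'number': 'number',
    'boolean': 'boolean',
    'date': 'date',          # date/time types: still to be handled
    'datetime': 'datetime',  # date/time types: still to be handled
}

def convert_field(field):
    """Convert a Frictionless field description to a CSVW column dict
    (sketch; falls back to 'string' for unknown types)."""
    return {
        'name': field['name'],
        'datatype': FRICTIONLESS_TO_CSVW.get(field.get('type'), 'string'),
    }

print(convert_field({'name': 'Year', 'type': 'integer'}))
# {'name': 'Year', 'datatype': 'integer'}
```

Column constraints (e.g. Frictionless `constraints.required` or `constraints.enum`) would similarly need to be translated into CSVW datatype facets or the column's `required` property.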

brockfanning commented 3 years ago

@xrotwang This is working beautifully for me, thank you!

xrotwang commented 3 years ago

@brockfanning just released this functionality with csvw 1.9.0