cldf / csvw

CSV on the web
Apache License 2.0

Example of programmatically creating metadata from scratch #48

Closed brockfanning closed 3 years ago

brockfanning commented 3 years ago

Hello,

I was wondering if it would be possible to provide an example of creating the metadata from scratch. My goal is to create a metadata file by programmatically examining a CSV file to determine the schema. No worries if this is out of scope for the library.

Thank you!

LinguList commented 3 years ago

There are plenty of examples. Pretty customized metadata files are used in the concepticon project. Just check the folder concepticondata/conceptlists/ there, where each TSV file is accompanied by a metadata file.

brockfanning commented 3 years ago

@LinguList Thanks for the quick reply! I apologize my question was unnecessarily vague. I'm actually looking for an example of programmatically (with Python code) creating metadata from scratch, using this library. To elaborate, I'm looking for either of these things, in order of preference:

  1. Python example of loading a CSV file and then generating the JSON metadata based on the CSV contents
  2. Python example of generating the JSON metadata using this library

The first item above is my ultimate goal, but if that is not possible with this library then I could get there with the help of the second item.

xrotwang commented 3 years ago

Hm. For 1, i.e. inferring metadata from CSV, you may be better off using the frictionless framework, even if you want to end up with CSVW, which would require a simple transformation of the JSON metadata.
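To make concrete what "inferring metadata from CSV" involves, here is a minimal sketch that uses only the Python standard library. It is not how frictionless does it; the function names `infer_type` and `infer_fields` are hypothetical, and only three types are distinguished (with `any` for all-empty columns as an assumption).

```python
import csv
import io


def infer_type(values):
    """Guess a type name for one column from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        # Assumption: a column without any values gets the catch-all type.
        return "any"
    for name, cast in (("integer", int), ("number", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def infer_fields(text, delimiter=","):
    """Return a list of {"name": ..., "type": ...} dicts for CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, body = rows[0], rows[1:]
    columns = zip(*body)  # transpose rows into columns
    return [{"name": n, "type": infer_type(col)} for n, col in zip(header, columns)]


# Hypothetical sample data, pipe-delimited, with one empty cell.
sample = "Year|Location|Value\n2010||10\n2011|Urban|22\n2012|Rural|34\n"
fields = infer_fields(sample, delimiter="|")
```

A real tool would also need to detect the delimiter and encoding and handle dates, booleans and ragged rows, which is why an existing framework is the more practical route.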

brockfanning commented 3 years ago

@xrotwang That's a great lead, thank you! Do you happen to know of any Frictionless-to-CSVW metadata converters out there? As you say it should be simple, though I would need to ramp up on both specs.

xrotwang commented 3 years ago

I'd be willing to help with writing the converter :)

brockfanning commented 3 years ago

@xrotwang That's amazing! Should I kick things off by creating a separate repo for this, and then ping you when I get stuck? Or were you imagining this being added to an existing repo?

xrotwang commented 3 years ago

As far as I'm concerned, such a converter could live in this repo as well, possibly as a class method Dataset.from_frictionless_metadata or similar.

brockfanning commented 3 years ago

That works for me. I'm not sure where to start here, so any help you can provide, even if only partial progress, would be greatly appreciated!

xrotwang commented 3 years ago

Will try to make a start this week.

brockfanning notifications@github.com wrote on Mon, 4 Jan 2021, 19:49:

That works for me. I'm not sure where to start here, so any help you can provide, even if only partial progress, would be greatly appreciated!


xrotwang commented 3 years ago

Skimming the documentation at https://frictionlessdata.io/tooling/python/describing-data/#describing-schema it seems that frictionless can infer the CSV dialect and the data types (partially), but not things like foreign key relations. But we could start out with the output of frictionless describe PATH/TO/TABLE --json --source-type package, i.e. something like

{
  "profile": "data-package",
  "resources": [
    {
      "path": "forms.tsv",
      "stats": {
        "hash": "91c89a7d4fe4d5d55ad8a383a64ea047",
        "bytes": 192127,
        "fields": 13,
        "rows": 2249
      },
      "control": {
        "newline": ""
      },
      "encoding": "utf-8",
      "dialect": {
        "delimiter": "\t"
      },
      "schema": {
        "fields": [
          {
            "name": "ID",
            "type": "string"
          },
          {
            "name": "Language_ID",
            "type": "string"
          },
          {
            "name": "Parameter_ID",
            "type": "integer"
          },
          {
            "name": "Value",
            "type": "string"
          },
          {
            "name": "Comment",
            "type": "any"
          },
          {
            "name": "Source",
            "type": "string"
          },
          {
            "name": "Graphemes",
            "type": "string"
          },
          {
            "name": "Profile",
            "type": "string"
          }
        ]
      },
      "name": "forms",
      "profile": "tabular-data-resource",
      "scheme": "file",
      "format": "tsv",
      "hashing": "md5",
      "compression": "no",
      "compressionPath": "",
      "query": {}
    }
  ]
}

and convert this into a csvw.TableGroup - which could then be enhanced programmatically.
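A rough sketch of such a conversion, working on plain dictionaries rather than on csvw objects. This is not the library's implementation: the function names and the type mapping are assumptions made for illustration, and unknown frictionless types simply fall back to `string`.

```python
import json

# Assumed mapping from frictionless field types to CSVW datatype names.
TYPE_MAP = {
    "string": "string", "integer": "integer", "number": "number",
    "boolean": "boolean", "date": "date", "time": "time",
    "datetime": "datetime", "duration": "duration", "any": "any",
    "year": "gYear", "yearmonth": "gYearMonth",
}


def resource_to_table(resource):
    """Convert one frictionless resource descriptor into a CSVW table description."""
    dialect = {"encoding": resource.get("encoding", "utf-8")}
    delimiter = resource.get("dialect", {}).get("delimiter")
    if delimiter:
        dialect["delimiter"] = delimiter
    columns = [
        {"name": f["name"], "datatype": TYPE_MAP.get(f.get("type"), "string")}
        for f in resource["schema"]["fields"]
    ]
    return {"url": resource["path"], "dialect": dialect, "tableSchema": {"columns": columns}}


def package_to_tablegroup(package):
    """Convert a frictionless data package descriptor into CSVW table group metadata."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "tables": [resource_to_table(r) for r in package["resources"]],
    }


# Trimmed version of the descriptor shown above.
package = {
    "profile": "data-package",
    "resources": [{
        "path": "forms.tsv",
        "encoding": "utf-8",
        "dialect": {"delimiter": "\t"},
        "schema": {"fields": [
            {"name": "ID", "type": "string"},
            {"name": "Parameter_ID", "type": "integer"},
            {"name": "Comment", "type": "any"},
        ]},
    }],
}
tablegroup = package_to_tablegroup(package)
print(json.dumps(tablegroup, indent=2))
```

Properties like `stats`, `hashing` or `compression` have no obvious place in the table description and are dropped in this sketch.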

xrotwang commented 3 years ago

@brockfanning could you give an example of the kind of CSV you'd want to create CSVW for? Does it use any naming scheme to give type or foreign key hints?

brockfanning commented 3 years ago

@xrotwang In my case it's pretty simple: each data package has exactly one standalone CSV file. So I don't believe there is any concern about foreign key hints (at least in my case). We don't have any naming scheme related to types. Here's an example:

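| Year | Location | Value |
|------|----------|-------|
| 2010 |          | 10    |
| 2011 |          | 20    |
| 2012 |          | 30    |
| 2010 | Urban    | 12    |
| 2011 | Urban    | 22    |
| 2012 | Urban    | 32    |
| 2010 | Rural    | 14    |
| 2011 | Rural    | 24    |
| 2012 | Rural    | 34    |

xrotwang commented 3 years ago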

Ok, so below is what frictionless describe makes of such data. It's enough to use csvw to read the data correctly, i.e. we have the info about

Once we can read the data with csvw, we could add information like

OTOH, that would make the data more difficult to edit - e.g. adding a row with a value below an inferred minimum would make the data invalid.

So I guess, we could/should distinguish two use cases:

  1. Seeding CSVW metadata for data which is still added to.
  2. Adding CSVW metadata to "finished" data for publication (in which case inferring properties like uniqueness, etc. would be useful).

---
metadata: test.csv
---

compression: 'no'
compressionPath: ''
control:
  newline: ''
dialect:
  delimiter: '|'
encoding: utf-8
format: csv
hashing: md5
name: test
path: test.csv
profile: tabular-data-resource
query: {}
schema:
  fields:
    - name: Year
      type: integer
    - name: Location
      type: string
    - name: Value
      type: integer
scheme: file
stats:
  bytes: 131
  fields: 3
  hash: b36e8c21563ab32645052c11510bddb7
  rows: 9

brockfanning commented 3 years ago

@xrotwang Just my 2 cents: If it helps keep things simple I would be fine with assuming that all inferring of the schema, and adjustments to the schema, will happen in the Frictionless object, before it gets to this library. For example if the schema needs constraints added, they'd be added according to the Frictionless table schema. In that case this library could focus on converting (as faithfully as possible) that metadata to meet the CSVW spec.

Side note: I agree with your distinction of use cases. I expect many providers want to publish CSVW just to make their data "interoperable", while others want the metadata for validation so that they can avoid maintenance problems in the future. My users definitely need the "infer from CSV" interoperability approach (hence this issue). To that end I'm imagining automating some of the adjustments you mention, like minimum/maximum, and turning values into an "enum". But again I am fine with doing all of that to the Frictionless object before sending it to this library.
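As an illustration of that kind of automation, the following sketch derives Table Schema style constraints (`required`, `minimum`, `maximum`, `enum`) from the values of a column. The helper `infer_constraints` and the `max_enum` threshold are hypothetical; the sketch works on plain field dictionaries, not on frictionless objects.

```python
def infer_constraints(field, values, max_enum=10):
    """Return a copy of a field descriptor with constraints inferred from string values."""
    non_empty = [v for v in values if v != ""]
    constraints = {}
    if values and len(non_empty) == len(values):
        constraints["required"] = True
    if field["type"] == "integer" and non_empty:
        numbers = [int(v) for v in non_empty]
        constraints["minimum"] = min(numbers)
        constraints["maximum"] = max(numbers)
    elif field["type"] == "string" and non_empty:
        distinct = sorted(set(non_empty))
        # Only treat the column as an enumeration if it has few distinct values.
        if len(distinct) <= max_enum:
            constraints["enum"] = distinct
    result = dict(field)
    if constraints:
        result["constraints"] = constraints
    return result


# Column values taken from the example table in this thread.
years = ["2010", "2011", "2012"] * 3
locations = ["", "", "", "Urban", "Urban", "Urban", "Rural", "Rural", "Rural"]

year_field = infer_constraints({"name": "Year", "type": "integer"}, years)
location_field = infer_constraints({"name": "Location", "type": "string"}, locations)
```

Constraints inferred this way describe the data as it is now: adding a row for a year outside the inferred range would make the data invalid, so this fits "finished" data better than data that is still being added to.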

xrotwang commented 3 years ago

Of course, the adjustments can also be made in the CSVW metadata - either manually editing the serialized JSON, or programmatically on the csvw.TableGroup object, which csvw.TableGroup.from_frictionless_metadata would return.
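The "editing the serialized JSON" route can also be scripted. The sketch below adds bounds to a column's datatype in CSVW metadata held as a plain dictionary; the metadata document and the helper `set_datatype_bounds` are hypothetical, and no csvw API is used here.

```python
import json

# Hypothetical CSVW metadata for the single-table example in this thread.
metadata = json.loads("""
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "test.csv",
    "dialect": {"delimiter": "|", "encoding": "utf-8"},
    "tableSchema": {"columns": [
      {"name": "Year", "datatype": "integer"},
      {"name": "Location", "datatype": "string"},
      {"name": "Value", "datatype": "integer"}
    ]}
  }]
}
""")


def set_datatype_bounds(metadata, table_url, column_name, minimum=None, maximum=None):
    """Turn a column's datatype into a derived datatype with minimum/maximum."""
    for table in metadata["tables"]:
        if table["url"] != table_url:
            continue
        for column in table["tableSchema"]["columns"]:
            if column["name"] != column_name:
                continue
            datatype = column.get("datatype", "string")
            if isinstance(datatype, str):
                # A bare name like "integer" becomes {"base": "integer"}.
                datatype = {"base": datatype}
            if minimum is not None:
                datatype["minimum"] = minimum
            if maximum is not None:
                datatype["maximum"] = maximum
            column["datatype"] = datatype
            return column
    raise KeyError((table_url, column_name))


year_column = set_datatype_bounds(metadata, "test.csv", "Year", minimum=2010, maximum=2012)
serialized = json.dumps(metadata, indent=2)
```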

xrotwang commented 3 years ago

And I agree, getting the simple case up and running would not only be the first step, but presumably useful functionality already. I have a proof-of-concept in my head :) - hope to find the time later today to push for you to review.