ColinMaudry / dgfr-tabular-metadata

Describing tabular data published on data.gouv.fr with CSVW metadata

An example of parsing with Peewee and their CSV loader #8

Open davidbgk opened 8 years ago

davidbgk commented 8 years ago

Using Python 3, with no semantic logic:


import json

from peewee import CharField, IntegerField
from playhouse.csv_loader import SqliteDatabase, load_csv

db = SqliteDatabase(':memory:')

ROOT_DATA_FOLDER = '../dgfr-tabular-metadata/elections-regionales/'

# From CSVW datatypes to peewee fields.
datatype_to_field_class = {
    'integer': IntegerField,
    'string': CharField
}

# Create fields and their names from the CSVW file.
with open(ROOT_DATA_FOLDER + 'RG15_Bvot_T2.json') as j:
    spec = json.load(j)
    field_names = []
    fields = []
    for column in spec['tableSchema']['columns']:
        field_names.append(column['name'])
        field_class = datatype_to_field_class[column['datatype']]
        # `verbose_name` not in use for now.
        fields.append(field_class(verbose_name=column['dcterms:description']))

# You have to convert the file to UTF-8 first:
# $ iconv -f ISO_8859-1 -t UTF-8 RG15_Bvot_T2.txt > RG15_Bvot_T2-utf8.txt
csv_path = ROOT_DATA_FOLDER + 'RG15_Bvot_T2-utf8.txt'

# The initial load takes about 1 minute for ~200,000 lines (12 MB) of CSV.
# It might be loaded into memory once at script launch(?)
elections = load_csv(db, csv_path, fields=fields, field_names=field_names,
                     delimiter=';')

# Iterate over the election results for Arles (department 13, commune 004).
for election in (elections.select()
                          .where(elections.CODDPT == '13')
                          .where(elections.CODSUBCOM == '004')):
    print(election.LIBSUBCOM, election.CODNUA)
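
For comparison, in case `playhouse.csv_loader` is not available in your peewee version, here is a minimal sketch of the same idea using only the standard library (`csv` and `sqlite3`). It is not the approach used above, just an alternative under stated assumptions: `load_table` and `SQL_TYPES` are hypothetical names, every `datatype` is assumed to be a plain string, every row is assumed to have exactly one value per declared column, and the sample rows are made up.

```python
import csv
import io
import sqlite3

# Hypothetical mapping from CSVW datatypes to SQLite column types.
SQL_TYPES = {'integer': 'INTEGER', 'string': 'TEXT'}


def load_table(connection, table, columns, csv_file, delimiter=';'):
    """Create `table` from CSVW column descriptions and fill it from a CSV."""
    definitions = ', '.join(
        '"{}" {}'.format(c['name'], SQL_TYPES[c['datatype']])
        for c in columns)
    connection.execute('CREATE TABLE "{}" ({})'.format(table, definitions))
    reader = csv.reader(csv_file, delimiter=delimiter)
    next(reader, None)  # Skip the header row; names come from the metadata.
    placeholders = ', '.join('?' for _ in columns)
    with connection:  # Insert all rows in a single transaction.
        connection.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table, placeholders),
            reader)


# Made-up column descriptions and rows, for illustration only.
columns = [
    {'name': 'CODDPT', 'datatype': 'string'},
    {'name': 'CODSUBCOM', 'datatype': 'string'},
    {'name': 'LIBSUBCOM', 'datatype': 'string'},
    {'name': 'CODNUA', 'datatype': 'string'},
]
data = io.StringIO(
    'CODDPT;CODSUBCOM;LIBSUBCOM;CODNUA\n'
    '13;004;Arles;XXX\n'
    '13;001;Aix-en-Provence;YYY\n'
)

connection = sqlite3.connect(':memory:')
load_table(connection, 'elections', columns, data)
rows = connection.execute(
    'SELECT LIBSUBCOM, CODNUA FROM elections '
    'WHERE CODDPT = ? AND CODSUBCOM = ?', ('13', '004')).fetchall()
print(rows)
```

With a real file, passing `open(path, encoding='iso-8859-1', newline='')` as `csv_file` should let Python decode the source directly, which would make the `iconv` step unnecessary.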

Discuss :)

ColinMaudry commented 8 years ago

Depending on what you want to achieve, there is an official JSON-LD Python utility that does what the JSON-LD playground does: it consumes JSON-LD and its context, and outputs various flavours of JSON-LD (flattened, expanded, etc.) or RDF (N-Quads, N-Triples).

What we need most of all, though, is a utility (preferably in Python) that consumes a CSV file and its CSVW annotation and produces a JSON-LD version of the CSV.
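
To make the idea concrete, here is a rough sketch of what such a utility could look like, using only the standard library. It is a simplification, not an implementation of the W3C conversion rules: `csv_to_jsonld` and `CASTS` are hypothetical names, only the `integer` and `string` datatypes are handled, `datatype` is assumed to be a plain string, `propertyUrl` is treated as a plain IRI rather than a URI template, and the metadata and rows below are made up.

```python
import csv
import io
import json

# Assumption: only the two CSVW datatypes used earlier in this thread.
CASTS = {'integer': int, 'string': str}


def csv_to_jsonld(csv_file, metadata, delimiter=';'):
    """Return a JSON-LD document with one node per CSV row."""
    columns = metadata['tableSchema']['columns']
    base = metadata.get('url', '')
    # Map each column name to an IRI: its propertyUrl when present,
    # otherwise "<table url>#<column name>".
    context = {c['name']: c.get('propertyUrl', base + '#' + c['name'])
               for c in columns}
    reader = csv.reader(csv_file, delimiter=delimiter)
    next(reader, None)  # Skip the header row; names come from the metadata.
    graph = []
    for record in reader:
        node = {}
        for column, value in zip(columns, record):
            if value == '':
                continue  # Treat empty cells as missing values.
            cast = CASTS[column.get('datatype', 'string')]
            node[column['name']] = cast(value)
        graph.append(node)
    return {'@context': context, '@graph': graph}


# Made-up metadata and rows, for illustration only.
metadata = {
    'url': 'http://example.org/results.csv',
    'tableSchema': {'columns': [
        {'name': 'CODDPT', 'datatype': 'string'},
        {'name': 'LIBSUBCOM', 'datatype': 'string'},
        {'name': 'INSCRITS', 'datatype': 'integer'},  # hypothetical column
    ]},
}
data = io.StringIO('CODDPT;LIBSUBCOM;INSCRITS\n13;Arles;812\n13;Arles;\n')
document = csv_to_jsonld(data, metadata)
print(json.dumps(document, indent=2))
```

The output could then be passed to a JSON-LD processor to expand or flatten it, or to convert it to RDF.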