This issue describes the basic need for a standalone reader for tabular data. We want to support a few common formats that such data is often published in, and provide a consistent interface for reading data out of those sources.
As a first step, we should implement this interface for CSV, keeping in mind that we are designing for other formats, like Excel, as well.
I've done some initial work in this direction, so here is a mini-spec based on that:
Spec
Example usage
```python
# row will be either a tuple of utf-8 encoded string values, or, if keyed,
# a named tuple of values keyed by column name
from tabulator import Tabulator

datasource = 'file.csv'  # a filepath or a stream; some formats, like Excel, can only be a filepath
dataformat = 'csv'       # 'csv', 'json', 'ndjson', 'excel', 'ods'
options = {
    'schema': None,       # a dict of a valid JSON Table Schema, None, or 'infer'
    'headers': None,      # None, an integer (the row where the headers are), or an iterable
                          # (the headers themselves; don't look for them in the file)
    'encoding': None,     # encoding of the data source; prevents guessing, which can often be wrong
    'decode_strategy': 'replace',  # decode strategy
    'keyed': False,       # whether to return plain tuples or named tuples for each row
}

datatable = Tabulator(datasource, dataformat, **options)

def rows(datatable):
    for index, row in enumerate(datatable.values):
        yield (index, row)
```
Consistency over Py2/3
Whatever the stream input, turn it into a utf-8 encoded text stream (likewise when opening a file), so that components of a data processing pipeline do not have to handle this themselves.
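For illustration, here is a minimal sketch of how that normalization might be done; the `normalize_stream` helper is hypothetical and not part of the spec:

```python
import io

def normalize_stream(source, encoding=None, decode_strategy='replace'):
    # hypothetical helper: return a text stream from a filepath or binary
    # stream, so downstream pipeline components always see decoded text
    if isinstance(source, str):
        # a filepath: open in binary mode so we control the decoding ourselves
        source = open(source, 'rb')
    # wrap the byte stream, decoding with the given encoding (utf-8 by default)
    return io.TextIOWrapper(source, encoding=encoding or 'utf-8',
                            errors=decode_strategy)
```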
Data flow
When the data source is ingested, it is put into a pipeline that will produce a utf-8 encoded text stream as the data is iterated over.

When `.values` is iterated over, each row goes through an extended pipeline of generators (a sketch of such a pipeline follows this list):

- if a schema is passed, check each row against the schema, or infer a schema; in either case, cast values based on the schema and raise on casting errors (https://github.com/okfn/jsontableschema-py)
- lastly, a generator that produces the row output for the consumer as:
  - a tuple
  - a named tuple, if `keyed` is True and we have headers

We'd want to be able to add generators into this pipeline. Two useful generators would do the following, but we can add those after the basics are in place:

- detect and act on blank rows
- detect and act on rows with invalid dimensions
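As a rough illustration, here is a minimal sketch of such a generator pipeline. The function names (`skip_blank_rows`, `check_dimensions`, `emit`) and the inline sample data are hypothetical, chosen just to show the composition; they are not part of the proposed API. Schema checking/casting via jsontableschema-py would slot in as another stage; it is left out here to avoid guessing at that library's API.

```python
from collections import namedtuple

def skip_blank_rows(rows):
    # drop rows where every value is empty or whitespace
    for row in rows:
        if any(value.strip() for value in row):
            yield row

def check_dimensions(rows, width):
    # raise on rows whose length differs from the expected width
    for row in rows:
        if len(row) != width:
            raise ValueError('row has %d values, expected %d' % (len(row), width))
        yield row

def emit(rows, headers=None, keyed=False):
    # produce plain tuples, or named tuples when keyed and headers are available
    if keyed and headers:
        Row = namedtuple('Row', headers)
        for row in rows:
            yield Row(*row)
    else:
        for row in rows:
            yield tuple(row)

# composing the pipeline: each stage is a generator wrapping the previous one
# (raw_rows and headers would really come from the format-specific parser)
raw_rows = [['1', 'english'], ['', ''], ['2', 'chinese']]
headers = ['id', 'word']
values = emit(check_dimensions(skip_blank_rows(raw_rows), len(headers)),
              headers=headers, keyed=True)
for row in values:
    print(row)  # Row(id='1', word='english'), Row(id='2', word='chinese')
```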
Consumers need to be able to do the following (there are example APIs for all of this in GoodTables and MessyTables; a usage sketch follows this list):

- see the headers: `.headers`
- take a sample: `.get_sample()`
- reset the stream: `.replay`
- catch errors when the stream is not what it claims to be (e.g. it is HTML and not CSV)
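As a usage sketch only: catching a generic `Exception` below stands in for whatever error type the implementation defines, and whether `.replay` is a property or a method is left open by the spec:

```python
from tabulator import Tabulator

datatable = Tabulator('file.csv', 'csv', headers=1)

try:
    print(datatable.headers)       # see the headers
    print(datatable.get_sample())  # take a sample of the first rows
    datatable.replay               # reset the stream before the full read
    for row in datatable.values:
        pass
except Exception as error:         # e.g. the source is HTML, not CSV
    print('stream is not what it claims to be: %s' % error)
```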
@pudo comments/additions? I'm hoping we collaborate to turn this into the basis of MessyTables 2, and, going forward, even merge what are now GoodTables and MessyTables. Let's get some working code pushed here to talk on :). One of our developers, @roll, will be working on this over the next few days.