This issue describes the basic need for a standalone reader for tabular data. We want to support a few common formats that such data is often published in, and provide a consistent interface for reading data out of those sources.
As a first step, we should implement this interface for CSV, keeping in mind that we are designing for other formats, like Excel, as well.
I've done some initial work in this direction, so here is a mini-spec based on that:
Spec
Example usage
```python
# row will be either a tuple of utf-8 encoded string values, or, if keyed,
# a named tuple of values keyed by column name
from tabulator import Tabulator

datasource = 'file.csv'  # a filepath or a stream; some formats, like Excel, can only be a filepath
dataformat = 'csv'       # 'csv', 'json', 'ndjson', 'excel', 'ods'
options = {
    'schema': None,       # a dict of a valid JSON Table Schema, None, or 'infer'
    'headers': None,      # None, an integer (the row where the headers are), or an iterable
                          # (the headers themselves; don't look for them in the file)
    'encoding': None,     # encoding of the data source; prevents guessing, which can often be wrong
    'decode_strategy': 'replace',  # decode strategy
    'keyed': False,       # whether to return plain tuples or named tuples for each row
}

datatable = Tabulator(datasource, dataformat, **options)

def rows(datatable):
    for index, row in enumerate(datatable.values):
        yield (index, row)
```
Consistency over Py2/3
Whatever the stream input, turn it into a utf-8 encoded text stream (likewise when opening a file), so that components of a data processing pipeline do not have to handle this themselves.
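For illustration, here is a minimal sketch of how that normalization might be done; the `normalize_stream` helper is hypothetical and not part of the spec:

```python
import io

def normalize_stream(source, encoding=None, decode_strategy='replace'):
    # hypothetical helper: return a text stream from a filepath or binary
    # stream, so downstream pipeline components always see decoded text
    if isinstance(source, str):
        # a filepath: open in binary mode so we control the decoding ourselves
        source = open(source, 'rb')
    # wrap the byte stream, decoding with the given encoding (utf-8 by default)
    return io.TextIOWrapper(source, encoding=encoding or 'utf-8',
                            errors=decode_strategy)
```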
Data flow
When the data source is ingested, it is put into a pipeline that will produce a utf-8 encoded text stream as the data is iterated over.

When `.values` is iterated over, each row goes through an extended pipeline of generators (a sketch of such a pipeline follows this list):

- if a schema is passed, check each row against the schema, or infer a schema; in either case, cast values based on the schema and raise on casting errors (https://github.com/okfn/jsontableschema-py)
- lastly, a generator that produces the row output for the consumer as:
  - a tuple
  - a named tuple, if `keyed` is True and we have headers

We'd want to be able to add generators into this pipeline. Two useful generators would do the following, but we can add those after the basics are in place:

- detect and act on blank rows
- detect and act on rows with invalid dimensions
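As a rough illustration, here is a minimal sketch of such a generator pipeline. The function names (`skip_blank_rows`, `check_dimensions`, `emit`) and the inline sample data are hypothetical, chosen just to show the composition; they are not part of the proposed API. Schema checking/casting via jsontableschema-py would slot in as another stage; it is left out here to avoid guessing at that library's API.

```python
from collections import namedtuple

def skip_blank_rows(rows):
    # drop rows where every value is empty or whitespace
    for row in rows:
        if any(value.strip() for value in row):
            yield row

def check_dimensions(rows, width):
    # raise on rows whose length differs from the expected width
    for row in rows:
        if len(row) != width:
            raise ValueError('row has %d values, expected %d' % (len(row), width))
        yield row

def emit(rows, headers=None, keyed=False):
    # produce plain tuples, or named tuples when keyed and headers are available
    if keyed and headers:
        Row = namedtuple('Row', headers)
        for row in rows:
            yield Row(*row)
    else:
        for row in rows:
            yield tuple(row)

# composing the pipeline: each stage is a generator wrapping the previous one
# (raw_rows and headers would really come from the format-specific parser)
raw_rows = [['1', 'english'], ['', ''], ['2', 'chinese']]
headers = ['id', 'word']
values = emit(check_dimensions(skip_blank_rows(raw_rows), len(headers)),
              headers=headers, keyed=True)
for row in values:
    print(row)  # Row(id='1', word='english'), Row(id='2', word='chinese')
```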
Consumers need to be able to do the following (there are example APIs for all of this in GoodTables and MessyTables; a usage sketch follows this list):

- see the headers: `.headers`
- take a sample: `.get_sample()`
- reset the stream: `.replay`
- catch errors when the stream is not what it claims to be (e.g. it is HTML and not CSV)
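As a usage sketch only: catching a generic `Exception` below stands in for whatever error type the implementation defines, and whether `.replay` is a property or a method is left open by the spec:

```python
from tabulator import Tabulator

datatable = Tabulator('file.csv', 'csv', headers=1)

try:
    print(datatable.headers)       # see the headers
    print(datatable.get_sample())  # take a sample of the first rows
    datatable.replay               # reset the stream before the full read
    for row in datatable.values:
        pass
except Exception as error:         # e.g. the source is HTML, not CSV
    print('stream is not what it claims to be: %s' % error)
```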
@pudo comments/additions? I'm hoping we collaborate to turn this into the basis of MessyTables 2, and, going forward, even merge what are now GoodTables and MessyTables. Let's get some working code pushed here to talk on :). One of our developers, @roll, will be working on this over the next few days.