frictionlessdata / tableschema-js

A JavaScript library for working with Table Schema.
http://frictionlessdata.io/
MIT License

Move getReadStream up to DataResource in Data Package (and other API improvements) #78

Closed rufuspollock closed 7 years ago

rufuspollock commented 7 years ago

The immediate suggestion is about using getReadStream (https://github.com/frictionlessdata/tableschema-js/blob/master/src/table.js#L218) to make a stream or rawStream method on Data Resource.

Aside: I don't think getReadStream needs to return a promise - it can be synchronous I think!

However this is just lead-in to the big message ...

Context

There's some important context here. It comes from thinking a lot about the relationship of the various FD libraries, and especially the relationship of something like tabulator (nice stream), tableschema (infer, schema, Table) and the Data Package / Resource libs.

Long story short: I think with a few tweaks we can get a really nice interface and centralize it in one lib (at the moment, whilst the libs are nicely factored, you have to look in too many places -- e.g. this issue ends up being here but relates to 3 other libs).

First I must emphasize a key point:

People don't care about Data Packages / Resources they care about opening a data file and doing something with it

Data Packages / Resources come up because they are a nicely agreed metadata structure for all the stuff that comes up in the background when you do that.

Put crudely: Most people are doing stuff with a file (or dataset), and they want to grab it and read it, preferably in a structured way, e.g. as a row iterator -- sometimes inferring or specifying stuff along the way, e.g. encoding, formatting, field types.

=> Our job is to help users to open that file (or dataset) and stream it as quickly as possible.

Interface

Here's a stab at the minimal viable interface, focused on the file-only case:

const resource = new DataResource(path)

// crude rows - no type guessing unless I tell it to
resource.rowStream()

// i might also want ...
resource.rawStream()

Here's a more detailed API definition:

// path can be local path or a url or whatever ...
const resource = new DataResource(path)

// this will spit out anything it has inferred and look a lot like a Data Resource object
// crucially it will have path, pathType (local, url etc), format, mediaType (the latter guessed), encoding
// I should be able to set any of these when I construct it if i want ... 
resource.descriptor

// ========
// Data access
// now the real stuff i want i.e. getting the contents of the file or info about it

// returns a JS stream object or a file-like object in python
// maybe call this rawStream() if we want to be really clear
resource.stream()

// if this resource can be parsed to rows of objects (like tabular stuff can - or even geojson!)
// this is a classic JS object stream
// (for Python, the Stream stuff from tabulator-py)
resource.rowStream()

// row stream (or maybe bytestream)
// metadata as first row (in bytestream would be first line)
// cf way that datapackage pipelines stream entire data package over single stdin/stdout pipe
//  this is a bit optional ... (could also implement via a flag to rowStream() or stream)
resource.singleStream()

// ------
// extras

// infer types
resource.infer()

// infer structure (e.g. csv separator etc)
resource.inferStructure()

// if this makes sense
resource.headers

// something like this ...
resource.isTabular

Comments

The above was done for JS but could be common to Python and any other language too (tabulator-py's Stream would need a stream() method rather than open() or similar, but that's minor).

At the moment we have all the ingredients for this, but we often don't provide them in one place, e.g.

Asides

roll commented 7 years ago

@rufuspollock (cc @pwalsh) What you've described is almost exactly what the tableschema.Table class does (in the reference implementation, for Python). The JavaScript version is for now missing schema inferring, headers, etc. So it's kind of half-implemented.

I see and do understand the problem of one entry point for the user. I also see the friction here; even a few of the things you've described are side problems, like:

But what we CAN'T do here:

What we COULD do:

rufuspollock commented 7 years ago

@roll I've got a long way through a PoC in JS for all of this here, which shows how an interface could look:

https://github.com/datahq/datahub-cli/blob/master/lib/utils/data.js

I've also been thinking that a lot of stuff like infer could be their own tiny libs and operate on Resources.
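As a sketch of that idea, type inference could live as a standalone function that operates on anything resource-like (headers plus a sample of rows) rather than being baked into one library. The `inferTypes` name and the two-type heuristic below are hypothetical, purely to illustrate the "tiny lib" shape:

```javascript
// Hypothetical standalone infer helper: takes headers and sample rows
// (as produced by something like resource.rowStream()) and guesses a
// field type for each column. Only number/string shown for brevity.
function inferTypes(headers, sampleRows) {
  return headers.map((name, i) => {
    const values = sampleRows.map(row => row[i])
    const allNumbers = values.every(v => v !== '' && !isNaN(Number(v)))
    return {name, type: allNumbers ? 'number' : 'string'}
  })
}
```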

roll commented 7 years ago

FIXED/WONTFIX


The current datapackage-js@v1 now incorporates a lot of this feedback and of the initial data.js. It fully covers the functionality and data flow from the "Here's a more detailed API definition" section above:

// path can be local path or a url or whatever ...
const resource = await Resource.load({path: 'data.csv'})

// Descriptor following Data Resource spec
// Minimal before inferring 
resource.descriptor

// returns a JS stream object or a file-like object in python
resource.iter({stream: true})

// if this resource can be parsed to rows of objects (like tabular stuff can - or even geojson!)
if (resource.tabular) {
  resource.table.iter() // array of rows
  resource.table.iter({keyed: true}) // array of keyed rows
  resource.table.iter({stream: true}) // Node Stream of rows (all row options could be used like keyed)
  // etc
}

// infer types
// infer structure (e.g. csv separator etc) - not implemented yet
resource.infer()

// now we have schema but could disable casting
if (resource.tabular) {
  resource.iter({cast: false}) // array of rows without casting
}

// if this makes sense
resource.table.headers

// something like this ...
resource.tabular // true

More detailed API and tutorial - https://github.com/frictionlessdata/datapackage-js#resource


Link to an alternative implementation data.js: