frictionlessdata / tableschema-js

A JavaScript library for working with Table Schema.
http://frictionlessdata.io/
MIT License

Move getReadStream up to DataResource in Data Package (and other API improvements) #78

Closed rufuspollock closed 7 years ago

rufuspollock commented 7 years ago

The immediate suggestion is about using getReadStream (https://github.com/frictionlessdata/tableschema-js/blob/master/src/table.js#L218) to make a stream or rawStream method on Data Resource.

Aside: I don't think getReadStream needs to return a promise - it can be synchronous I think!

However this is just lead-in to the big message ...

Context

There's some important context here. It comes from thinking a lot about the relationship of the various FD libraries, and especially the relationship of something like tabulator (nice stream), tableschema (infer, schema, Table) and the Data Package / Resource libs.

Long story short: I think with a few tweaks we can get a really nice interface and centralize it in one lib (at the moment, whilst the libs are nicely factored, you have to look in too many places -- e.g. this issue ends up being here but relates to 3 other libs).

First I must emphasize a key point:

People don't care about Data Packages / Resources they care about opening a data file and doing something with it

Data Packages / Resources come up because they are a nicely agreed metadata structure for all the stuff that comes up in the background when you do that.

Put crudely: Most people are doing stuff with a file (or dataset), and they want to grab it and read it, preferably in a structured way, e.g. as a row iterator -- sometimes inferring or specifying stuff along the way, e.g. encoding, formatting, field types.

=> Our job is to help users to open that file (or dataset) and stream it as quickly as possible.

Interface

Here's a stab at the minimal viable interface, focused on the file-only case:

const resource = new DataResource(path)

// crude rows - no type guessing unless I tell it to
resource.rowStream()

// i might also want ...
resource.rawStream()

Here's a more detailed API definition:

// path can be local path or a url or whatever ...
const resource = new DataResource(path)

// this will spit out anything it has inferred and look a lot like a Data Resource object
// crucially it will have path, pathType (local, url etc), format, mediaType (the latter guessed), encoding
// I should be able to set any of these when I construct it if i want ... 
resource.descriptor

// ========
// Data access
// now the real stuff i want i.e. getting the contents of the file or info about it

// returns a JS stream object or a file-like object in python
// maybe call this rawStream() if we want to be really clear
resource.stream()

// if this resource can be parsed to rows of objects (like tabular stuff can - or even geojson!)
// this is a classic JS object stream
// (for Python, the Stream stuff from tabulator-py)
resource.rowStream()

// row stream (or maybe bytestream)
// metadata as first row (in bytestream would be first line)
// cf way that datapackage pipelines stream entire data package over single stdin/stdout pipe
//  this is a bit optional ... (could also implement via a flag to rowStream() or stream)
resource.singleStream()

// ------
// extras

// infer types
resource.infer()

// infer structure (e.g. csv separator etc)
resource.inferStructure()

// if this makes sense
resource.headers

// something like this ...
resource.isTabular

Comments

The above was done for JS but could be common to Python and any other language too (tabulator-py's Stream would need a stream() method rather than open() or similar, but that's minor).

At the moment we have all the ingredients for this, but we often don't provide them in one place, e.g.

Asides

roll commented 7 years ago

@rufuspollock (cc @pwalsh) What you've described is almost exactly what the tableschema.Table class does (in the reference implementation, for Python). The JavaScript version is for now missing schema inferring, headers, etc. So it's kind of half-implemented.

I see and do understand the problem of one entry point for the user. I also see the friction here; even a few of the things you've described are side problems, like:

But what we CAN'T do here:

What we COULD do:

rufuspollock commented 7 years ago

@roll I've got a long way through a PoC in JS for all of this here, which shows how an interface could look:

https://github.com/datahq/datahub-cli/blob/master/lib/utils/data.js

I've also been thinking that a lot of stuff like infer could be their own tiny libs and operate on Resources.
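As a sketch of that idea, type inference could live as a standalone function that operates on anything resource-like (headers plus a sample of rows) rather than being baked into one library. The `inferTypes` name and the two-type heuristic below are hypothetical, purely to illustrate the "tiny lib" shape:

```javascript
// Hypothetical standalone infer helper: takes headers and sample rows
// (as produced by something like resource.rowStream()) and guesses a
// field type for each column. Only number/string shown for brevity.
function inferTypes(headers, sampleRows) {
  return headers.map((name, i) => {
    const values = sampleRows.map(row => row[i])
    const allNumbers = values.every(v => v !== '' && !isNaN(Number(v)))
    return {name, type: allNumbers ? 'number' : 'string'}
  })
}
```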

roll commented 7 years ago

FIXED/WONTFIX


The current datapackage-js@v1 now incorporates a lot of this feedback and of the initial data.js. It fully covers the functionality and data flow from the "Here's a more detailed API definition" section above:

// path can be local path or a url or whatever ...
const resource = await Resource.load({path: 'data.csv'})

// Descriptor following Data Resource spec
// Minimal before inferring 
resource.descriptor

// returns a JS stream object or a file-like object in python
resource.iter({stream: true})

// if this resource can be parsed to rows of objects (like tabular stuff can - or even geojson!)
if (resource.tabular) {
  resource.table.iter() // array of rows
  resource.table.iter({keyed: true}) // array of keyed rows
  resource.table.iter({stream: true}) // Node Stream of rows (all row options could be used like keyed)
  // etc
}

// infer types
// infer structure (e.g. csv separator etc) - not implemented yet
resource.infer()

// now we have schema but could disable casting
if (resource.tabular) {
  resource.iter({cast: false}) // array of rows without casting
}

// if this makes sense
resource.table.headers

// something like this ...
resource.tabular // true

More detailed API and tutorial - https://github.com/frictionlessdata/datapackage-js#resource


Link to an alternative implementation data.js: