Closed: rufuspollock closed this issue 7 years ago
@rufuspollock (cc @pwalsh)
What you've described is almost exactly what the `tableschema.Table` class does (in the reference implementation, for Python). The JavaScript version is still missing schema inferring, headers, etc., so it's kind of half-implemented.

I see and do understand the problem of having one entry point for the user. I also see the friction here, though a few of the things you've described are side problems:
- `datapackage-py` just doesn't have documentation at all, so it's hard to judge it for now
- `tabulator` knowledge is not required to use `tableschema`/`datapackage`. Again, it's only a docs problem.

But what we CAN'T do here is `TabularResource`! We have tried and failed to make it good enough for all the use cases we have (and there are many more than you've described; e.g. `Table` is at the core of `goodtables`).

What we COULD do:
- Move the `Table` class to `datapackage`, and it will be almost what you've described (with simple improvements, exactly what you want), with good docs at the Data Package level, etc. Consider `Table` almost as a `TabularResource`; it's just decoupled from being a subclass of `Resource` (and that's done for a reason, as said above).
- Merge `tableschema` and `datapackage` (it's really simple). I already have a solid amount of duplicated code for working with descriptors/profiles between the libs. And now, without a clear separation between Table Schema and Data Package at the specs level (there is now Data Resource and other specs), it could make sense. It would allow us to have first-class docs in one place.
- Add sugar to the `Resource` class, like `rowStream`/`iter_rows()` if tabular, etc., if still needed. Or even merge `Resource`/`Table`
if we're sure it's good enough for all use cases.

@roll I've got a long way through a PoC in JS for all of this, which shows how an interface could look:
https://github.com/datahq/datahub-cli/blob/master/lib/utils/data.js
I've also been thinking that a lot of stuff like infer could be their own tiny libs and operate on Resources.
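One way to read the "tiny libs" idea is a standalone `infer` function that operates on any resource-like object rather than living inside one monolithic library. A minimal sketch, with hypothetical names and a deliberately simplified type-guessing rule (real tableschema inference is much richer):

```javascript
// Guess a field type from sample values (very simplified inference):
// all-integer strings -> integer, all-numeric -> number, otherwise string.
function inferFieldType(values) {
  if (values.every(v => /^-?\d+$/.test(v))) return 'integer';
  if (values.every(v => !isNaN(parseFloat(v)))) return 'number';
  return 'string';
}

// Operates on a plain resource-like object with headers and sample rows.
function infer(resource) {
  const fields = resource.headers.map((name, i) => ({
    name,
    type: inferFieldType(resource.sample.map(row => row[i])),
  }));
  return {fields};
}

const resource = {
  headers: ['id', 'price', 'name'],
  sample: [['1', '9.99', 'apple'], ['2', '3.50', 'pear']],
};
console.log(infer(resource));
// → { fields: [ { name: 'id', type: 'integer' },
//               { name: 'price', type: 'number' },
//               { name: 'name', type: 'string' } ] }
```

Because it takes a plain object, such a function could be published and tested on its own and composed with any Resource implementation.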
FIXED/WONTFIX
The current `datapackage-js@v1` now incorporates a lot from this feedback and from the initial `data.js`. So it fully covers the functionality and data flow from the "Here's a more detailed API definition" section of the original issue:
```javascript
// path can be a local path or a URL or whatever ...
const resource = await Resource.load({path: 'data.csv'})

// Descriptor following the Data Resource spec
// (minimal before inferring)
resource.descriptor

// returns a JS stream object (or a file-like object in Python)
resource.iter({stream: true})

// if this resource can be parsed to rows of objects
// (like tabular stuff can - or even GeoJSON!)
if (resource.tabular) {
  resource.table.iter()               // array of rows
  resource.table.iter({keyed: true})  // array of keyed rows
  resource.table.iter({stream: true}) // Node stream of rows (all row options, like keyed, can be combined)
  // etc
}

// infer types
// (inferring structure, e.g. the CSV separator, is not implemented yet)
resource.infer()

// now we have a schema, but casting can be disabled
if (resource.tabular) {
  resource.iter({cast: false}) // array of rows without casting
}

// if this makes sense
resource.table.headers

// something like this ...
resource.tabular // true
```
More detailed API and tutorial - https://github.com/frictionlessdata/datapackage-js#resource
Link to an alternative implementation, `data.js`:
The immediate suggestion is about using `getReadStream` (https://github.com/frictionlessdata/tableschema-js/blob/master/src/table.js#L218) to make a `stream` or `rawStream` method on Data Resource.

Aside: I don't think `getReadStream` needs to return a promise - it can be synchronous, I think!

However, this is just the lead-in to the big message ...
Context
There's some important context here. It comes from thinking a lot about the relationship of the various FD libraries and esp the relationship of something like tabulator (nice stream), tableschema (infer, schema, Table) and Data Package / Resource libs.
Long story short: I think with a few tweaks we can get something really nice as an interface, and we can centralize it in one lib (at the moment, whilst the libs are nicely factored, you do have to look in too many places -- e.g. this issue ends up being here but relates to 3 other libs).
First I must emphasize a key point:
People don't care about Data Packages / Resources; they care about opening a data file and doing something with it.
Put crudely: most people are doing stuff with a file (or dataset), and they want to grab it and read it, preferably in a structured way, e.g. as a row iterator -- sometimes inferring or specifying stuff along the way, e.g. encoding, formatting, field types.
=> Our job is to help users to open that file (or dataset) and stream it as quickly as possible.
Interface
Here's a stab at the minimal viable interface, focused on the file-only case:
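A minimal sketch of what such an interface could look like (all names are hypothetical, not the published datapackage-js API; rows are passed in directly just to keep the sketch runnable):

```javascript
// Minimal file-focused interface: load a resource, iterate rows,
// infer a schema on demand.
class Resource {
  // A real load() would detect local path vs URL and parse the format;
  // this sketch just wraps rows handed to it.
  static load(descriptor, rows) {
    const resource = new Resource();
    resource.descriptor = descriptor;
    resource._rows = rows || [];
    return resource;
  }
  // Row iterator: the main thing most users want.
  *iter() {
    for (const row of this._rows) yield row;
  }
  // Infer a trivial schema from the first row (real inference would
  // sample values and guess types).
  infer() {
    const first = this._rows[0] || [];
    this.descriptor.schema = {
      fields: first.map((_, i) => ({name: `field${i + 1}`, type: 'any'})),
    };
    return this.descriptor;
  }
}

const resource = Resource.load({path: 'data.csv'}, [['a', 1], ['b', 2]]);
for (const row of resource.iter()) console.log(row);
resource.infer(); // descriptor.schema now has two fields
```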
Here's a more detailed API definition:
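A sketch of the kind of detailed definition being referred to, under the assumption that it resembled the `load` / `tabular` / `table.iter` flow summarized elsewhere in this issue (names are illustrative only):

```javascript
// Table: iterate rows plainly or keyed by header.
class Table {
  constructor(headers, rows) {
    this.headers = headers;
    this._rows = rows;
  }
  iter({keyed = false} = {}) {
    if (!keyed) return this._rows.slice();
    // Keyed rows: one object per row, keyed by header name.
    return this._rows.map(row =>
      Object.fromEntries(this.headers.map((h, i) => [h, row[i]])));
  }
}

// Resource: exposes a Table when the data is tabular.
class Resource {
  static load({headers, rows}) {
    const r = new Resource();
    r.tabular = true;
    r.table = new Table(headers, rows);
    return r;
  }
}

const resource = Resource.load({headers: ['id', 'name'], rows: [[1, 'a'], [2, 'b']]});
console.log(resource.table.iter());              // → [[1, 'a'], [2, 'b']]
console.log(resource.table.iter({keyed: true})); // → [{id: 1, name: 'a'}, {id: 2, name: 'b'}]
```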
Comments
The above was done for JS but could be common to Python and any other language too (the tabulator-py `Stream` would need a `stream()` method rather than `open()`, or similar, but that's minor).
At the moment we have all the ingredients for this, but we often don't provide them in one place, e.g.
Asides