frictionlessdata / tableschema-js

A JavaScript library for working with Table Schema.
http://frictionlessdata.io/
MIT License
82 stars 27 forks

Can we pass a row stream as a source? #92

Closed anuveyatsu closed 7 years ago

anuveyatsu commented 7 years ago

I noticed that we cannot pass a row stream as a source: the Table class would try to create a row stream from it again and error out - https://github.com/frictionlessdata/tableschema-js/blob/master/src/table.js#L220-L253

Is there a workaround for this case?

pwalsh commented 7 years ago

@anuveyatsu sounds useful to accept a stream. Can you do a PR for us to consider/review?

roll commented 7 years ago

@anuveyatsu

source (String/Array[]/Function) - data source (one of):

  • local CSV file (path)
  • remote CSV file (url)
  • array of arrays representing the rows
  • function returning readable stream with CSV file contents

Have you tried to pass a stream constructor to the Table class?

const source = () => // create your stream 
const table = await Table.load(source)
anuveyatsu commented 7 years ago

@roll I will try this approach and will update here. @pwalsh if necessary, I can do a PR.

roll commented 7 years ago

@anuveyatsu Please re-open if needed. Table accepts a stream constructor because, AFAIK, Node.js streams are not rewindable by default, but Table needs the ability to read the data more than once.

rufuspollock commented 7 years ago

@roll whilst you can't rewind a stream, I think it is standard practice to duplicate Node streams using a PassThrough stream, e.g.

var fs = require('fs')
var stream = require('stream')
var contents = fs.createReadStream('./bigfile') // greater than 65KB
var stream1 = contents.pipe(new stream.PassThrough())
var stream2 = contents.pipe(new stream.PassThrough())

// Consume both streams concurrently: if one PassThrough is left unread,
// its buffer fills up and backpressure stalls the shared source.
stream1.on('data', function (data) { console.log('s1', data.length) })
stream2.on('data', function (data) { console.log('s2', data.length) })

See also: https://stackoverflow.com/questions/19553837/node-js-piping-the-same-readable-stream-into-multiple-writable-targets

I get that you can move the responsibility for doing this onto clients (as you have done), but it could also be handled internally.

Note also here: we don't really want the Table object per se - we just want to use infer 😄 -- this relates to the API discussion we've been having. Basically what would be perfect IMO is a simple method like:

infer(stream) => schema

This has a clean, single purpose and could even be its own mini library (which would make it much easier for others to reuse and contribute to ...) - cf the discussion on the FD channel about these algorithms.

roll commented 7 years ago

@rufuspollock I've created a feature request for single-stream support - https://github.com/frictionlessdata/tableschema-js/issues/95. I'm still not sure it's super critical, because this stream-related functionality is mostly for system integrators, who can handle it on the client side; ordinary users usually pass file paths, not streams. But if it doesn't complicate the Table class too much, I think it's worth trying.

Note also here: we don't really want the Table object per se - we just want to use infer :smile: -- this relates to the API discussion we've been having. Basically what would be perfect IMO is a simple method like: const infer(stream) => schema

I think it's a readme issue - in tableschema@1.0 you can pass to infer anything that the Table class accepts:

const descriptor = await infer(array)
const descriptor = await infer('data.csv')
const descriptor = await infer('http://example.com/data.csv')
const descriptor = await infer(streamConstructor)
const descriptor = await infer(stream) // if we implement https://github.com/frictionlessdata/tableschema-js/issues/95