creationix / js-git

A JavaScript implementation of Git.
MIT License
3.83k stars · 265 forks

Simple streams feedback #17

Closed Raynos closed 11 years ago

Raynos commented 11 years ago

See full spec here

Sinks as object

I think having { consume } or some other name would be nice

var stream = { read, stop }
var sink = { consume }

cc @dominictarr

creationix commented 11 years ago

@dominictarr

> so, I can think of a few situations where the reason might be important. example: on tcp you want to know if the stream failed because there was no server, or if it dropped the connection, or it timed out.

Sorry if I didn't explain right, but of course you need the reason downstream. And you'll still have it; that's what the err argument in read's callback is for. I was talking about the err argument that would be in abort, before the callback.

All of these error cases you mentioned would still be reported and come out of the continuable that the sink returns.
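To make that concrete, here is a minimal sketch (all names, including collect and failingSource, are hypothetical illustrations, not part of the spec) of how a source error comes out of the continuable a sink returns:

```javascript
// A sink that consumes a simple-stream and returns a continuable.
// Any error the source reports through read's callback surfaces there.
function collect(stream) {
  // The continuable: a function that takes a single node-style callback.
  return function (callback) {
    var items = [];
    function onRead(err, item) {
      if (err) return callback(err);                        // source error surfaces here
      if (item === undefined) return callback(null, items); // natural end
      items.push(item);
      stream.read(onRead);                                  // pull the next item
    }
    stream.read(onRead);
  };
}

// A source that emits one value and then fails, e.g. a dropped connection.
function failingSource() {
  var step = 0;
  return {
    read: function (callback) {
      step += 1;
      if (step === 1) callback(null, "chunk");
      else callback(new Error("ECONNRESET"));
    }
  };
}

var result;
collect(failingSource())(function (err) { result = err; });
// result is now the Error("ECONNRESET") reported by the source
```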

creationix commented 11 years ago

The "reason" that's not important is for a source to know why its consumer is going to no longer consume from it. It doesn't care why; it just needs to know so it can clean up stuff. It's downstream, the consumer, that cares why stuff is broken.
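A sketch of that idea, with a hypothetical source whose abort ignores the reason entirely and just cleans up:

```javascript
function makeSource() {
  var closed = false;
  return {
    read: function (callback) {
      if (closed) return callback();    // stream is over: (undefined, undefined)
      callback(null, "data");
    },
    abort: function (err, callback) {
      // `err` is ignored on purpose: cleanup is the same no matter the reason.
      closed = true;
      callback();
    }
  };
}

var source = makeSource();
var aborted = false;
source.abort(new Error("timeout"), function () { aborted = true; });
// After abort, reads report the end of the stream regardless of why it ended.
```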

creationix commented 11 years ago

@Raynos

> sink = function that accepts a stream (and returns a continuable)

> It might make sense for sink to be a function that returns an object with a consume method for purposes of structural typing. That consume method then accepts a stream and returns a continuable

There is no reason you can't do that. I just don't want to force such a verbose construct in the spec since it's not needed, or even wanted, most of the time.

Structural typing matters more for anonymous things that are passed around and used as return values and arguments all the time. Streams definitely fall under this category. Sinks are more like API endpoints that consume streams; I don't think they need structural typing as much. I know that fs.writeStream(stream, path, options) -> continuable is a sink because of its API docs, its name, and its documented signature.

creationix commented 11 years ago

So usage would be:

fs.writeStream(path, options).consume(stream)(callback);

vs

fs.writeStream(stream, path, options)(callback);

With the first one, I feel an urge to use a promise instead of a continuable so that it's .consume(stream).then(callback)

Raynos commented 11 years ago

Aw man promises. That's going to be a hard battle.

W3C is going to be like "your api seems nice but read() should return a promise"

creationix commented 11 years ago

And I'll say, well if you insist on taking-no-args-and-then-returning-an-object-that-has-a-prototype-that-has-a-method-that-accepts-a-callback-and-an-errback-and-optional-progressback instead of just take-the-callback, then I guess you insist on complexity and better change the name away from "simple streams"

Raynos commented 11 years ago

@creationix you are preaching to the choir.

Gozala commented 11 years ago

I'm jumping into this a little late (and maybe I shouldn't at all), but I'll still provide my feedback based on my experience working on streams / signals / channels or whatever you want to call them.

  1. I think in nature there is just input; looked at from the other side, it is an output. This is to say you don't need sink or duplex, you just need a data type representing a collection of eventual data chunks. Then you can write transformation functions that transform a -> b. If you want duplex, it is just a pair of the same data type, where data is pushed from the left side into the input end and from the right side into the output end. A sink is just a function that is aware of the data type's interface and can therefore read data out of it and do whatever it needs to. That's also where reduce functions are somewhat relevant, since a reduce accumulates state by calling a reducer with the previous state and the next value.
  2. It took me a while, but I came to understand that data types (or shapes, if you prefer) are a lot more composable. That is to say, I dislike .end, .abort, and .close. I think a stream / signal API does not need any of these, although they can be added as sugar if desired. If I were doing it today I'd define a stream / signal as simply as this:
var stream = {
  spawn: function(next) {
    next(1);
    next(2);
    next(END);
  }
}

Where END is whatever special value you desire; its type or shape does not matter as long as it's specified. If a stream has an error it can pass it along, and the good thing is that JS already has an Error type for this.

var stream = {
  spawn: function(next) {
    next(1);
    next(2);
    next(Error("oops!"));
  }
}
  3. A sink, or I'd rather say consumer, can have direct coordination with an input without having to hold onto it or have methods like abort on every transformation. All it needs to do is return a value:
function take(n, input) {
  return {
    spawn: function(next) {
      var left = n;
      input.spawn(function(value) {
        if (left === 0) return ABORT;
        left -= 1;
        return next(value);
      });
    }
  };
}

Of course our input will have to recognize ABORT and send back END. If the input does not recognize some of these messages, we can still wrap it in a normalizer to force it to comply, or at the very least prevent its brokenness from infecting the rest of the pipeline.
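A sketch of such a normalizer, assuming END and ABORT are agreed-upon sentinel values as in the examples above (normalize itself is a hypothetical helper):

```javascript
var END = {};    // assumed sentinel: end of stream
var ABORT = {};  // assumed sentinel: consumer wants no more values

function normalize(input) {
  return {
    spawn: function (next) {
      var done = false;
      input.spawn(function (value) {
        if (done) return ABORT;        // shield the consumer from extra values
        var result = next(value);
        if (result === ABORT || value === END) {
          done = true;
          return ABORT;                // tell a compliant source to stop
        }
        return result;
      });
    }
  };
}

// A "broken" source that keeps pushing even after the consumer aborts.
var rude = {
  spawn: function (next) {
    next(1); next(2); next(3); next(END);
  }
};

var seen = [];
normalize(rude).spawn(function (value) {
  if (value === END) return;
  seen.push(value);
  return seen.length >= 2 ? ABORT : undefined;
});
// seen is [1, 2]: the rude source's extra values never reach the consumer
```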

  4. Of course we can't talk about streams without back-pressure, but you may have noticed that 3 actually illustrates I/O coordination, and back-pressure is just a different form of it. To be more specific, a consumer can return data indicating that back-pressure should be applied, and a source that respects it will do so. And if you happen to deal with streams that don't respect back-pressure, that's still not a big deal, because you can write buffer(input), which will respect back-pressure and buffer up the input for the consumer. I have explored this technique in the fs-reduce library, where all of the streams respect back-pressure; but if you happen to face a stream (like an array) that does not, it will just be buffered up until the pressure is released.
  5. I don't think there is a winner in pull vs. push streams, and if there were one it would be push, since in nature we have events we can't actually pause (users clicking their mice, for example). But that's not a big deal either, since it's just another flavor of I/O coordination: all you need to do is write pull(stream), which will give you a pull-based API (maybe one that matches the min-stream proposal), and all it has to do is:
function pull(stream) {
  var buffer = [];
  var reads = [];
  var resume = function() {
    resume = function() {};  // only spawn once; PAUSE swaps in a real resume
    stream.spawn(accumulate);
  };

  function drain() {
    while (reads.length && buffer.length) reads.shift()(buffer.shift());
    return buffer.length ? PAUSE : null;
  }

  function accumulate(value) {
    // save the value; signal PAUSE if the consumer isn't keeping up
    buffer.push(value);
    return drain();
  }

  // a cooperating stream will call this with a function that resumes it
  function PAUSE(go) { resume = go; }

  return {
    read: function(callback) {
      reads.push(callback);
      // start (or resume) the stream whenever the buffer runs dry
      if (drain() !== PAUSE) resume();
    }
  };
}

This is a queue-like (imperfect) API, but it should give an idea of how different kinds of APIs can easily be created in the form of simple functions that compose.
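The buffer(input) idea mentioned earlier can be sketched the same way: wrap a source that pushes everything at once (like an array) so a consumer can pull at its own pace. The names here (buffer, fromArray) are hypothetical illustrations:

```javascript
// Absorb a pressure-ignoring spawn-based source into a queue,
// and expose a pull-style read so the consumer sets the pace.
function buffer(input) {
  var queue = [];
  input.spawn(function (value) { queue.push(value); }); // eager, no pressure
  return {
    read: function (callback) { callback(null, queue.shift()); }
  };
}

// An array source pushes all of its values immediately.
function fromArray(items) {
  return {
    spawn: function (next) { items.forEach(next); }
  };
}

var buffered = buffer(fromArray([1, 2, 3]));
var first;
buffered.read(function (err, value) { first = value; });
// The source already emitted everything, but the consumer still
// receives one value per read from the queue.
```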

That sums up my opinion on how streams should work, based on the experience of building them at least four times over the past few years. I hope this will be helpful and not totally boring and insane.

mhart commented 11 years ago

I really like the abort method - it's something that still puzzles me about streams as they stand in v0.10.x - hence my as-yet-unanswered question on the node.js group.

The only other question I'd pose (not sure if it has been already) is whether you should include anything about ES6 iterators (and generators) in the spec - I notice they're not mentioned at all and figure they should at least be referred to, even if it's to say that there's no goal to make them compatible, or whatever.

creationix commented 11 years ago

@Gozala I was wondering when/if you would comment on this thread.

Yes, I agree that the only interface that needs to be specified is the readable stream.

As far as using special tokens for END, ABORT, and Error classes: I'd rather not. instanceof Error doesn't work if the error is from another context, and there is no Error.isError helper function, though Object.prototype.toString.call(err) === "[object Error]" seems to be reliable. I'd hate to force such a verbose type check on each and every data chunk that goes through the stream. Having two positional arguments tells us a lot and speeds up such checks.
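The check referred to above, wrapped in a helper for illustration (isError is a hypothetical name, not an existing API):

```javascript
// instanceof Error fails for errors created in another context (iframe,
// vm sandbox), but the [[Class]] string reported by
// Object.prototype.toString is reliable across contexts.
function isError(value) {
  return Object.prototype.toString.call(value) === "[object Error]";
}

// Works for Error and its subclasses, rejects everything else:
// isError(new Error("x"))      -> true
// isError(new TypeError("x"))  -> true
// isError("x")                 -> false
```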

Yes back-pressure can be done with a manual side-channel and pause and resume commands, but I much prefer the implicit backpressure provided by pull style. In my experience I'm much more likely to get it right if I'm using pull-streams than writing the back-pressure by hand using manual pause and resume.
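A sketch of why pull style gives back-pressure for free: the next read only happens after the consumer is done with the previous chunk, so a slow writer naturally slows the whole pipeline. copy, source, and writeChunk here are hypothetical stand-ins, not spec names:

```javascript
// Copy a pull-stream into a writer; no pause/resume needed, because we
// simply don't ask for the next chunk until the previous write finishes.
function copy(source, writeChunk, callback) {
  source.read(function onRead(err, chunk) {
    if (err) return callback(err);
    if (chunk === undefined) return callback();   // natural end
    writeChunk(chunk, function (err) {
      if (err) return callback(err);
      source.read(onRead);                        // pull only when ready
    });
  });
}

// Demo with a synchronous source and writer to show the interleaving.
var order = [];
var n = 0;
var source = {
  read: function (cb) {
    n += 1;
    order.push("read " + n);
    cb(null, n <= 2 ? n : undefined);
  }
};
function writeChunk(chunk, cb) { order.push("write " + chunk); cb(); }

copy(source, writeChunk, function () { order.push("done"); });
// order: ["read 1", "write 1", "read 2", "write 2", "read 3", "done"]
```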

Yes there are general helpers that convert between types. I am publishing a module right now called push-to-pull that lets you write the easier push filters, but use them as back-pressure honoring pull-filters without writing your own queues. A reduce transform could easily be written as could a filter transform.

Thanks for the input.

creationix commented 11 years ago

@mhart I'm glad you like abort. You can thank @dominictarr for convincing me to add that to the official spec. I really didn't want to.

I've also considered what a stream would look like if we had access to ES6 generators. I think the simplest construct would be a generator that yielded values.

function* source() {
  yield 1
  yield 2
  yield 3
}

// Consume like any other generator to get [1, 2, 3]

But like most I/O streams, you can't yield everything at once, so the generator could yield continuables instead of raw values.

By happy coincidence, simple-stream's read function is itself a continuable. So turning a simple-stream into a generator based stream is as simple as:

function* readForever() {
  // Create a simple-stream
  var stream = fs.readStream("myfile.txt");
  // and yield its read function forever
  while (true) yield stream.read;
}

In fact my gen-run library does something very much like this, but as a control-flow helper library.

run(function* () {
  var stream = fs.readStream("myfile.txt");
  var data;
  var items = [];
  while (data = yield stream.read) {
    items.push(data);
  }
  return items;
});

I don't want to require generators for streams since it will be a long time before most JS environments can assume ES6 generators. I am, however, very aware of how they will interact and keep these things in mind.

Gozala commented 11 years ago


On Tuesday, 2013-07-02 at 19:36, Tim Caswell wrote:

> @Gozala (https://github.com/Gozala) I was wondering when/if you would comment on this thread. Yes, I agree that the only interface that needs to be specified is the readable stream. As far as using special tokens for END, ABORT, and Error classes, I'd rather not. instanceof Error doesn't work if the error is from another context. There is no Error.isError helper function though Object.prototype.toString.call(err) === "[object Error]" seems to be reliable. I'd hate to force such a verbose type check on each and every data chunk that goes through the stream. Having two positional arguments tells us a lot that speeds up such checks.

Actually, you don't have to handle them at all; it's a matter of just having some sort of transform operation that passes meta values between input and output. For example, you could have folds like this:

https://github.com/Gozala/signalize/blob/master/core.js#L146-L172

Then all the filter, map, drop, etc. can be easily implemented without concerning yourself with either error checking or special value handling: https://github.com/Gozala/signalize/blob/master/core.js#L192-L284

> Yes back-pressure can be done with a manual side-channel and pause and resume commands, but I much prefer the implicit backpressure provided by pull style. In my experience I'm much more likely to get it right if I'm using pull-streams than writing the back-pressure by hand using manual pause and resume.

The main issue with pull is that it's inherently slower, and you can't apply category theory to optimize the transformation pipeline. And of course you cannot represent streams that can't be paused or stopped, like user events. That is why I prefer to decouple the notion of a stream from its consumption semantics: pull is just one of the ways, there are more, and from case to case you may want different ones. I have started writing a spec for push & pull signals that in the best case perform as push, in the worst case de-optimize to plain pull, and of course cover any case in between: https://gist.github.com/Gozala/5314269

It's slightly out of date though.

> Yes there are general helpers that convert between types. I am publishing a module right now called push-to-pull that lets you write the easier push filters, but use them as back-pressure honoring pull-filters without writing your own queues. A reduce transform could easily be written as could a filter transform. Thanks for the input.


dominictarr commented 11 years ago

There isn't gonna be one right answer to this stream thing, not with the languages we have today. Maybe in some future sci-fi language, but today the best we can hope for is to fit some fairly broad but non-exhaustive set of use cases.

Anyway, it's not so hard to write custom stream stuff that there needs to be One True Stream. You can always convert from one to the other, and pick the stream that best suits the way you think and the sort of programming you do.

creationix commented 11 years ago

@Gozala, I'm having trouble understanding you. I do like the idea of the main data channel being only data and letting everything else go through a meta channel.

I'm pretty sure I want pull based for several reasons. Besides the natural back-pressure it provides, it also provides a nice 1:1 mapping between continuation chains, since each callback will be called exactly once for each read call. This makes tracing and error handling much easier. I'm currently working on improving domains in node.js and wish that everything in node had this nice 1:1 mapping. It makes for a very simple and robust system when every stack has a direct and obvious parent stack that initiated it.

I know that you can't pause some inputs easily (like user clicks or http requests), but that doesn't mean pull-streams are a bad idea. You just buffer the events at the source waiting for someone to pull them. Even those cases can usually be paused somewhat in extreme cases (you could disable the UI button if the stream wasn't ready to handle it or tell the TCP socket to stop accepting connections)
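A sketch of that buffering idea, with hypothetical names: push events queue up at the source until someone pulls them, so an unpausable input still works behind a pull API:

```javascript
// A source that accepts pushed events (clicks, requests) and hands them
// out one at a time through pull-style reads.
function eventSource() {
  var events = [];   // events that arrived before anyone asked
  var waiting = [];  // readers that asked before anything arrived
  return {
    push: function (event) {            // called by the event emitter
      if (waiting.length) waiting.shift()(null, event);
      else events.push(event);          // no reader yet: hold it
    },
    read: function (callback) {         // called by the puller
      if (events.length) callback(null, events.shift());
      else waiting.push(callback);      // no event yet: wait for one
    }
  };
}

var clicks = eventSource();
clicks.push("click 1");                 // arrives before anyone reads

var got = [];
clicks.read(function (err, e) { got.push(e); });  // drains the buffer
clicks.read(function (err, e) { got.push(e); });  // waits...
clicks.push("click 2");                 // ...until the next event lands
// got is ["click 1", "click 2"]
```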

We already have two data channels in the form of the two arguments to onRead callbacks: (err, item). When item is anything other than undefined, it's a data item; otherwise it's a meta value that signifies a natural end or an error end. The channel for closing an upstream source goes in the other direction, so it can't be encoded here.
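For illustration, the dispatch on those two positional arguments can be written as a tiny helper (classify is a hypothetical name, not part of the spec):

```javascript
// (err, item) covers all three cases without sentinel values:
// a defined item is data; otherwise err distinguishes error end
// from natural end.
function classify(err, item) {
  if (item !== undefined) return "data";  // main channel
  if (err) return "error";                // meta: error end
  return "end";                           // meta: natural end
}

// classify(null, "chunk")     -> "data"
// classify(null, undefined)   -> "end"
// classify(new Error("boom")) -> "error"
```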

Do you have any ideas that are modifications to the current design that could simplify this?