dat-ecosystem-archive / gasket

Build cross platform data pipelines [ DEPRECATED - More info on active projects and modules at https://dat-ecosystem.org/ ]

Data Pipeline Configuration and Datscript Proposal #17

Open melaniecebula opened 9 years ago

melaniecebula commented 9 years ago

Data Pipeline Configuration and Datscript Proposal


Goal

Create a data pipeline configuration that makes sense. This involves:

Pipeline: datscript --> hackfile parser --> hackfile --> gasket

Datscript


Keywords

Command-Types:

run: runs the following commands serially
pipe: pipes the following commands together
fork: runs the following commands in parallel; the next command-type waits for these commands to finish
background: similar to fork, but the next command-type does not wait for these commands to finish
map: multiway-pipe from one to many; pipes the first command to the rest of the commands
reduce: multiway-pipe from many to one; pipes the rest of the commands to the first command

Other Keywords:

pipeline: keyword for distinguishing a pipeline from other command-types

Datscript Syntax


Command-type {run, pipe, fork, background, map, reduce} followed by args in either of the two formats:

Format 1:

{command-type} {arg1}
  {arg2}
  {arg3}
  ....

Format 2:

{command-type}
  {arg1}
  {arg2}
  {arg3}
  ....

pipeline {pipeline-name} followed by either of the previous command-type formats:

pipeline {pipeline-name}
    {command-type}
      {arg1}
      {arg2}
      {arg3}
      ....  

Commands in detail


Run Command:

run will run each command serially; that is, it will wait for the previous command to finish before starting the next command.

The following all result in the same behavior, since the run command is serial:

Example 1:

run bar
run baz

Example 2:

run
  bar
  baz

Example 3 (not best-practice):

run bar
  baz
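The run semantics above map directly onto plain sequential shell commands. A minimal sketch, with bar and baz stood in by echo:

```shell
# run: serial execution -- each command starts only after the
# previous one has exited (bar/baz are stood in by echo)
echo bar
echo baz
```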

Pipe Command:

pipe will pipe each command into the next; that is, it takes the first command's output and pipes it to the next command, and so on until the end, where the final output goes to stdout. pipe with only one supplied command is undefined.

Example 1: prints "A" to stdout

pipe
  echo a
  transform-to-uppercase
  cat

Example 2: prints "A" to stdout

pipe echo a
  transform-to-uppercase
  cat

Example 3: INVALID, because both transform-to-uppercase and cat need input; since these are separate groupings, the lines are NOT piped together

pipe echo a
pipe transform-to-uppercase
pipe cat

Example 4: prints "A" to stdout, then prints "B" to stdout

pipe
  echo a
  transform-to-uppercase
  cat
pipe
  echo b
  transform-to-uppercase
  cat
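In shell terms, pipe is the | operator. A sketch of Example 1, with the hypothetical transform-to-uppercase command stood in by tr:

```shell
# pipe: chain stdout to stdin, ending at stdout
# (tr stands in for the hypothetical transform-to-uppercase)
echo a | tr '[:lower:]' '[:upper:]' | cat
# prints "A"
```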

Fork Command:

fork will run each command in parallel in the background. The next command-type will wait for these commands to finish. If there is no next command-type, gasket will implicitly wait for these commands to finish before exiting. The forked commands are not guaranteed to run in the order you supply them.

Example 1 (best-practice):

fork
  echo a
  echo b
run echo baz

Example 2: Same output as Example 1 (not best-practice)

fork echo a
  echo b
run echo baz

Example 3: Will print a and b to stdout (in either order) before exiting.

fork
  echo a
  echo b
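The fork-then-wait behavior can be sketched with the shell's & and wait (a rough analogy; gasket's actual implementation may differ):

```shell
# fork: launch jobs in parallel; a and b may print in either order
echo a &
echo b &
wait        # the next command-type blocks here until both jobs finish
echo baz    # runs only after the forked jobs complete
```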

Background Command

background will run each command in parallel in the background. The next command-type will NOT wait for these commands to finish. If there is no next command-type, gasket will NOT wait for these commands to finish before exiting. The background commands are not guaranteed to run in the order you supply them.

Example 1 (best-practice):

background
  echo a
  echo b
run echo baz

Example 2: Same output as Example 1 (not best-practice)

background echo a
  echo b
run echo baz

Example 3: Starts a node server. run echo a does not wait for run node server.js to finish. After completing the last command (in this case run echo a), gasket will NOT wait for background commands (run node server.js) to finish, but will terminate them cleanly before exiting.

background
  run node server.js
run echo a
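The background behavior resembles a shell job that is never waited on; gasket additionally terminates leftover jobs when it exits. A sketch, with sleep standing in for the long-running node server.js:

```shell
# background: launch a job and move on without waiting for it
sleep 10 &            # stand-in for: run node server.js
bg_pid=$!
echo a                # prints immediately; does not wait for the job
kill "$bg_pid"        # emulate gasket terminating the job on exit
```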

Map Command

map is a multiway-pipe from one to many. That is, it pipes the first command to the rest of the provided commands. The rest of the provided commands are treated as fork commands. Therefore, the map operation pipes the first command to each of the remaining commands in parallel (so no order is guaranteed). map with only one supplied command is undefined.

Example 1 (best-practice): In either order:

  map curl http://data.com/data.json
      dat import
      cat

Example 2: Same output as Example 1

  map 
      curl http://data.com/data.json
      dat import
      cat
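map's one-to-many fan-out can be sketched in shell by capturing the producer's output once and feeding it to each consumer. Note this serializes the consumers, whereas gasket would stream to them in parallel; here echo, tr, and cat are stand-ins for the curl producer and the dat import / cat consumers:

```shell
# map: one producer fanned out to several consumers
data="$(echo hello)"                                  # producer (stand-in for curl)
printf '%s\n' "$data" | tr '[:lower:]' '[:upper:]'    # consumer 1 -> HELLO
printf '%s\n' "$data" | cat                           # consumer 2 -> hello
```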

Reduce Command

reduce is a multiway-pipe from many to one. That is, it pipes the rest of the commands to the first command. The rest of the provided commands are treated as fork commands. Therefore, the reduce operation pipes each of the remaining commands to the first command in parallel (so no order is guaranteed). reduce with only one supplied command is undefined.

Example 1 (best-practice): In either order:

  reduce dat import
      papers
      taxonomy

Example 2: Same output as Example 1

  reduce
      dat import
      papers
      taxonomy
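reduce's many-to-one merge can be sketched by concatenating the producers' outputs into one consumer's stdin. This serializes the producers, whereas gasket would run them in parallel (so the interleaving would not be guaranteed); cat stands in for the dat import consumer, and echo for the papers / taxonomy producers:

```shell
# reduce: several producers merged into one consumer's stdin
{ echo papers; echo taxonomy; } | cat
```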

Defining and Executing a Pipeline

The pipeline keyword distinguishes a pipeline from the other command-types. Pipelines are a way of defining groups of command-types that can be treated as data (a command) to be run by any command-type.

Example 1: An import-data pipeline is defined. It imports 1, 2, and 3 in parallel before printing "done importing" to stdout. After converting from datscript to gasket, run the pipeline from the command line with gasket run import-data

pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing

Example 2: Same output as Example 1, but run from within the datscript file.

run import-data
pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing
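A pipeline is close in spirit to a shell function: a named, reusable group of command-types that other commands can invoke. A sketch of the import-data pipeline, with the import commands stood in by echo:

```shell
# pipeline import-data as a shell function (imports stood in by echo)
import_data() {
  echo "import 1" &    # fork: run the imports in parallel
  echo "import 2" &
  echo "import 3" &
  wait                 # the next command-type waits for the fork
  echo "done importing"
}
import_data            # i.e. run import-data
```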

You cannot nest pipeline definitions (they should always be at the shallowest layer), but you can nest as many command-types within a pipeline as you like.

Example 3: Nested command-types in a pipeline. Will print a, then print b, then print C

pipeline baz
  run
    echo a
    echo b
    pipe
      echo c
      transform-to-upper-case
      cat

Example 4: INVALID: Pipelines can only be defined at the shallowest layer.

pipeline foo
  run echo a
  pipeline bar

//TODO: Lots of tricky cases to think about here. Example 5: Executing non-run command-types on a pipeline. In this example, we define a pipeline foo, which has baz and bat defined (without a command-type provided). Then we map bar onto the pipeline foo (so we pipe bar into baz and also into bat, in parallel). One problem here is that pipeline foo might be invalid syntax.

map bar
  foo

pipeline foo
  baz
  bat

Misc

This issue is still a WIP. A lot of this concerns datscript directly, but it will ultimately shape gasket (so I think it belongs here).

melaniecebula commented 9 years ago

@karissa brought up a good point. Is there any distinction between a single pipe command and a run command? Thoughts @mafintosh @maxogden ?

max-mapper commented 9 years ago

maybe a pipe with a single command should cause a warning/error that says something like warning on line 7: pipe should only be used with multiple commands, otherwise use run etc

okdistribute commented 9 years ago

it sounds like in those cases we should just only use run in the documentation. if we don't have any examples using pipe on a single line it'll do most of the work for us

@maxogden that isn't a bad idea, could be nice to have a --verbose option that outputs stuff like this. but that lies more in feature request territory

melaniecebula commented 9 years ago

I can edit the documentation to say that one-line pipe commands are not officially supported (and then edit the examples that used them)

mafintosh commented 9 years ago

+1 for removing one-line pipes from docs

okdistribute commented 9 years ago

@melaniecebula yeah, that seems like that could go in the detailed documentation about the 'pipe' command

melaniecebula commented 9 years ago

Okay, I think that'll clean up some of the confusion for one-line map and one-line reduce commands as well.

melaniecebula commented 9 years ago

made changes: Added notes about undefined behavior for pipe, map, and reduce when only supplied with one command. Removed one-line pipe/map/reduce examples

max-mapper commented 9 years ago

inspiration: https://github.com/toml-lang/toml

max-mapper commented 9 years ago

I thought about the pipeline foo: syntax some more, and I kind of think we should drop the : and just have it be pipeline foo

Reasoning is that it's the only 'special' syntax we have, and on the call yesterday we came up with it because it adds a second namespace of commands which makes things more futureproof.

But I think it's a little too complex for a first version, and you can avoid most problems by being wise about reserving keywords in the design of your DSL.

Relevant IRC:

[screenshot of the relevant IRC discussion, 2015-01-10]

melaniecebula commented 9 years ago

That makes sense to me! I agree.

melaniecebula commented 9 years ago

I've updated the issue to reflect dropping the ":", but keeping "pipeline" as a keyword.

max-mapper commented 9 years ago

excellent, in the interest of simplicity I think we should try and keep any 'special' syntax out of the first version of hackfiles (this includes argument placeholders like $1 for now). So @melaniecebula if you wanna take a stab at forking mafintosh/hackfile-parser that would probably be a good place to start

melaniecebula commented 9 years ago

I agree. I think it's something that gasket can handle instead (the pipeline keyword and details like that). Sounds good! I plan on messing around with it after I get some lunch!

okdistribute commented 9 years ago

A ++1 to not having special syntax stuff in hack files

max-mapper commented 9 years ago

@melaniecebula I think @mafintosh and I discussed it and figured 'fork' and 'background' would be implemented the same, using npm-execspawn (similar to child_process.spawn). The actual specifics of the implementation we didn't discuss, though. Might need some sort of process cluster/state module (using e.g. require('run-parallel') or something)