melaniecebula opened this issue 9 years ago
@karissa brought up a good point. Is there any distinction between a single pipe command and a run command? Thoughts @mafintosh @maxogden ?
maybe a pipe with a single command should cause a warning/error that says something like `warning on line 7: pipe should only be used with multiple commands, otherwise use run`
etc
it sounds like in those cases we should just use run in the documentation. if we don't have any examples using pipe on a single line it'll do most of the work for us
@maxogden that isn't a bad idea, could be nice to have a --verbose option that outputs stuff like this. but that lies more in feature request territory
I can edit the documentation to say that one-line pipe commands are not officially supported (and then edit the examples that used them)
+1 for removing one-line pipes from docs
@melaniecebula yeah, that seems like that could go in the detailed documentation about the 'pipe' command
Okay, I think that'll clean up some of the confusion for one-line map and one-line reduce commands as well.
made changes: Added notes about undefined behavior for pipe, map, and reduce when only supplied with one command. Removed one-line pipe/map/reduce examples
inspiration: https://github.com/toml-lang/toml
I thought about the `pipeline foo:` syntax some more, and I kind of think we should drop the `:` and just have it be `pipeline foo`
Reasoning is that it's the only 'special' syntax we have, and on the call yesterday we came up with it because it adds a second namespace of commands which makes things more futureproof.
But I think it's a little too complex for a first version, and you can get most problems by being wise about reserving keywords in the design of your DSL.
Relevant IRC:
That makes sense to me! I agree.
I've updated the issue to reflect dropping the ":", but keeping "pipeline" as a keyword.
excellent, in the interest of simplicity I think we should try and keep any 'special' syntax out of the first version of hackfiles (this includes argument placeholders like `$1`, for now). So @melaniecebula if you wanna take a stab at forking mafintosh/hackfile-parser that would probably be a good place to start
I agree. I think it's something that gasket can handle instead (the pipeline keyword and details like that). Sounds good! I plan on messing around with it after I get some lunch!
A ++1 to not having special syntax stuff in hack files
@melaniecebula I think @mafintosh and I discussed it and figured 'fork' and 'background' would be implemented the same, using npm-execspawn (similar to child_process.spawn). the actual specifics of the implementation we didn't discuss though. Might need some sort of process cluster/state module (using e.g. `require('run-parallel')` or something)
Data Pipeline Configuration and Datscript Proposal
Goal
Create a data pipeline configuration that makes sense. This involves:
- Creating and refining a datscript file format (outlined here; previously discussed: https://github.com/datproject/discussions/issues/16)
- Making changes to hackfile-parser to parse datscript correctly
- Pipeline: datscript --> hackfile parser --> hackfile --> gasket
Datscript
Keywords
Command-Types:
- run: runs the following commands serially
- pipe: pipes the following commands together
- fork: runs the following commands in parallel; the next command-type waits for these commands to finish
- background: similar to fork, but the next command-type does not wait for these commands to finish
- map: multiway-pipe from one to many; pipes the first command to the rest of the commands
- reduce: multiway-pipe from many to one; pipes the rest of the commands to the first command
Other Keywords:
pipeline: keyword for distinguishing a pipeline from other command-types
Datscript Syntax
Command-type {run, pipe, fork, background, map, reduce} followed by args in either of the two formats:
Format 1:
Format 2:
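The format examples did not survive in this copy. As a hedged sketch, assuming hackfile-style indentation (args inline after the command-type, or indented underneath it), the two formats may have looked like:

```
# Format 1: command-type and its args on a single line
run echo a

# Format 2: command-type followed by indented commands
run
  echo a
  echo b
```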
pipeline {pipeline-name} followed by either of the previous command-type formats:
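A hedged sketch of the pipeline form under the same assumptions (the name `foo` is just a placeholder):

```
pipeline foo
  run echo a

pipeline bar
  run
    echo a
    echo b
```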
Commands in detail
Run Command:
run will run each command serially; that is, it will wait for the previous command to finish before starting the next command.
The following all result in the same behavior, since the run command is serial:
Example 1:
Example 2:
Example 3 (not best-practice):
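The three example blocks are missing from this copy. A hedged reconstruction of what equivalent serial forms may have looked like (the `echo` commands are placeholders):

```
# Example 1: several commands under one run
run
  echo a
  echo b

# Example 2: one inline run per line
run echo a
run echo b

# Example 3 (not best-practice): single-command run blocks
run
  echo a
run
  echo b
```

All three behave identically because run always waits for the previous command before starting the next.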
Pipe Command:
pipe will pipe each command together; that is, it pipes the output of the first command into the next command, and so on until the end of the group, with the final output going to std.out. pipe with only one supplied command is undefined.
Example 1: prints "A" to std.out
Example 2: prints "A" to std.out
Example 3: INVALID because both transform-to-uppercase and cat need input (and since these are separate groupings, these lines are NOT piped together)
Example 4: prints "A" to std.out, prints "B" to std.out
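The example blocks are missing from this copy. A hedged reconstruction, where `transform-to-uppercase` comes from the Example 3 description and the `echo`/`cat` commands are assumed placeholders:

```
# Example 1: prints "A" to std.out
pipe
  echo a
  transform-to-uppercase

# Example 2: prints "A" to std.out (extra pass-through stage)
pipe
  echo a
  transform-to-uppercase
  cat

# Example 3: INVALID; the two groupings are NOT piped together,
# so transform-to-uppercase and cat never receive input
pipe
  transform-to-uppercase
pipe
  cat

# Example 4: prints "A" to std.out, prints "B" to std.out
pipe
  echo a
  transform-to-uppercase
pipe
  echo b
  transform-to-uppercase
```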
Fork Command:
fork will run each command in parallel in the background. The next command-type will wait for these commands to finish. If there is no next command-type, gasket will implicitly wait for these commands to finish before exiting. Forked commands are not guaranteed to execute in the order you supply them.
Example 1 (best-practice):
Example 2: Same output as Example 1 (not best-practice)
Example 3: Will print a and b to std.out (in either order), before exiting.
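The example blocks are missing from this copy. A hedged sketch of the two recoverable cases (`echo done` is a placeholder for whatever followed the fork in Example 1):

```
# Example 1 (best-practice): the following run waits for both
# forked commands to finish before printing
fork
  echo a
  echo b
run echo done

# Example 3: with no next command-type, gasket implicitly waits
# for the forked commands, printing a and b in either order
fork
  echo a
  echo b
```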
Background Command
background will run each command in parallel in the background. The next command-type will NOT wait for these commands to finish. If there is no next command-type, gasket will NOT wait for these commands to finish before exiting. Background commands are not guaranteed to execute in the order you supply them.
Example 1 (best-practice):
Example 2: Same output as Example 1 (not best-practice)
Example 3: Starts a node server; `run echo a` does not wait for `run node server.js` to finish. After completing the last command (in this case, `run echo a`), gasket will NOT wait for background commands (`run node server.js`) to finish, but will properly exit them.
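The example block for the server case is missing from this copy. A hedged sketch of what it likely looked like, given the Example 3 description:

```
# The server runs in the background; run echo a proceeds
# immediately, and gasket exits (cleaning up the server)
# once echo a completes
background
  node server.js
run echo a
```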
Map Command
map is a multiway-pipe from one to many. That is, it pipes the first command to the rest of the provided commands. The rest of the provided commands are treated as fork commands. Therefore, the map operation pipes the first command to each of the remaining commands in parallel (and therefore no order is guaranteed). map with only one supplied command is undefined.
Example 1 (best-practice): In either order:
Example 2: Same output as Example 1
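The example blocks are missing from this copy. A hedged sketch, with `transform-to-uppercase` borrowed from the pipe section and the other commands assumed placeholders:

```
# echo a is piped into both downstream commands in parallel,
# so their outputs can appear in either order
map
  echo a
  transform-to-uppercase
  cat
```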
Reduce Command
reduce is a multiway-pipe from many to one. That is, it pipes the rest of the commands to the first command. The rest of the provided commands are treated as fork commands. Therefore, the reduce operation pipes each of the provided commands to the first command in parallel (and therefore no order is guaranteed). reduce with only one supplied command is undefined.
Example 1 (best-practice): In either order:
Example 2: Same output as Example 1
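The example blocks are missing from this copy. A hedged sketch, with all commands assumed placeholders:

```
# echo a and echo b are both piped into cat in parallel,
# so cat may receive their output in either order
reduce
  cat
  echo a
  echo b
```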
Defining and Executing a Pipeline
The pipeline keyword distinguishes a pipeline from the other command-types. Pipelines are a way of defining groups of command-types that can be treated as data (a command) to be run by any command-type.
Example 1: An import-data pipeline is defined. It imports 1, 2, 3 in parallel before printing "done importing" to std.out. After converting from datscript to gasket, the pipeline can be run from the command line with `gasket run import-data`.
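The example block is missing from this copy. A hedged sketch, where `import` stands in for whatever import command the original used:

```
pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing
```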
Example 2: Same output as Example 1, but run from within the datscript file.
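The example block is missing from this copy. A hedged sketch of invoking the pipeline from within the datscript file itself (`import` is again a placeholder):

```
pipeline import-data
  fork
    import 1
    import 2
    import 3
  run echo done importing

run import-data
```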
You cannot nest pipeline definitions (they should always be at the shallowest layer), but you can nest as many command-types within a pipeline as you like.
Example 3: Nested command-types in a pipeline. Will print a, then print b, then print C
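The example block is missing from this copy. A hedged sketch consistent with the described output (a, then b, then C), reusing `transform-to-uppercase` from the pipe section:

```
pipeline foo
  run
    echo a
    echo b
  pipe
    echo c
    transform-to-uppercase
```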
Example 4: INVALID: Pipelines can only be defined at the shallowest layer.
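The example block is missing from this copy. A hedged sketch of the invalid shape (names are placeholders):

```
# INVALID: pipeline definitions cannot be nested
pipeline foo
  pipeline bar
    run echo a
```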
//TODO: Lots of tricky cases to think about here. Example 6: Executing non-run command-types on a pipeline. In this example, we define a pipeline foo, which has baz and bat defined (without a command-type provided). Then we map bar onto the pipeline foo (so we pipe bar into baz and also into bat, in parallel). One problem here is that `pipeline foo` might be invalid syntax.
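The Example 6 block is missing from this copy. A hedged sketch of the described shape; as the TODO notes, a pipeline whose body has no command-type may itself be invalid syntax:

```
# possibly-invalid: baz and bat have no command-type
pipeline foo
  baz
  bat

# bar is piped into both baz and bat, in parallel
map
  bar
  foo
```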
Misc
This issue is still a WIP. A lot of this concerns datscript directly, but it will ultimately shape gasket (so I think it belongs here).