SoftwareMarbles / lazy

Hackable Coding Assistant
http://getlazy.org
MIT License

Engine chain #54

Closed · neboysa closed this 7 years ago

neboysa commented 7 years ago

Major update to the way engines are run. Instead of running every engine registered for a certain file type in no specific order, this update introduces an engine_pipeline config section in lazy.yaml that defines the order in which each engine is run.

Engines can be run in parallel (batch) mode, in which case their outputs are merged.

Engines can also be run sequentially, one after the other (sequence). In this case, the output of each engine is passed to the next engine in the chain through the context parameter.
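For the sequence case, here is a rough sketch of what the second engine in a chain might receive through the context parameter (the field names are assumptions for illustration, not the actual schema):

context:
  previousStepResults:        # hypothetical name: accumulated output of engines that ran earlier in the sequence
    github-access:
      status: ok
      pullRequest: 17         # data that github-access chose to expose to later engines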

Here is an example lazy.yaml config:

# Each file is run through this pipeline. 
engine_pipeline:  
  batch:                      # batch: - run engines asynchronously (in parallel)
    - run: file-stats
    - sequence:               # pullreq engine depends on github-access, so we run them sequentially
      - run: github-access
      - run: pullreq
    - sequence:               # Linters. Run any and all linters
      - batch:                # in parallel,
        - run: emcc
        - run: css
        - run: html
        - run: eslint
        - run: yaml
        - run: stylelint
        - run: php-l
        - run: pmd-java
        - run: tidy-html
      - run: postp           # and apply postprocessor to their output
ierceg commented 7 years ago

@neboysa needs rebase

ierceg commented 7 years ago

@neboysa Commenting just on the design of the configuration (haven't seen code yet):

  1. I love it.
  2. I don't think batch is a good name. Maybe concurrent and sequential (though I do like sequence better than sequential). Maybe parallel and serial (async is widely used and uses that nomenclature, and it's obviously the same as parallel and serial ports :)
  3. We might be able to offload language dispatching into an engine. In that case lazy has even less hard-coded logic.
  4. Maybe we can skip run: and just have something like this:
engine_pipeline:  
  ~parallel:                      # run engines asynchronously (in parallel)
    - file-stats
    - ~serial:               # pullreq engine depends on github-access, so we run them sequentially
      - github-access
      - pullreq
    - ~serial:                # Linters. Run any and all linters
      - ~parallel:                # in parallel,
        - emcc
        - css
        - html
        - eslint
        - yaml
        - stylelint
        - php-l
        - pmd-java
        - tidy-html
      - postp           # and apply postprocessor to their output

I'm not sure if this improves the readability - just throwing it out there for consideration. I wish Unix had a parallel symbol like it has for serial (|)

  5. Should we have a single point of entry into engines? Right now we have POST /file, but that doesn't mean anything for a hypothetical postp engine. So maybe just POST /process and then have lazy know what pipeline to use for what kind of request (a possible request shape is sketched at the end of this comment).

  6. In your example you put:

sequence:
  - github-access
  - pullreq

This paradigm could solve #46 and, in general, solve the problem of engines providing data to other engines. Pipelines are not as flexible as direct requests, but they also involve far less "coupling" (updated: this used to say "more" instead of "less").
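On the single-entry-point idea above (POST /process), a hypothetical sketch of what such a request might carry; lazy currently exposes POST /file, and every field below is an assumption made up for illustration:

# POST /process (hypothetical)
kind: file                    # lazy would pick the pipeline based on the kind of request
metadata:
  path: src/index.js
  language: JavaScript
content: |
  const answer = 42;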

neboysa commented 7 years ago

I have changed the lazy.yaml to be as @ierceg suggested (without run:):

engine_pipeline:
  batch:                      # batch: - run engines asynchronously (in parallel)
    - file-stats: 
    - sequence:               # pullreq engine depends on github-access, so we run them sequentially
      - github-access: 
      - pullreq: 
    - sequence:               # Linters. Run any and all linters
      - batch:                # in parallel,
        - emcc: 
        - css: 
        - html: 
        - eslint: 
        - yaml: 
        - stylelint: 
        - php-l: 
        - pmd-java: 
        - tidy-html: 
      - postp:                # apply postprocessor to their output
      - reducer:              # and, finally, limit the number of reported errors
          maxWarningsPerRule: 5     # allow up to 5 warnings for same rule-id
          maxWarningsPerFile: 300   # allow up to 300 warnings per file
neboysa commented 7 years ago

I have removed maxWarningsPerFile and maxWarningsPerRule from lazy. They are now part of the new lazy-reducer-engine.

ierceg commented 7 years ago

@neboysa what's the reasoning behind making engine lists in batch and sequence array of objects and not array of strings?

Also, batch... let's try to find something else:

sequence - cluster (yeah, not so great)

sync - async (ambiguous)

serial - parallel (I know you don't like it :)

sequence - concurrence (ugh)

serial - concurrent

sequence - concur (https://www.merriam-webster.com/dictionary/concur definition no. 4)

neboysa commented 7 years ago

B/c each engine in a sequence/batch can now have its own config properties. For example, the reducer engine has maxWarningsPerRule and maxWarningsPerFile params. This is different from "global" properties for each engine, since it allows each engine to be included multiple times in the pipeline w/ different config.

Sent from my ZX Spectrum.
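A hedged sketch of that last point (the duplicate entries below are made up for illustration): the same reducer engine appearing twice in one pipeline, each entry with its own params:

engine_pipeline:
  sequence:
    - reducer:
        maxWarningsPerRule: 5      # first pass: cap repeats of the same rule
    - reducer:
        maxWarningsPerFile: 300    # second pass: cap the total per file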

ierceg commented 7 years ago

@neboysa ah... so each is a different instance of an engine? Nice. Data sink engines could be done like that: create a sink engine that puts whatever it gets into a data store, but then define the data store (or doc type, or table, or whatever) separately for each sink.

neboysa commented 7 years ago

Not really separate instances - there is one instance, but it is stateless, so it can behave differently on each request depending on the config. Same for data stores - they can be done just as you described: have one engine and many different configs in the engine pipeline.

Sent from my ZX Spectrum.
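A hypothetical sketch of the data-store case (the sink engine, its params and the stores below are all invented for illustration): one stateless engine, two pipeline entries, two targets:

engine_pipeline:
  batch:
    - sink:
        store: elasticsearch
        index: lint-warnings
    - sink:
        store: postgres
        table: file_stats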

ierceg commented 7 years ago

How does that work for, say, the ESLint engine, whose processes are indeed stateless but whose configuration isn't?

neboysa commented 7 years ago

It doesn't :) But with some simpler engines it works. The ESLint engine could be rewritten, but I doubt it is worth the effort.

Sent from my ZX Spectrum.

ierceg commented 7 years ago

:) Yeah, me too. I think it would be better to have lazy simply instantiate engines on demand, and multiple times, per the definitions in the chain. #42 deals with on-demand instantiation. I'll add this as a future enhancement; we don't need it right now.

neboysa commented 7 years ago

I have changed batch to bundle in the engine_pipeline section of lazy.yaml.
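For reference, a condensed sketch of how the earlier example would read after the rename (abridged, comments are illustrative):

engine_pipeline:
  bundle:                     # was batch - run engines asynchronously (in parallel)
    - file-stats:
    - sequence:
      - github-access:
      - pullreq: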