eslint / eslint

Find and fix problems in your JavaScript code.
https://eslint.org
MIT License

Lint multiple files in parallel [$500] #3565

Open ilyavolodin opened 8 years ago

ilyavolodin commented 8 years ago

This is a discussion issue for adding ability to run eslint in parallel for multiple files.

The idea is that ESLint is mostly CPU bound, not IO bound, so creating multiple threads (on machines with multiple cores) might (and probably will) increase performance in a meaningful way. The downside is that ESLint's codebase is currently synchronous, so this would require rewriting everything up to and including eslint.js to be asynchronous, which would be a major effort.

I played with this a little while ago and found a few Node libraries that handle thread pools, including detection of the number of cores available on the machine.

If anyone has experience writing multithreaded applications for Node.js and would like to suggest alternatives or comment on the above, please feel free.

P.S. https://www.airpair.com/javascript/posts/which-async-javascript-libraries-should-i-use


BYK commented 8 years ago

Nice. I'm very interested in trying this myself too.

ilyavolodin commented 8 years ago

Another question I had in mind: if we need to rewrite everything to be async, should we use the callback pattern? Promises? If so, which library: Q or Bluebird? I personally would prefer promises to callback hell.

IanVS commented 8 years ago

I vote for promises. Bluebird is fast, but makes me nervous because it adds methods to the native Promise, which is likely not a great idea. I think Q might be the best bet. There is also Asequence, but I have no personal experience with it.

gyandeeps commented 8 years ago

Why not use built in promises? Just a question as I have no experience with promises yet.

IanVS commented 8 years ago

They're not supported in node 0.10, to my knowledge.

Besides that, the libraries give some nice "sugar" methods when working with Promises.

btmills commented 8 years ago

I've had plenty of success using native promises (or a polyfill when native promises aren't supported). That seems like a good starting point to me; if we need more than they provide we could probably swap out something that's API-compatible.

nzakas commented 8 years ago

I think we're putting the cart before the horse here. Let's hold off on promises vs. callbacks until we're at least ready to prototype. Get something working with callbacks and let's see how bad it is (or not).

lo1tuma commented 8 years ago

The idea is that ESLint is mostly CPU bound, not IO bound

ESLint also does a lot of IO (directory traversal, reading source files). So I think we would also profit here if we rewrite eslint to do non-blocking IO.

ilyavolodin commented 8 years ago

@lo1tuma I haven't profiled it yet, but in my mind, the amount of IO we do is negligible compared to the number of CPU cycles we eat. I will try to profile it and post results here if I get anything meaningful.

pnstickne commented 8 years ago

Using something like NodeClusters - or most other per-process implementations - would avoid the issue of needing to [vastly] rewrite ESLint. (Such implementations are strictly not threading, but allow managed parallel process execution.)

It would mostly just need to IPC-in/out the current ESLint; ESLint parallelism would then be able to work freely over different files in a per-process manner, but it could not (without more rework) run concurrently over a single file.

Thus if the goal is to run ESLint over different files in parallel I would urge such a 'simple' per-process concurrency approach. If the goal is to make ESLint parallel across the same source/AST then .. that is a more complicated can of worms as it changes the 'divergence' point.

If there is a strict v0.10 Node target for ESLint, maybe offer this as a feature when running a compatible Node version.

mysticatea commented 8 years ago

My idea is:

Source files vary in size, so I think pull-style queue control (workers pull the next file when they are ready) is important.

Library.... I think child_process.fork is enough.

gyandeeps commented 8 years ago

Sorry for this question, but based on all the comments: do we think the reward of this functionality is worth the effort/changes/complications? (Like I said, just a question.) Or is this functionality too early (like Google Glass) to implement without any actual estimates or demand?

pnstickne commented 8 years ago

@gyandeeps My projects are not big enough, nor my computer slow enough, for me to really care either way.

In cases where there are sizable projects with many files, on computers with several cores and unbound I/O, I imagine it could lead to significantly reduced wall-clock time, approaching Amdahl's law. I would be less optimistic about this gain with fewer, larger files, even with 'actual threading' or post-AST handling, but that is what performance profiles are for.

Of course, another option is to only lint 'changed' files and provide some sort of result cache, but that comes with additional data management burdens.

ilyavolodin commented 8 years ago

@gyandeeps To answer your question - we do not have definitive information on this. Right now my assumption is that we are CPU-bound, not IO-bound. In that case, utilizing more cores should have a significant impact on larger projects (as @pnstickne mentioned, the impact will be small, or might even be negative, on a few large files). I think the first step would be to prove or disprove my assumption on CPU vs. IO. Granted, if it turns out I'm wrong and we are IO-bound, then changing ESLint's code to async would improve performance anyway.

@pnstickne Thanks for the insights. I'm really not familiar with NodeClusters; I just know that they exist and do not require everything to be async. We are definitely not going to try to make ESLint run AST analysis in parallel: that would be a huge change, and a lot of our rules expect nodes to be reported in the right order, so we would not only need to rewrite pretty much the whole core, we would also need to rewrite a lot of rules. I would be interested in learning more about how we could incorporate code that would only execute on Node 0.12 and not on earlier versions. I'm not sure that's the approach I would like to take, but it's an interesting option nevertheless.

ilyavolodin commented 8 years ago

So I did some very unscientific performance analysis on both Windows and OSX. I specifically chose to lint a very large number of files (TodoMVC, 1597 files, 854 folders). On Windows, the results were very inconclusive: my CPU usage was never over 15% (on an 8-core machine, none of the cores were overtaxed at any point), but my I/O never hit anything over 10MB/s on an SSD that's theoretically capable of 550MB/s, so I have no idea why it didn't run any faster. On OSX it never hit more than 15% CPU either (I have no idea how many cores my MacBook has, probably 8 too), but I/O was pretty taxed; it looked like a few spikes reached 100% disk throughput. So maybe my assumption was wrong and we are not CPU bound?

pnstickne commented 8 years ago

@ilyavolodin Try this:

Start running 8 different eslint processes (over the same set of files should be fine, although it would be 'more fair' to break up the files into 8 different equally-sized sets).

Compare the wall-clock times it takes for 1 process to complete and 8 processes to complete.

The 8 processes will have done 8x the work (if using the same set of files for each process as for the single process) or the same amount of work (if having split the source files among them), but in how much x the time?

This should very crudely show an approximate gain, if any, from using multi-process concurrency.

platinumazure commented 8 years ago

Late to the conversation, but... Is anyone opposed to someone starting up a pull request to implement the callback hell approach (i.e., make ESLint async across the board, including for all I/O operations)? Seems like it would make the eventual parallelization easier anyway.

IanVS commented 8 years ago

Of course another option is to only lint 'changed' files and provide some sort of result cache, but comes with additional data management burdens. -@pnstickne

This is essentially what ESLinter from @royriojas does.

pnstickne commented 8 years ago

@IanVS That's pretty cool .. now if only it was built-in to my eslint grunt task :}

(Okay, I could get it shimmed in pretty easily, but it'd still be nice to see a 'done package'.)

ilyavolodin commented 8 years ago

@pnstickne #2998 @platinumazure I think I would like to first prove that there's something to gain by going async/multithreaded, just so nobody wastes their time creating a very large PR that then gets closed because there is no gain.

platinumazure commented 8 years ago

@ilyavolodin That's fair enough. I'm wondering, though, would it be worth creating a separate issue with the goal of making ESLint's I/O operations asynchronous? Synchronous I/O is a bit of an anti-pattern in Node and it might help us figure out just how much of a breaking change parallelization would likely be.

ilyavolodin commented 8 years ago

@platinumazure While sync code is a Node anti-pattern, in our case we can't do anything with the code (as in parse it into an AST) until we've read the whole file. So if improving performance is off the table, changing the code to async would increase its complexity without gaining us anything. It's worth testing out, and I still think there's a performance gain we can get from parallel execution, but I would like some proof of that first.

lo1tuma commented 8 years ago

@ilyavolodin reading files asynchronously doesn’t mean you read them chunk-by-chunk. While you are waiting for one file to be read, you can lint a different file which has been read already.

nzakas commented 8 years ago

Wall clock time is the only metric that matters for this evaluation. We have had people report they were using ESLint on projects with thousands of files, so if there's a benefit, it will be seen.

The thing to keep in mind is that directory traversal must be done before linting can begin. Thus, we get zero benefit from switching to an async directory traversal. We need to build up the list of files to lint and then fork off to start linting, at which point each process is doing a file read, and there's no benefit to async there either.

Overall, I'd just like to squash the "rewrite to make things async" approach. We should be able to make a change in CLIEngine to do this without a major rewrite.

ilyavolodin commented 8 years ago

Ok, I think I finally figured out a good way to check whether ESLint is CPU bound vs. IO bound. I ran eslint on the same set of files (544 files in 234 directories), first with just one rule on (`yoda: [2, "never"]`) and then with the .eslintrc file from the ESLint repository (so a lot of rules on). As far as I can tell, the amount of IO should be exactly the same in both cases, but the number of CPU cycles will be significantly different. This doesn't prove that IO is not a problem, but at least it's something. The numbers were as follows:

Test #1: 9.630s
Test #2: 1:27.880s

Based on the dramatic difference in performance, I want to say that we are not IO bound. The next step would be to create a threaded prototype of some sort and see what the difference in performance is.

pnstickne commented 8 years ago

I'm still for 'per process' concurrency.

ilyavolodin commented 8 years ago

Ok, I created a very simple prototype of a "process pool" that spawns a bunch of processes based on the number of cores available and sends files to those processes for linting. I had to turn off reporting (formatters are not invoked), but here are the numbers: single-process linting of 544 files in 234 directories took 9.585 seconds; the same files took 4.168 seconds across 8 processes (I have 8 cores). I still need to do some major cleanup on the "process pool", and everything up to and including CLIEngine needs to be turned into async code, but it looks like it might be worth it.

royriojas commented 8 years ago

@ilyavolodin, is your prototype public? I would love to see it. Currently the cache module uses statSync for simplicity's sake, but it would be nice to see if that can be done async as well.

ilyavolodin commented 8 years ago

@royriojas Not yet. I only wrote a pooling library for child_process, and even that needs some major cleanup before I would be comfortable posting it anywhere:-) I didn't convert anything to async code yet; I just added process.exit(0) once all of the linting is done (so no reporting of any kind for now). I also commented out all of the cache code for this test, because it was hard to find a spot to plug my code in with caching on.

akras14 commented 8 years ago

Just wanted to chime in. We have a medium size project, linting 1370 (larger) files with a good number of rules in a powerful VM, and it takes about 30 seconds. It's manageable, but painful.

ilyavolodin commented 8 years ago

Here's an update on where this is at:

  1. I have created ProcessPool library and it's checked in to multithreading branch
  2. In order to get this to work properly with the existing codebase, cli and cli-engine need to be refactored. cli-engine currently holds the main linting function, executeOnFiles. This is the main entry point, and it should be refactored to allow for the creation of another entry point, executeOnFilesAsync. To do that, the function should be streamlined to have a single entry point and a single exit point. All of the helper functions should be refactored to be just that: helpers that take something in and return a value, with the caller branching on the return. I've tried to refactor this multiple times and hit a wall every time. To make my point clear, here's what the current call stack looks like:
cli.js - execute
  cli-engine.js - executeOnFiles
    cli-engine.js - executeOnFile
      cli-engine.js - executeOnText
        cli-engine.js - processText
          eslint.js - verify

Here's what it should look like to make async functions easy:

cli.js - execute
  cli-engine.js - executeOnFiles
      eslint.js - verify

Everything that was in between should be made into helper functions. Since spawned processes can only execute a module, that module should be as simple as possible, preferably just `return eslint.verify(text, config, filename)`.

nzakas commented 8 years ago

Aren't executeOnFile, executeOnText, and processText already helpers?

ilyavolodin commented 8 years ago

They are not; they are part of the main call stack. executeOnFiles calls executeOnFile, which calls executeOnText, and so on down to verify. The point is that to make a function async, the depth of the call stack has to be minimal; otherwise, every function in the stack has to be made async as well.

nzakas commented 8 years ago

Sorry, I'm still confused. Let's start by defining terms. What do you mean by "helper"?

ilyavolodin commented 8 years ago

A helper function (in my own made-up definition) is a function that takes a set of arguments and returns a value; it can call other functions, but it cannot be on the main execution path of the code. For example, in the current version of cli-engine I consider calculateStatsPerFile, calculateStatsPerRun, createIgnoreResult, etc. to be helper functions: they are called by a function on the main execution path, but they are not part of it. Helper functions can be kept sync even when writing async code, but functions on the main execution path have to be async. Basically, if you were to run eslint in a debugger and put a breakpoint in the eslint.verify method, anything displayed in the stack trace when you hit that breakpoint has to have an async counterpart. I'm trying to limit the number of async functions that need to be created, and at the same time not have a lot of logic in those functions, so I wouldn't have to duplicate that logic between sync/async counterparts.

nzakas commented 8 years ago

Got it, thanks for explaining

royriojas commented 8 years ago

hi @ilyavolodin,

So here is my take: I do agree with you that we should probably move the cache code to a better place than where it is now (executeOnFiles). In a perfect world the process would be something like this:

In the main process:

In the forked process

In the main process

I don't know if this will be feasible, but I guess it will make the process simpler in order to also use the cache.

Let me know what you think about it.

ilyavolodin commented 8 years ago

@royriojas That's exactly the way I was thinking about it. It's a bit more complicated, though, since it's not just the cache code that needs to be refactored. Currently there are too many things happening on the main path of execution (for example, the list of files contains just filenames; for each one a config needs to be constructed, the content of the file has to be read, etc.). If all of those things are turned into helper methods, doing parallel linting will actually be pretty straightforward.

royriojas commented 8 years ago

@ilyavolodin I see, so instead of a list of files it's a list of objects with the file path and the config object already inferred, right?

We already take the config into consideration for the cache, so once the methods are refactored, adding the cache back shouldn't be an issue. I would say if you have a working branch for this, just remove the cache code; we can add it back in the right place.

My only concern is issue #4607, since it requests moving the cache inside the executeOnText method and you're suggesting it should be outside of it. Maybe we need a wrapper that can also include the cache, so the original remains untouched?

ilyavolodin commented 8 years ago

Ideally, what I think should be done is this: the cache should be turned into a utility and moved into its own module. Each publicly exposed method should do one operation only (for example, given a config object and a file name, determine whether the file is already cached and return true/false). This would allow caching to be used in async and sync executeOnFiles, as well as in executeOnText, without any duplication of code.

royriojas commented 8 years ago

@ilyavolodin I see. The cache is currently a very small amount of code, so moving it to a helper shouldn't be complicated. If #4607 is accepted, I can do that as part of the work on that issue.

ilyavolodin commented 8 years ago

This looks like it would actually be great for parallel linting: https://github.com/SyntheticSemantics/ems However, I don't think we can use it, since it's a native extension and currently only works on Linux/OSX:-(

elodszopos commented 8 years ago

Is there any progress going on in here? I would love being able to split up linting to share the load among cores.

ilyavolodin commented 8 years ago

@elodszopos No, sorry, no progress. This feature was moved out of v2 timeline and will come in at a later date.

AGhost-7 commented 7 years ago

For those looking for a temporary solution I wrote this which works for my purposes:

Master process: https://github.com/AGhost-7/dyna-table/blob/master/linter-run.js

Worker processes: https://github.com/AGhost-7/dyna-table/blob/master/linter-worker.js

Makes a very big difference with my quad-core processor.

ilyavolodin commented 7 years ago

@AGhost-7 Thanks for sharing. Very nice! The only problem is that you are bypassing our glob resolution and all of the CLI switches, but I'm sure for your use case it's perfectly fine. In our case, we need to continue supporting all CLI switches as well as glob resolution, so it's a bit more complicated.

P.S. Edge case, but since you are loading forked processes with 10 files at a time, you might get into a situation where the first 10 files are all huge and the second 10 files are all small, so the second process will finish much faster than the first one. You can take a look at https://github.com/eslint/eslint/blob/multithreading/lib/process-pool.js; it should balance execution equally between all of the processes (more or less).
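The edge case in that P.S. is easy to show with a toy model. The numbers below are made up (think of each cost as a file's lint time); batchTime models fixed batches handed out up front, greedyTime models per-file dispatch to the least-loaded worker:

```javascript
// Fixed batches: each worker gets a contiguous chunk up front; wall-clock
// time is the slowest chunk.
function batchTime(costs, batchSize) {
  const batches = [];
  for (let i = 0; i < costs.length; i += batchSize) {
    batches.push(costs.slice(i, i + batchSize).reduce((a, b) => a + b, 0));
  }
  return Math.max(...batches);
}

// Per-file dispatch: each file goes to whichever worker has the least work.
function greedyTime(costs, workers) {
  const load = new Array(workers).fill(0);
  for (const cost of costs) {
    load[load.indexOf(Math.min(...load))] += cost;
  }
  return Math.max(...load);
}

const costs = [9, 9, 9, 9, 9, 1, 1, 1, 1, 1]; // five huge files, then five tiny
batchTime(costs, 5);  // 45: one worker got all the huge files
greedyTime(costs, 2); // 27: the load is spread far more evenly
```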

AGhost-7 commented 7 years ago

Yeah, it's just a quick throw-together to spread the workload for now. Far from perfect.

P.S. I think the biggest file I have is around 300 lines anyway.

drifkin commented 7 years ago

@nzakas I'm looking into some perf stuff involving an automation tool that uses eslint. Earlier in this thread you mentioned:

The thing to keep in mind is that directory traversal must be done before linting can begin. Thus, we get zero benefit from switching to an async directory traversal.

Is this due to how reporters work, or is this strict ordering (full directory traversal before linting) due to something else?

(By the way, thanks so much for building such a great tool!)

nzakas commented 7 years ago

@drifkin we have to calculate the configuration for each file before it can be linted. So we have to go through the trouble of looking for all the configuration files, starting in the directory of the file to be linted and working our way back up the directory structure. At that point, you've calculated the configuration for everything in that directory as well.

Then we have to apply the .eslintignore filter over each potential file to lint.

My main goal here was to discourage a complete rewrite of our CLI for questionable gains (it's complicated enough as it is). Maybe there's some intricate way to interweave directory traversal and linting to take advantage of async, but I don't see it. In general, I don't think async is useful for CLI utilities.

AGhost-7 commented 7 years ago

You don't need to have the workers calculate the configs. Workers shouldn't do the management, IMO; in fact, I believe the better way is to keep workers "dumb". What I'm thinking is sending, for each file to lint, a message containing the processed configuration and the file's absolute path.