daisy / ace

Ace by DAISY, an Accessibility Checker for EPUB
MIT License
77 stars 22 forks source link

node process recurrent locks in production, urgent! #286

Open laudrain opened 5 years ago

laudrain commented 5 years ago

Please provide the following details if possible or relevant:

Using Ace from the command line on the production server, there are recurrent locks of the node.exe process, at least once a day. Our team has to kill the process for next validations to run on following EPUBs. Attached is the screenshot of node.exe process stopped.

image001

Any clue or checks we should try to figure out what's going wrong ?

rdeltour commented 5 years ago

Thanks for the report @laudrain. @danielweck does it ring any bell? Would you have time to look into this?

danielweck commented 5 years ago

I see an executable hierarchy: prowin32.exe > java.exe > node.exe. What does that represent?

danielweck commented 5 years ago

The official Ace 1.1.1 release is based on the master branch, which uses old versions of package dependencies (such as the Winston logger). It would be interesting to see if the more up-to-date Ace "core" lib stack used by AceApp (i.e. ace-gui repository) causes the same problems. See this Pull Request: https://github.com/daisy/ace/pull/236/files

Unfortunately, at the moment there is no unofficial build of the Ace "core" (more specifically, Ace CLI) which ships in AceApp. This would allow advanced users / early adopters to run tests and compare.

Well, in addition AceApp 1.0.0-rc.1, there are also "nightly" builds (compiled directly from the master branch), so this could in principle be used, but in practice running the Ace CLI from this bundle is really convoluted: https://github.com/daisy/ace-gui/releases

So, I think "we" (i.e. the Ace dev team at large) should put into place a "next" NPM release tag of Ace "core" (e.g. 1.2.0-alpha.1, based on the aforementioned Pull Request) alongside the "latest" NPM tag (i.e. 1.1.1, based off the master branch), so that people can try the updated Ace CLI.

Romain, what do you think?

rdeltour commented 5 years ago

So, I think "we" (i.e. the Ace dev team at large) should put into place a "next" NPM release tag of Ace "core" (e.g. 1.2.0-alpha.1, based on the aforementioned Pull Request) alongside the "latest" NPM tag (i.e. 1.1.1, based off the master branch), so that people can try the updated Ace CLI.

Yes, sounds good to me. "Beta", even.

danielweck commented 5 years ago

...there are recurrent locks of the node.exe process, at least once a day. Our team has to kill the process... Attached is the screenshot of node.exe process stopped.

Interesting. So the process just waits indefinitely ... sounds like the culprit could be Winston (the file / console logger) awaiting the termination of a write stream. Perhaps this is caused by concurrent file access. Windows 2008 R2 is a pretty specific environment, but generally-speaking (from development experience) the Windows filesystem(s) can cause such problems.

https://github.com/daisy/ace/blob/d5e7be6a084c766da1ec23f40622fb6dd403f323/packages/ace-logger/src/index.js#L43-L72

Once again, note that the updated Ace branch (used by Ace App / ace-gui) utilizes an updated strategy to wait for file stream closure (in order to work around a Winston bug):

https://github.com/daisy/ace/blob/72047a2423c17bf05f6621d418e1dc0668eb0e83/packages/ace-logger/src/index.js#L11-L38

...perhaps this could improve things.

danielweck commented 5 years ago

Related (identical?) issue: https://github.com/daisy/ace/issues/284

laudrain commented 5 years ago

The error massage in #284 was produced by an interactive attempt to run Ace from a terminal command line. It showed the that the ace.log file was locked by the node.exe process still running (locked). But this doesn't explain why this node.exe process was locked.

laudrain commented 5 years ago

@danielweck your proposal for improvement means that we should wait for a new release?

danielweck commented 5 years ago

It showed the that the ace.log file was locked by the node.exe process still running (locked). But this doesn't explain why this node.exe process was locked.

Concurrent access to the log file can explain why the Winston logger fails to flush its internal data (string messages), in order to subsequently allow the CLI app to terminate gracefully.

Another cause (instead of concurrent file access) might be hiccups in the Windows filesystem (although in your case, I suspect NTFS? ... or is it a shared network drive, or a virtual machine of sorts?).

danielweck commented 5 years ago

@danielweck your proposal for improvement means that we should wait for a new release?

The current design of Ace "core" (CLI, as well as the modules used in Ace App / ace-gui) is not compatible with concurrent launches / parallel accessibility evaluations. But I also think that sequential launches can be affected by edge-case bugs related to the logging mechanism (for example, file handle released too soon, or as per your reports: not released at all / indefinite wait).

In fact, I disabled file logging in the Windows unit-tests, because of similar problems! https://github.com/daisy/ace/blob/72047a2423c17bf05f6621d418e1dc0668eb0e83/packages/ace-logger/src/index.js#L42

So, I think that the top priority should be to refactor the file logger / Winston integration. I think a LRU (Least Recently Used) "log rotate" strategy would work well (to avoid the excessive proliferation of log files in a given output folder). Each Ace launch would "own" its dedicated log file.

...and yes, I am taking on this development task :)

laudrain commented 5 years ago

@danielweck as we are still experiencing the issue, i have to make it clear that the main problem is node.js frizzing! it happens regularly (circa once per day) and the process has to be killed and relaunched, files to be put back in the queue.

The screenshot in the first comment of this issue illustrated exactly what happens:

laudrain commented 5 years ago

@danielweck the other issue#284 about ace.log file locked comes from wrong process killing when tentatively trying to stop this node.js freeze. I'll close #284.

laudrain commented 5 years ago

any idea about this node.js frizzing ?

danielweck commented 5 years ago

Hello Luc, could you please try npm install @daisy/ace@next (which is currently equivalent to npm install @daisy/ace@1.2.0-beta.2).

I imagine you install Ace globally (using the --global / -g NPM command line argument)? If so, you might need to uninstall your current version 1.1.1 (npm -g uninstall @daisy/ace and npm -g uninstall @daisy/ace-cli).

Version 1.2.0-beta.2 corresponds to this Pull Request: https://github.com/daisy/ace/pull/236

danielweck commented 5 years ago

the process has to be killed and relaunched, files to be put back in the queue.

Could you please explain how you use Ace? Do you launch a Node process per Ace CLI evaluation, or do you have a smarter wrapper (NodeJS script) that launches multiple evaluations in a sequence inside the same process?

laudrain commented 5 years ago

@danielweck Ace is launched through a batch command invoking the CLI version a)of Ace (see in the screenshot the highlighted cmd.exe).

danielweck commented 5 years ago

So, this batch script invokes a single Node process per Ace evaluation (i.e. a single Ace CLI launch), and the evaluations are sequenced, right? (no parallel concurrent executions)

This is a very important distinction, because in my test on Windows 10 Pro, I was able to reproduce bugs in PowerShell, GitBash, etc. consoles, where launching multiple Ace CLI instances / processes seems to be problematic (despite the completely isolated / individual log files, in the beta version)

laudrain commented 5 years ago

There are no parallel concurrent executions. Even in that case, we have seen a node.js freeze. For instance during the test last Sunday, node.js did freeze, and the server was doing only EPUB checking.

danielweck commented 5 years ago

Any news / positive developments @laudrain ?

Note that Ace 1.2.0-beta.3 has been published (npm install @daisy/ace@next): https://www.npmjs.com/package/@daisy/ace/v/1.2.0-beta.3 Notable changes: updated NPM package dependencies, including Axe 3.4.0 (but no new features or bug fixes in this intermediary release)

Source code: https://github.com/daisy/ace/pull/236 https://github.com/daisy/ace/tree/feature/axe-runner-electron

laudrain commented 5 years ago

@danielweck the production server has been upgraded in CPU and memory 10 days ago. We did not change Ace version and stayed in 1.1.1. We have been since monitoring freezing node.js events and none occurred. We think this issue is fixed, it was a configuration issue.

laudrain commented 5 years ago

On the Windows 2008 server, memory has been upgraded to 32Go. It is a 64bits system. There are less node.js lock but still some, with disturbing impact on our control chain. Do you know any way to report issues to node.js developer team ?

danielweck commented 5 years ago

Hello Luc, thank you for keeping us updated. What version of Ace CLI are you using on the Windows 2008 server where such problems occur?

laudrain commented 5 years ago

@danielweck Ace 1.1.1 Where should look to find clues about these freezes ?

danielweck commented 5 years ago

Hello, are you able to reproduce these problems with Ace CLI version 1.2.0-beta.5? https://www.npmjs.com/package/@daisy/ace/v/1.2.0-beta.5 (npm install @daisy/ace@next)

laudrain commented 5 years ago

We haven't checked thoroughly this beta on our test server. And on that test server, we never experienced such freeze issues...

laudrain commented 5 years ago

@danielweck I try to get from our internal integrators any system information (log, ...) for you when Ace freezes. What would be useful and where to find it in Ace installation?

laudrain commented 5 years ago

@rdeltour @danielweck still another freeze tonight ! We have deactivate Ace checks... Here attached the files we have gathered on the context.

The json file seems to be complete and it ends with a "pass": 9782246819325.json.zip

The launch command:

node "D:\logiciels\daisy-1.1.1\node_modules\@daisy\ace\bin\ace.js" -s "O:\vtar\Prod\controles\Attente_Verif\gsoft_office\9782246819325\9782246819325-E\EPUBREF\9782246819325.epub"

The ace.log file (see last timestamp "timestamp":"2019-12-04T02:14:23.248Z" ace.log

And a screenshot of the server processes taken today around 11:00, showing the node.js process started at 03:14 and freezed. image001

Would you mind reopen the issue?

danielweck commented 5 years ago

Hello Luc, yes we can of course re-open this issue, but please bear in mind that if the bug only occurs with Ace CLI version 1.1.1 - and not version 1.2.x (currently still published on NPM as a pre-release) - then it is likely that the fix basically consists in upgrading to the latest Ace CLI, which includes many updates of library dependencies, such as the logging framework, the Puppeteer Axe runner, etc. (which may be the reason why 1.1.1 suffers from issues).

This is why it would be great if you could try Ace CLI 1.2.x on the Windows server that is causing issues with 1.1.1. Otherwise - just looking at the Ace CLI logs - I do not think we will be able to diagnose the cause of process freezing. Personally, I suspect this has something to do with the logging framework and the filesystem, which is why I strongly recommend upgrading to Ace 1.2.x as there have been improvements in this area.

danielweck commented 5 years ago

Is NodeJS still version 8.8.1.0, and Windows 2008 R2? Is there a possibility to upgrade Node to LTS 12.13.1? https://nodejs.org

laudrain commented 5 years ago

@danielweck thanks, I did forward your advice.