clinicjs / node-clinic

Clinic.js diagnoses your Node.js performance issues
https://clinicjs.org
MIT License
5.68k stars 125 forks source link

test error: 'could not converge HMM model" #410

Open goto-bus-stop opened 5 years ago

goto-bus-stop commented 5 years ago

This error came up on CITGM: https://ci.nodejs.org/view/Node.js-citgm/job/citgm-smoker/1773/nodes=debian8-64/testReport/junit/(root)/citgm/clinic_v4_0_0/ (Ctrl+F for "not ok")

It seems flakey: there are multiple tests in that same suite that do almost the same thing, but only one of them failed. I also can't locally reproduce it.

We figured it may help to increase the test duration: https://github.com/nearform/node-clinic/pull/145 Since then, I don't believe it's happened again yet, but it's hard to know if that means it's fixed :smile:

AndreasMadsen commented 5 years ago

Hard to say without knowing the data. My guess is that the program runs for such a short time, that no GC is really happening and thus the differences between the two patterns it is trying to find is very small. This causes some minor issues converting convergining within 100 iterations.

Is this actually likely to be fixed by having a larger sample?

Yes, GC is more likely to happen. The patterns will be easier to detect from more data.

Or is there some other tweak that may likely fix this?

We can tweak the tolearance threshold, in https://github.com/nearform/node-clinic-doctor/blob/10432d8839e223f27313d88dc8b90ca099fae041/analysis/analyse-cpu.js#L62 but I think it is a bad idea to tweak things based on CI results. Any statisticical appraoch is always going to have failure points, otherwise it is determinisme i.e. not statistics. If those failure points only exists in CI, that is unbelievably amazing. I think it is better to make the CI tests more realistic than it is to make the error conditions less strict.

goto-bus-stop commented 5 years ago

That's what I was afraid of :( I agree that tweaking that condition would be bad, IIRC the CITGM failure came up with a tolerance value of like 0.13 which is way too big a difference. It's a bit confusing still to me, because for this particular test we used autocannon for 1sec on a web server that immediately replies with 'ok', but some of our other tests are only a setTimeout() with no load at all and they don't crash. I'd expect both to have fairly similar GC activity (none at all).

we should probably exit with a "Not enough data" message instead of crashing with the HMM error then, and extend the tests