dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

AlertGenerator test can take 1 hour+ (and fail) #2238

Closed stuartw closed 12 years ago

stuartw commented 13 years ago

See http://dmwm.cern.ch:8080/job/WMCore-py2.6-mysql/87/testReport/WMComponent_t.AlertGenerator_t.Pollers_t.System_t/

(look at ContinuousIntegration to get a login)

Any idea what's going wrong?

Also see a current build which is also hanging (it will probably have finished by the time you read this):

http://dmwm.cern.ch:8080/job/WMCore-py2.6-mysql/92/console

zdenekmaxa commented 13 years ago

maxa: No, I haven't got the faintest clue.

I want to dig deeper, but would appreciate some help with running the tests on the machine.

zdenekmaxa commented 13 years ago

maxa: I have some more evidence that run 87 may have failed because of interference from run 88 (89 was OK again). Also, the still-running build 92 has been showing this for hours:
{{{
WMComponent_t.AlertGenerator_t.Pollers_t.Couch_t.CouchTest.testAlertGeneratorCouchDbSizePollerNoAlert -- testAlertGeneratorCouchDbSizePollerNoAlert (WMComponent_t.AlertGenerator_t.Pollers_t.Couch_t.CouchTest) ... ok
}}}

Is it possible that 87 and 88 ran in parallel, while that is not the case for the current 92 and the future 93? It would be extremely helpful to know which particular test it is hanging on now. And even if it really is WMComponent_t/AlertGenerator_t/Pollers_t/System_t.py (the link only shows the last successful one), I would not understand why it is happening unless I can reproduce it, ideally on the jenkins machine.

stuartw commented 13 years ago

swakef: They are run one at a time; however, it looks like something isn't cleaning up properly:
{{{
jenkins 13092 0.0 2.0 183268 41520 ? Sl Aug31 0:03 python setup.py test --buildBotMode=true --reallyDeleteMyDatabaseAfterEveryTest=true
jenkins 13347 0.0 1.8 183260 37432 ? Sl Aug31 0:00 python setup.py test --buildBotMode=true --reallyDeleteMyDatabaseAfterEveryTest=true
}}}

I'll try to give you access tomorrow.

PS: I dropped jenkins down to one builder agent, as it looks like 2 builds can go in parallel.

stuartw commented 13 years ago

swakef: Does this test hang indefinitely if no message is received?
{{{
# wait to poller to work now ... wait for alert to arrive
if expected != 0:
    while len(handler.queue) == 0:
        time.sleep(config.pollInterval / 5)
else:
    time.sleep(config.pollInterval * 2)
}}}
If so, can you break out of this after a reasonable time (the shorter the better) and fail the test?
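For illustration, a bounded version of that wait loop could look roughly like the sketch below. The helper name `wait_for_alert`, its parameters, and the 120-second default are hypothetical and not taken from the WMCore test code; `queue` stands in for `handler.queue` and `poll_interval` for `config.pollInterval`.

```python
import time


def wait_for_alert(queue, poll_interval, timeout=120.0):
    """Return True once `queue` is non-empty, or False if `timeout` seconds elapse first.

    Hypothetical helper: `queue` is any object supporting len().
    """
    deadline = time.monotonic() + timeout
    while len(queue) == 0:
        if time.monotonic() >= deadline:
            # Give up instead of spinning forever when no alert ever arrives.
            return False
        time.sleep(poll_interval / 5)
    return True
```

In a test, the result could then be checked with something like `self.assertTrue(wait_for_alert(handler.queue, config.pollInterval), "no alert received in time")`, so a stalled poller produces a test failure rather than a hung build.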

zdenekmaxa commented 13 years ago

maxa: Fishy things going on. After turning on more debug output (as of yesterday, the !TestInit issue #2247), I can actually reproduce this behaviour locally (equally with zmq 2.1.7 and 2.1.9, by the way). The test never hangs without the debugging output. The poller (running as a multiprocessing.Process) at some point stops polling, so the expected alert message never arrives. There is now a quick fix that fails the test if this wait exceeds 2 minutes. I will continue trying to understand it.

Patch attached, please review.

stuartw commented 13 years ago

swakef: Patch references #1981, is this the correct patch?

zdenekmaxa commented 13 years ago

maxa: Sorry, that was the wrong patch. This is the correct one: patch-series-maxa-alerts-fw-poller_system-testissues-01

zdenekmaxa commented 13 years ago

maxa: patch-series-maxa-alerts-fw-poller_system-testissues-02 (no need to apply the previous one first).

The issue is in logging from multiprocessing.Process contexts: the process remains hanging. The patch removes the troublesome logging.something calls, though a better solution is needed (#2258). Since applying it, I have not seen a poller process stall any more. Curious about the jenkins runs.

Please review.
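As a point of comparison only (this is not what the patch does, and it is not necessarily the approach intended for #2258), one common way to log from worker processes without having them touch the parent's log handlers is to send records through a queue and let a single listener in the parent write them out. The sketch below assumes Python 3.2 or later, where `logging.handlers.QueueHandler` and `QueueListener` exist; they are not available in the Python 2.6 used by these builds. The function names `make_worker_logger`, `start_listener`, and `poll` are hypothetical.

```python
import logging
import logging.handlers
import multiprocessing


def make_worker_logger(log_queue, name):
    """Return a logger whose only handler pushes records onto `log_queue`."""
    logger = logging.getLogger(name)
    logger.handlers.clear()   # drop any handlers inherited from the parent
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.setLevel(logging.INFO)
    logger.propagate = False  # do not also log through the root handlers
    return logger


def start_listener(log_queue, *handlers):
    """Start a background thread that passes queued records to `handlers`."""
    listener = logging.handlers.QueueListener(log_queue, *handlers)
    listener.start()
    return listener


def poll(log_queue):
    """Hypothetical stand-in for a poller running in its own process."""
    log = make_worker_logger(log_queue, "poller")
    log.info("polling")


if __name__ == "__main__":
    log_queue = multiprocessing.Queue()
    listener = start_listener(log_queue, logging.StreamHandler())
    worker = multiprocessing.Process(target=poll, args=(log_queue,))
    worker.start()
    worker.join()
    listener.stop()  # flushes remaining records and stops the thread
```

With this layout the worker only ever puts records on the queue; all real output happens in the parent process.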

stuartw commented 13 years ago

swakef: (In 93df74ada55df91c6266787316b11e7d9cfba327) Fail AlertGenerator tests if alert not received in a reasonable time

Fixes #2238

From: Zdenek Maxa zdenek.maxa@hep.caltech.edu
Signed-off-by: Stuart Wakefield stuart.wakefield@imperial.ac.uk

stuartw commented 13 years ago

swakef: (In 8105bdc380d77883e3c4f37a7ac8f626449e20af) Remove logging from alert pollers (causes hang).

Fixes #2238

From: Zdenek Maxa zdenek.maxa@hep.caltech.edu
Signed-off-by: Stuart Wakefield stuart.wakefield@imperial.ac.uk