ClusterLabs / libqb

libqb is a library providing high performance logging, tracing, ipc, and poll.
http://clusterlabs.github.io/libqb/
GNU Lesser General Public License v2.1

broken `test_ipc_max_dgram_size` test needs to be reviewed #234

Open jnpkrn opened 7 years ago

jnpkrn commented 7 years ago

https://travis-ci.org/ClusterLabs/libqb/jobs/178242766#L2722

../../tests/check_ipc.c:1498:F:ipc_max_dgram_size:test_ipc_max_dgram_size:0: Assertion 'init==try' failed: init==331264, try==425728

...triggered intermittently, only with clang (3.4), upon an unrelated change.

jnpkrn commented 7 years ago
Build image provisioning date and time
Thu Feb  5 15:09:33 UTC 2015

Operating System Details

Distributor ID: Ubuntu
Description:    Ubuntu 12.04.5 LTS
Release:    12.04
Codename:   precise

Linux Version
3.13.0-29-generic

Cookbooks Version
a68419e https://github.com/travis-ci/travis-cookbooks/tree/a68419e

jnpkrn commented 7 years ago

Verified this is indeed intermittent: the above link now points to a restarted run, which passed. (Note that the offset of the respective line is +3; unfortunately I didn't grab those 3 extraneous lines while it was possible, which might have shed more light on this, supposing they were related error messages.)

jnpkrn commented 7 years ago

One possibility that is hard to rule out is that parallel matrix builds (e.g., multiple compilers) share the same /dev/shm path (are the containers set up like that?), and that this doesn't play well in some rare circumstances when similar pseudorandom paths are being accessed...

chrissie-c commented 7 years ago

Very odd. I'm not going to worry about it short-term, though it would be useful to know how the test systems are set up. Can we reproduce it with clang ourselves?

jnpkrn commented 7 years ago

No cycles to spend on trying to reproduce it, but we are now aware of this tendency in Travis CI, so we'll have at least some clues when/if it recurs.

jnpkrn commented 7 years ago

Some archeology:

jnpkrn commented 7 years ago

One more relevant hit: http://lists.corosync.org/pipermail/discuss/2013-May/002573.html

One quick thing to check is the location of your shared memory
I use travis ci for libqb and travis uses ubuntu vm's and I
know I had to do a workaround for the shared memory location
being moved from /dev/shm to /run/shm.

See: https://github.com/asalkeld/libqb/blob/master/.travis.yml

I'd suggest have a look at the output of:
mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)

df -h | grep shm
tmpfs                    3.9G  2.9M  3.9G   1% /dev/shm

and see if you need to run that workaround. (libqb tries /dev/shm
first).
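
The path-selection workaround the quoted mail refers to boils down to preferring /dev/shm and falling back to /run/shm. A minimal sketch of that idea in C, purely illustrative (the helper name `pick_shm_dir` is hypothetical and not part of libqb's API, whose own selection logic may differ):

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper, not libqb code: prefer /dev/shm and fall back
 * to /run/shm if only the latter exists and is writable. */
static const char *pick_shm_dir(void)
{
    if (access("/dev/shm", W_OK) == 0)
        return "/dev/shm";
    if (access("/run/shm", W_OK) == 0)
        return "/run/shm";
    return NULL;
}

int main(void)
{
    const char *dir = pick_shm_dir();

    printf("usable shared-memory directory: %s\n", dir ? dir : "none found");
    return 0;
}
```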

jnpkrn commented 7 years ago

Regarding the relevance to Python implied by the cookbook references above, http://stackoverflow.com/a/30175343 seems to suggest it was to solve some kind of issue with the multiprocessing module in Python's standard library.

jnpkrn commented 7 years ago

(see also #238)

jnpkrn commented 7 years ago

Diagnostic enhancement from #238 shed some more light here:

../../tests/check_ipc.c:1506:F:ipc_max_dgram_size:test_ipc_max_dgram_size:0: Assertion 'init==try' failed: init==0x50e00, try==0x67f00, i=28, errno=90

where errno of 90 means EMSGSIZE (Message too long).
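
For context on how that errno typically surfaces, here is a small illustrative probe (not the code from check_ipc.c) that grows the payload sent over an AF_UNIX datagram socket pair until the kernel rejects it with EMSGSIZE:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Illustrative probe, not libqb code: grow the payload until send() on
 * an AF_UNIX datagram socket pair fails with EMSGSIZE (errno 90 on
 * Linux), the error reported in the diagnostics above. */
int main(void)
{
    int sv[2];
    size_t sz;
    char *buf = calloc(1, 1 << 24);    /* 16 MiB scratch payload */

    if (buf == NULL || socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0) {
        perror("setup");
        return 1;
    }

    for (sz = 4096; sz < (1 << 24); sz += 4096) {
        if (send(sv[0], buf, sz, 0) < 0) {
            if (errno == EMSGSIZE)
                printf("datagram limit hit just below %zu bytes\n", sz);
            else
                perror("send");
            break;
        }
        /* drain so the receive queue does not fill up */
        (void)recv(sv[1], buf, sz, 0);
    }

    free(buf);
    close(sv[0]);
    close(sv[1]);
    return 0;
}
```

On typical Linux settings the cutoff tracks the socket's send-buffer size, which would explain why the observed limit moves around between environments.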

One of the possibilities is that some assumption that has held so far (per the previous successful test runs) is actually unreliable in practice, and some factors of the Travis environment just make it easier to prove it.

jnpkrn commented 7 years ago

Another hit:

init==0x50e00, try==0x67f00, i=40, errno=90

From the diagnostics added so far, it seems that /dev/shm is mounted as a rather small tmpfs, just 64 MB, which could be the culprit.
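
If the small tmpfs is indeed to blame, it is cheap to confirm; a sketch of such a diagnostic (again, not libqb code), assuming statvfs(3) is available:

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Sketch of a pre-test diagnostic, not libqb code: report how large
 * the /dev/shm tmpfs actually is on the build machine. */
int main(void)
{
    struct statvfs vfs;

    if (statvfs("/dev/shm", &vfs) != 0) {
        perror("statvfs(/dev/shm)");
        return 1;
    }
    printf("/dev/shm: %llu MiB total, %llu MiB free\n",
           (unsigned long long)vfs.f_blocks * vfs.f_frsize / (1024 * 1024),
           (unsigned long long)vfs.f_bavail * vfs.f_frsize / (1024 * 1024));
    return 0;
}
```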

jnpkrn commented 7 years ago

... PR #242 might help regarding this hypothesis.

jnpkrn commented 7 years ago

Just got a report of this issue occurring on virtualized s390x:

ipc_max_dgram_size:test_ipc_max_dgram_size:0: Assertion 'init==try' failed: init==331264, try==331776

A mere 495M was allocated to /dev/shm.

chrissie-c commented 7 years ago

It's testing socket buffers rather than SHM arenas, so it might be a ulimit issue. Odd that it failed there, though, because that's comparing the reported maximum with what was actually allocated!
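
As a rough illustration of why comparing a reported maximum against what actually gets allocated can fail in a constrained environment, here is a minimal SO_SNDBUF round-trip (illustrative only, not the test's code): the kernel is free to clamp the requested size, e.g. per /proc/sys/net/core/wmem_max, so the value read back can differ from the value asked for.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Illustration only, not the test's code: the kernel may clamp a
 * requested SO_SNDBUF (e.g. to /proc/sys/net/core/wmem_max) and on
 * Linux doubles the stored value for bookkeeping, so the size read
 * back rarely matches the size asked for. */
int main(void)
{
    int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
    int requested = 4 * 1024 * 1024;   /* ask for 4 MiB */
    int granted = 0;
    socklen_t len = sizeof(granted);

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested));
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &granted, &len);
    printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);
    close(fd);
    return 0;
}
```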