cloudius-systems / osv

OSv, a new operating system for the cloud.

tst-namespace failure #897

Open nyh opened 7 years ago

nyh commented 7 years ago

On the Jenkins nightly build we saw a crash in the tst-namespace test: http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1183/console

The output is:

OSv v0.24-418-g8ab5c77
eth0: 192.168.122.15
Assertion failed: sched::preemptable() (arch/x64/mmu.cc: page_fault: 33)

[backtrace]
0x0000000000225898 <__assert_fail+24>
0x00000000003872b6 <page_fault+294>
0x00000000003860a6 <???+3694758>
0x00000000003d134f <???+4002639>
0x00000000003d1636 <malloc+70>
0x0000000000291ca8 <kmem_cache_alloc+72>
0x0000000000322cc4 <???+3288260>
0x0000000000323591 <zio_write_phys+65>
0x00000000002f73b7 <???+3109815>
0x00000000002f75f8 <???+3110392>
0x00000000002f90f4 <vdev_uberblock_sync_list+116>
0x00000000002f93d4 <vdev_config_sync+276>
0x00000000002e45ef <spa_sync+1167>
0x00000000002f0684 <???+3081860>
0x00000000003e3a26 <thread_main_c+38>
0x0000000000387022 <???+3698722>

The failure is superficially similar to #790, where tst-namespace.so also crashed with a page fault, but that bug was fixed, and the (admittedly dubious) stack trace here is very different, so this is probably a different cause.
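
For context on the assertion itself: page_fault() asserting sched::preemptable() means the fault arrived while the faulting thread had preemption disabled, a context where the fault handler must not run because it may need to sleep (for example, to page data in). Below is a minimal hypothetical sketch of the pattern such an assertion catches, using OSv's sched::preempt_disable()/sched::preempt_enable() API. It is only an illustration of the failure mode, not the actual ZFS path from the backtrace above:

// Illustrative only: touching memory that can fault while preemption
// is disabled is exactly what assert(sched::preemptable()) catches.
#include <osv/sched.hh>

void bad_pattern(volatile char* lazily_mapped)
{
    sched::preempt_disable();   // enter a non-preemptable region
    *lazily_mapped = 1;         // if this access page-faults, the handler
                                // sees !sched::preemptable() and aborts
    sched::preempt_enable();
}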

nyh commented 7 years ago

I ran this test 1,000 times on my own laptop and could not reproduce this problem. Unfortunately, tonight we got another crash in a different test, tst-nway-merger (see http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1184/console); I don't know if it's related...

I don't know if this is a regression in our code that caused all the failures in recent days' tests, or whether something changed in the build environment on the Jenkins machine.

nyh commented 7 years ago

The same tst-nway-merger crash we saw today also happened in a nightly build a week ago: http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1179/console So it might be something introduced in that build's commit, 05baa614c460f7033828adf926ce0224df40fd8f, but more likely it is NOT: by that day, many tests had already been failing for 4 days because of the Jenkins machine's switch to the "gold" linker, so if this is a regression in our code, it might have been introduced up to 4 days earlier. Or perhaps the switch to the gold linker is itself causing (or exposing) these bugs?

nyh commented 7 years ago

After this test succeeded for several nightly builds, tst-namespace failed again: http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1197/ So evidently, the failure only happens sporadically... (but never on my own development machine...)

nyh commented 7 years ago

tst-nway-merger crash again: http://jenkins.cloudius-systems.com:8080/job/osv-build/1275/console

nyh commented 7 years ago

And namespace: http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1206/console :-( Houston, we have a problem....

nyh commented 7 years ago

Another tst-namespace failure in tonight's build: http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1222/

nyh commented 7 years ago

Another tst-namespace crash, in a different place in the ZFS code. This time it is a general_protection(), not a page_fault(). http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1243/console

nyh commented 7 years ago

Another nightly failure, page fault in ZFS code in tst-namespace.so. http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1247/console

nyh commented 7 years ago

Another nightly failure: http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1254/console

nyh commented 7 years ago

And another: http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1257/console

nyh commented 6 years ago

After a long time of not seeing this bug, I was starting to hope that maybe it went away as mysteriously as it came, but unfortunately, it didn't: http://jenkins.cloudius-systems.com:8080/job/osv-build-nightly/1308/console

nyh commented 6 years ago

Failed again: https://jenkins.scylladb.com:8443/job/osv-build-nightly/1317/console

wkozaczuk commented 6 years ago

While working on unit tests for the read-only FS, I noticed that each test is run independently with the "--unsafe-cache" option, which maps to the qemu options cache=unsafe,aio=threads. I think this speeds up QEMU startup and shutdown dramatically, but at the cost of ignoring flush instructions from the guest. Per the QEMU documentation (https://www.suse.com/documentation/sles11/book_kvm/data/sect1_1_chapter_book_kvm.html):

"cache = unsafe This mode is similar to the cache=writeback mode discussed above. The key aspect of this unsafe mode, is that all flush commands from the guests are ignored. Using this mode implies that the user has accepted the trade-off of performance over risk of data loss in the event of a host failure. Useful, for example, during guest install, but not for production workloads."

Given that all unit tests operate on the same ZFS image, and some of them modify data on the filesystem, could this be leaving behind corrupt ZFS state that OSv then tries to repair on the next unit test?
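
For reference, here is a sketch of the qemu drive configuration this option implies. The exact flags OSv's run script passes, and the image path, are assumptions for illustration; only cache=unsafe,aio=threads is from the above:

# with --unsafe-cache: guest flush commands are ignored
qemu-system-x86_64 ... -drive file=build/release/usr.img,if=virtio,cache=unsafe,aio=threads
# safer alternative that honors guest flushes:
qemu-system-x86_64 ... -drive file=build/release/usr.img,if=virtio,cache=writeback,aio=threads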

nyh commented 6 years ago

Failed again: https://jenkins.scylladb.com:8443/job/osv-build-nightly/1334/console

@wkozaczuk about your guess: I would hope (but have never fully verified) that even with "cache=unsafe", qemu does flush everything when it shuts down. As you can see in that documentation, they suggest using it during "guest install", for example. And if that were the problem, why would we always see it in the same test and not in other tests?

nyh commented 6 years ago

Failed again: https://jenkins.scylladb.com:8443/job/osv-build-nightly/1342/console

nyh commented 6 years ago

Failed again, in a similar way but (not for the first time) in tst-nway-merger: https://jenkins.scylladb.com:8443/job/osv-build-nightly/1346/console

nyh commented 6 years ago

Another failure. Unlike previous crashes, I don't see an allocation in the stack trace. But it's still in ZFS. https://jenkins.scylladb.com:8443/job/osv-build-nightly/1348/console