Closed GreenRover closed 3 years ago
It seems two processes try to get the flock at the same time:
hhvm 31171 www-data 6uw
hhvm 30994 www-data 6ur
Killing process 30994 solves the issue for the moment. 31171 is a forked child of 30994 (PHP: pcntl_fork).
lsof | grep "/var/www/.hhvm.hhbc" | grep 'ur\|uw'
hhvm 24196 www-data 6ur REG 252,0 29359104 284931 /var/www/.hhvm.hhbc
hhvm 24196 www-data 10ur REG 252,0 29359104 284931 /var/www/.hhvm.hhbc
hhvm 24196 www-data 13ur REG 252,0 29359104 284931 /var/www/.hhvm.hhbc
hhvm 24387 www-data 6uw REG 252,0 29359104 284931 /var/www/.hhvm.hhbc
hhvm 24387 www-data 9uw REG 252,0 29359104 284931 /var/www/.hhvm.hhbc
hhvm 24387 www-data 10uw REG 252,0 29359104 284931 /var/www/.hhvm.hhbc
hhvm 24387 www-data 12u REG 252,0 1544 265587 /var/www/.hhvm.hhbc-journal
strace -p 24196
Process 24196 attached
restart_syscall(<... resuming interrupted call ...>) = 0
access("/var/www/.hhvm.hhbc-journal", F_OK) = 0
stat("/var/www/.hhvm.hhbc-journal", {st_mode=S_IFREG|0644, st_size=1544, ...}) = 0
fcntl(13, F_GETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741824, len=2, pid=24387}) = 0
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
lseek(13, 24, SEEK_SET) = 24
read(13, "\0\2B\374\0\0o\377\0\0\0\0\0\0\0\0", 16) = 16
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
access("/var/www/.hhvm.hhbc-wal", F_OK) = -1 ENOENT (No such file or directory)
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
fcntl(13, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741825, len=1}) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({1, 433000000}, NULL) = 0
access("/var/www/.hhvm.hhbc-journal", F_OK) = 0
stat("/var/www/.hhvm.hhbc-journal", {st_mode=S_IFREG|0644, st_size=1544, ...}) = 0
fcntl(13, F_GETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741824, len=2, pid=24387}) = 0
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
lseek(13, 24, SEEK_SET) = 24
read(13, "\0\2B\374\0\0o\377\0\0\0\0\0\0\0\0", 16) = 16
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
access("/var/www/.hhvm.hhbc-wal", F_OK) = -1 ENOENT (No such file or directory)
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
fcntl(13, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741825, len=1}) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({1, 434000000}, NULL) = 0
access("/var/www/.hhvm.hhbc-journal", F_OK) = 0
stat("/var/www/.hhvm.hhbc-journal", {st_mode=S_IFREG|0644, st_size=1544, ...}) = 0
fcntl(13, F_GETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741824, len=2, pid=24387}) = 0
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
lseek(13, 24, SEEK_SET) = 24
read(13, "\0\2B\374\0\0o\377\0\0\0\0\0\0\0\0", 16) = 16
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
access("/var/www/.hhvm.hhbc-wal", F_OK) = -1 ENOENT (No such file or directory)
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
fcntl(13, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741825, len=1}) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({1, 435000000}, NULL) = 0
access("/var/www/.hhvm.hhbc-journal", F_OK) = 0
stat("/var/www/.hhvm.hhbc-journal", {st_mode=S_IFREG|0644, st_size=1544, ...}) = 0
fcntl(13, F_GETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741824, len=2, pid=24387}) = 0
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
lseek(13, 24, SEEK_SET) = 24
read(13, "\0\2B\374\0\0o\377\0\0\0\0\0\0\0\0", 16) = 16
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
access("/var/www/.hhvm.hhbc-wal", F_OK) = -1 ENOENT (No such file or directory)
fstat(13, {st_mode=S_IFREG|0644, st_size=29359104, ...}) = 0
fcntl(13, F_SETLK, {type=F_WRLCK, whence=SEEK_SET, start=1073741825, len=1}) = -1 EAGAIN (Resource temporarily unavailable)
nanosleep({1, 436000000}, ^CProcess 24196 detached
<detached ...>
strace -p 24387
Process 24387 attached
restart_syscall(<... resuming interrupted call ...>) = 0
nanosleep({1, 526000000}, NULL) = 0
nanosleep({1, 527000000}, NULL) = 0
nanosleep({1, 528000000}, NULL) = 0
nanosleep({1, 529000000}, NULL) = 0
nanosleep({1, 530000000}, NULL) = 0
nanosleep({1, 531000000}, NULL) = 0
nanosleep({1, 532000000}, NULL) = 0
nanosleep({1, 533000000}, ^CProcess 24387 detached
<detached ...>
It seems to be a problem with SQLite: http://stackoverflow.com/questions/20969996/is-it-safe-to-delete-sqlites-wal-file
Is it planned to use a real DBMS instead of SQLite?
I'm not aware of any such plans. Migrating to a full-fledged DBMS seems like overkill for this.
How would you solve this deadlock issue with the SQLite WAL? It seems SQLite doesn't want to fix this issue.
In the strace above you can see a lock attempt for the non-existent WAL file.
I would say another HHVM process deleted the WAL file on exit (sqlite_close).
But I am not sure where the problem really is.
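For background on the SO link above, the WAL-file lifetime can be observed with plain sqlite3: the -wal sidecar exists while a WAL-mode connection is open and is normally removed when the last connection closes cleanly, which is what makes deleting it by hand unsafe. A minimal sketch (throwaway temp path, not the HHVM repo):

```python
import os
import sqlite3
import tempfile

# Watch the -wal sidecar appear and disappear around a WAL-mode connection.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con = sqlite3.connect(path)
con.execute("PRAGMA journal_mode=WAL")
con.execute("CREATE TABLE t(x)")
con.execute("INSERT INTO t VALUES (1)")
con.commit()

wal_while_open = os.path.exists(path + "-wal")   # sidecar exists after a write
con.close()                                      # last connection: checkpoint + delete
wal_after_close = os.path.exists(path + "-wal")

print(wal_while_open, wal_after_close)
```

If a process dies (or is killed) instead of closing cleanly, the -wal file is left behind for the next opener to recover from, which is why a stale-looking WAL file is not by itself evidence of a bug.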
OK, took a slightly closer look at this. I think the WAL file is a red herring. Here's part of the "new process" strace you gave:
open("/var/www/.hhvm.hhbc", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 6
fstat(6, {st_mode=S_IFREG|0644, st_size=29316096, ...}) = 0
lseek(6, 0, SEEK_SET) = 0
read(6, "SQLite format 3\0\4\0\1\1\0@ \0\2)x\0\0o\325"..., 100) = 100
fcntl(6, F_SETLK, {type=F_RDLCK, whence=SEEK_SET, start=1073741824, len=1}) = -1 EAGAIN (Resource temporarily unavailable)
That's all on fd 6 -- the file was opened, but we failed to lock it. Probably because another process already has the lock. The same is in your other strace -- trying to lock fd 13, and there isn't enough info in the strace to see what that fd actually is. (But all the opens on the wal/journal files fail, so it isn't trying to flock one of them, since we don't have an fd for it!)
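The failing F_SETLK above can be reproduced outside HHVM. A minimal Python sketch (throwaway file path; the byte offset is copied from the strace and is SQLite's reserved lock-byte region, not real data): one process takes an exclusive record lock, and a second process's non-blocking attempt on the same byte fails with EAGAIN.

```python
import errno
import fcntl
import os

PATH = "/tmp/hhbc-lock-demo"          # stand-in for /var/www/.hhvm.hhbc
OFFSET = 1073741825                   # start= value seen in the strace

open(PATH, "wb").close()

r1, w1 = os.pipe()                    # child -> parent: "lock held"
r2, w2 = os.pipe()                    # parent -> child: "done, you may exit"

if os.fork() == 0:
    # Child: take an exclusive lock on one byte of the lock range.
    fd = os.open(PATH, os.O_RDWR)
    fcntl.lockf(fd, fcntl.LOCK_EX, 1, OFFSET, os.SEEK_SET)
    os.write(w1, b"x")                # signal: lock is held
    os.read(r2, 1)                    # wait until the parent has tested
    os._exit(0)

os.read(r1, 1)                        # wait for the child to take the lock
fd = os.open(PATH, os.O_RDWR)
try:
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, OFFSET, os.SEEK_SET)
    result = "got lock"
except OSError as e:
    # Linux reports a held fcntl lock as EAGAIN (some systems use EACCES).
    result = "EAGAIN" if e.errno in (errno.EAGAIN, errno.EACCES) else str(e)
os.write(w2, b"x")
os.wait()
print(result)
```

This is exactly the pattern in the trace: the lock attempt fails immediately with EAGAIN, and the caller sleeps and retries.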
So my suspicion is that you eventually get a ton of contention for the flock on the file because you have so many long-running HHVM processes running. Why do you have so many separate processes? Can you instead have a single process that you make web requests to, the standard server model, so only one has the DB open at a time? HHVM manages this fairly well, but wasn't really designed to have so many separate processes running at once. (You're doing a lot of duplicate work in each!)
If you really can't do that, it's possible to specify a different repo location per process. The INI setting is hhvm.repo.central.path or hhvm.repo.local.path; I don't recall the difference or which one you are supposed to set.
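If per-process repos are the way to go, the setting can live in each job's INI file. A hedged sketch (the paths are made up, and as noted above, which of the two keys applies may depend on the HHVM version):

```ini
; INI for one cron job: give this process its own bytecode repo so it
; never contends on the shared /var/www/.hhvm.hhbc lock.
; (Which key is the right one may depend on the HHVM version.)
hhvm.repo.central.path = /var/cache/hhvm/cron-job-1.hhbc
; hhvm.repo.local.path = /var/cache/hhvm/cron-job-1.hhbc
```

The trade-off is that each process then compiles and caches its own bytecode, so the duplicate-work concern above still applies.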
Hi, I don't have that many processes: 5-6 crons and one big one with pcntl_fork, where the parent has the uw status and the first child the ur status.
All others in the lsof list are cron jobs that have been on hold for hours because of the one uw lock.
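One detail worth noting for the pcntl_fork setup: POSIX fcntl record locks are owned per process and are not inherited across fork(), so a parent and its child each have to acquire their own locks on the repo and can conflict with each other. A minimal Python sketch (throwaway file path) showing the child failing to take a lock its parent holds, even on the inherited file descriptor:

```python
import fcntl
import os

PATH = "/tmp/fork-lock-demo"
open(PATH, "wb").close()

fd = os.open(PATH, os.O_RDWR)
fcntl.lockf(fd, fcntl.LOCK_EX)        # parent takes an exclusive lock

r, w = os.pipe()
if os.fork() == 0:
    # Child: inherits fd, but NOT the lock. Its own non-blocking
    # attempt conflicts with the lock still held by the parent.
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        os.write(w, b"child got lock")
    except OSError:
        os.write(w, b"child blocked")
    os._exit(0)

os.close(w)
msg = os.read(r, 32).decode()
os.wait()
print(msg)
```

This matches the lsof output above, where parent and child show up as separate lock holders (uw vs. ur) on the same repo file.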
Can you attach to the process holding the lock with gdb and run a backtrace, to figure out what it's doing in the nanosleep loop?
Hope that will help:
...
hhvm 28664 www-data 6ur REG 252,0 14202880 264029 /var/www/.hhvm.hhbc
hhvm 28664 www-data 10ur REG 252,0 14202880 264029 /var/www/.hhvm.hhbc
hhvm 28664 www-data 13ur REG 252,0 14202880 264029 /var/www/.hhvm.hhbc
hhvm 28823 www-data 6u REG 252,0 14202880 264029 /var/www/.hhvm.hhbc
hhvm 28823 www-data 9u REG 252,0 14202880 264029 /var/www/.hhvm.hhbc
hhvm 28824 www-data 6uw REG 252,0 14202880 264029 /var/www/.hhvm.hhbc
hhvm 28824 www-data 9uw REG 252,0 14202880 264029 /var/www/.hhvm.hhbc
hhvm 28824 www-data 10uw REG 252,0 14202880 264029 /var/www/.hhvm.hhbc
hhvm 28824 www-data 12u REG 252,0 1544 268763 /var/www/.hhvm.hhbc-journal
....
gdb of 28824 http://pastebin.com/P5S4NJV5
gdb of 28664 http://pastebin.com/JvK7XjVs
gdb of 28823 http://pastebin.com/GWRqcmCE
In the meantime I will go back to HHVM 3.0, the last release that is stable for me.
http://pastebin.com/JvK7XjVs is really interesting -- it looks like we were doing a sqlite operation, took a signal, and ended up in bt_handler, which then tried to do another sqlite operation. My suspicion is that the latter hung waiting for the operation that was interrupted by the signal to complete, which of course it never can.
The signal handler only gets called on a few "bad" exceptions, so it seems likely that some other crash is the root cause.
That's probably enough info to fix the proximate issue (re-entering sqlite in a signal handler). Not sure what other info we could get to find the root cause (why we hit our "bad" signal handler in the middle of a sqlite operation).
But in HHVM 3.0 and before I don't have this problem, so a change made since 3.0 must be causing it.
Similar issue, about hhvm futex contention: https://github.com/facebook/hhvm/issues/5081
Very outdated.
Hello, I have a server with several cron jobs run by HHVM.
After a couple of hours, one process doesn't release the flock on .hhvm.hhbc.
Here is a strace of a new process:
An lsof to see the lock:
The strace of the process holding the lock:
ps output from this process:
How can I provide more useful data? I am currently not able to reproduce this issue except by letting the server run for a couple of hours.