Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes

toddr commented 4 years ago

I've noticed this failure on several CI tests which were completely unrelated to this test.

2020-09-13T19:19:46.2191182Z sh: line 1:  6654 Killed                  /home/runneradmin/work/perl.exe "-I../lib" "-I." "tmp6570A" 2>&1
2020-09-13T19:19:47.1516660Z # Failed test 955 - Test Perl 73464 at ./test.pl line 1065
2020-09-13T19:19:47.1517926Z #      got "# Test process timed out - terminating\nkill: 6654: No such process"
2020-09-13T19:19:47.1518573Z # expected "Timeout"
2020-09-13T19:19:47.1518994Z # PROG: 
2020-09-13T19:19:47.1519346Z # 
2020-09-13T19:19:47.1519921Z #                 BEGIN{require q(test.pl);}
2020-09-13T19:19:47.1520517Z #                 watchdog(3);
2020-09-13T19:19:47.1521025Z #                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
2020-09-13T19:19:47.1521437Z #                 alarm 1;
2020-09-13T19:19:47.1521838Z #                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
2020-09-13T19:19:47.1522235Z #                 /.*a.*b.*c.*[de]/;
2020-09-13T19:19:47.1522593Z #             
2020-09-13T19:19:47.1522984Z # STATUS: 35072
2020-09-13T19:19:58.3677295Z t/re/pat ....................................................... FAILED at test 955

Re-running the entire workflow usually does not fail.

toddr commented 4 years ago

Seen in a chat. I think this was the same problem:

I have a test VM reproducing the cygwin failures that keep showing up on github. It’s the test on line 1905 of t/re/pat.t. I believe it’s either the signals from the alarms showing up out of order, or the kill of the watcher process just unblocking it from its sleep(), or the print statements showing up out of order. Increasing the timeout of the watchdog process to 10 seconds decreases the liklihood significantly, but doesn’t eliminate it entirely. Need to dig into it more when I have free time.

It might fix the problem if the watchdog process used a select() loop for it’s timeout rather than sleep() + alarm(). I’d bet it’s just an oddity of the way signals are implemented on cygwin.

wchristian commented 4 years ago

There's two bugs here:

For some reason sometimes alarm doesn't fire in time, and the watchdog fires instead. This is the actual bug. I'm presuming performance reasons right now.
For some reason the cygwin kill logic is silly. As you can see, the system kill is called unconditionally on cygwin, without checking if the previous kill worked or not. This results in the watchdog termination sometimes ALSO having a spurious error message which should however either: Be ignored, or the logic be augmented to check if the process still exists before trying to kill it. NOTE: This doesn't address why the process is being eaten by the watchdog, it just addresses the kill error, which is really ... irrelevant.
```
        # Kill test process if still running
        if (kill(0, $pid_to_kill)) {
            _diag($timeout_msg);
            kill('KILL', $pid_to_kill);
            if ($is_cygwin) {
                # sometimes the above isn't enough on cygwin
                sleep 1; # wait a little, it might have worked after all
                system("/bin/kill -f $pid_to_kill");
            }
        }
```

lightsey commented 4 years ago

I spent some time a while back trying to work around this buggy behavior. I believe the introduction of ConPTY support in Cygwin made it unreliable. It seems to disappear so long as the cygwin server is enabled and the cygwin terminal process is started with CYGWIN=disable_pcon set.

There's also a seemingly related problem in the way S_exit_warning() vetos the Perl interpreter cleanup and emits warnings when a child thread calls exit(). In a nutshell, there's no reason to veto the cleanup if all of the threads are also in a joinable state. This seems to happen frequently on cygwin when a child thread calls exit().

The top three commits in this branch have the code I was experimenting with to fix this problem, but I didn't figure out a way to get the cygwin shell to start properly with disable_pcon set.

https://github.com/lightsey/perl5/tree/cygwin-threads

toddr commented 4 years ago

The relevant attempt seems to have been https://github.com/Perl/perl5/commit/26eacad2e188e46ec495697b653e2d381e0ce162

toddr commented 4 years ago

I feel like we should put a skip in this unless someone wants to come up with a fix. IMO it's not cool when all our commits are decorated red because of this failure. Opinions?

wchristian commented 4 years ago

You could try increasing the timeout first to see if that helps.

Also please do fix the secondary bug regardless. I thought my explanation was fairly thorough?

toddr commented 4 years ago

You could try increasing the timeout first to see if that helps.

It helps but only changes the rate of failure. It does not fix anything. See: https://github.com/Perl/perl5/issues/18129#issuecomment-691722685

Also please do fix the secondary bug regardless. I thought my explanation was fairly thorough?

Can you produce a pull request?

wchristian commented 4 years ago

I saw the comment. It refers to their test VM. We have no data on what, e.g. a 30 second watchdog would do on the github VM.

Can you produce a pull request?

I'd be happier to do it if i had less rocks in my way, but sure, i'll produce initial PRs.

lightsey commented 4 years ago

A simple way of reproducing the critical part of the bug is this oneliner:

PERL5LIB=`pwd`/lib ./perl.exe -e 'no warnings; use threads; threads->create(sub {exit(99)});sleep(5);exit(98);' 2>/dev/null; echo $?

On Linux that will reliably show an exit code of 99. On Cygwin is will frequently show an exit code of 0. If you strace it on Cygwin it will start reliably showing an exit code of 99. If you remove the 2>/dev/null and allow the warnings to reach the console on Cygwin it will generally show an exit code of 99.

The same logic implemented in C with pthreads works reliably on Cygwin and Linux.

Using the version of Perl 5.30.3 that comes with cygwin also works reliably, but building Perl 5.30.3 from source on Cygwin shows this buggy behavior.

I'm at a loss to explain why Perl misbehaves in this scenario, but it seems to be caused by a fairly tight race condition in the way the exit codes are handled when a child thread exits. It also seems like either something in the cygwin patches for Perl or the cygwin package build toolchain fixes it.

The timeouts and oddities of how the watchdog() code works are red herrings from what I can tell.

lightsey commented 4 years ago

It's also worth noting that from what I've seen, starting the cygwin SSH daemon and connecting over SSH seems to fix the behavior, and starting the Cygwin shell with CYGWIN=disable_pcon set also seems to fix it. Unfortunately, that doesn't seem to work in the github workflow file and doesn't explain why the version of Perl packaged with Cygwin doesn't exhibit the same issue.

It could be that some of these variations just make the Perl exit logic run slightly faster or slower and avoid whatever race condition is the root cause of this bug.

wchristian commented 4 years ago

I had sent you some stuff on IRC as per toddr.

Anyhow, if you're on it and not letting this be closed with a simple skip, yay.

Edit:

The timeouts and oddities of how the watchdog() code works are red herrings from what I can tell.

Yes, but they can work around the issue so it remains usable until such time as a fix is found. They are not not now and were never intended to be the fix. Plus the watchdog oddity with kill was a legit bug.

lightsey commented 4 years ago

My testing showed that increasing the timeout didn't eliminate the flapping behavior. I bet the simplest way to avoid the flapping tests would be one of these options...

disable cygwin in the smoke tests
compile without threads on cygwin for the smoke tests
rewrite the watchdog() code to use forks on cygwin
rewrite the failing tests to look for a zero exit code rather than a non-zero exit code

I'm at a loss for how to fix the thread bug itself. I asked Achim Gratz (the cygwin perl package maintainer) to take a look at this issue.

wchristian commented 4 years ago

@lightsey I have not been able to reproduce with the snippet you gave above.

https://gist.github.com/wchristian/ca33e766c73b82f9fd3026abad1e7634

wchristian commented 4 years ago

We tried reproducing it on my system, with the check PASSing every time:

https://gist.github.com/wchristian/ad4971132939899e4c52f310d830cfc1

Trying to see with watchdog changes now:

https://github.com/wchristian/perl5/actions/runs/260042599 https://github.com/wchristian/perl5/actions/runs/260048164

Also gonna add a test that runs the snippet a thousand times to get a better idea of what actually happens.

wchristian commented 4 years ago

So in this branch adding a longer timeout seems to have improved things.

https://github.com/wchristian/perl5/actions/runs/260165383

jkeenan commented 4 years ago

FWIW, a recent github action run on cygwin in a "Perl 7" strict-by-default environment came up with this error:

2020-09-17T22:49:50.6092178Z sh: line 1:  6834 Killed                  /home/runneradmin/work/perl.exe "-I../lib" "-I." "tmp6750A" 2>&1
2020-09-17T22:49:51.5535604Z # Failed test 955 - Test Perl 73464 at ./test.pl line 1069
2020-09-17T22:49:51.5539809Z #      got "# Test process timed out - terminating\nkill: 6834: No such process"
2020-09-17T22:49:51.5541105Z # expected "Timeout"
2020-09-17T22:49:51.5541705Z # PROG: 
2020-09-17T22:49:51.5542065Z # 
2020-09-17T22:49:51.5542457Z #                 BEGIN{require q(test.pl);}
2020-09-17T22:49:51.5542907Z #                 watchdog(3);
2020-09-17T22:49:51.5543358Z #                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
2020-09-17T22:49:51.5543776Z #                 alarm 1;
2020-09-17T22:49:51.5544368Z #                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
2020-09-17T22:49:51.5544882Z #                 /.*a.*b.*c.*[de]/;
2020-09-17T22:49:51.5545233Z #             
2020-09-17T22:49:51.5545563Z # STATUS: 35072
2020-09-17T22:50:02.1976770Z t/re/pat ....................................................... FAILED at test 955

From: https://pipelines.actions.githubusercontent.com/StwKAELrIIfJZjsjvFeJdim2u3AOp8ez6qvldOETh5vCDDWoz5/_apis/pipelines/1/runs/780/signedlogcontent/32?urlExpires=2020-09-18T00%3A11%3A44.6877392Z&urlSigningMethod=HMACV1&urlSignature=H0qYt4FXy2QKhfJVmRkN2gAngBWtZR82d4Wc5lBv4H4%3D

wchristian commented 4 years ago

@jkeenan that link's dead

edit: also you wanna cherry-pick https://github.com/wchristian/perl5/commit/201536353020e10d804d56e7b50acdfd5454481b

jkeenan commented 4 years ago

@jkeenan that link's dead

2020-09-17T22:49:50.6092178Z sh: line 1:  6834 Killed                  /home/runneradmin/work/perl.exe "-I../lib" "-I." "tmp6750A" 2>&1
2020-09-17T22:49:51.5535604Z # Failed test 955 - Test Perl 73464 at ./test.pl line 1069
2020-09-17T22:49:51.5539809Z #      got "# Test process timed out - terminating\nkill: 6834: No such process"
2020-09-17T22:49:51.5541105Z # expected "Timeout"
2020-09-17T22:49:51.5541705Z # PROG: 
2020-09-17T22:49:51.5542065Z # 
2020-09-17T22:49:51.5542457Z #                 BEGIN{require q(test.pl);}
2020-09-17T22:49:51.5542907Z #                 watchdog(3);
2020-09-17T22:49:51.5543358Z #                 $SIG{ALRM} = sub {print "Timeout\n"; exit(1)};
2020-09-17T22:49:51.5543776Z #                 alarm 1;
2020-09-17T22:49:51.5544368Z #                 $_ = "a" x 1000 . "b" x 1000 . "c" x 1000;
2020-09-17T22:49:51.5544882Z #                 /.*a.*b.*c.*[de]/;
2020-09-17T22:49:51.5545233Z #             
2020-09-17T22:49:51.5545563Z # STATUS: 35072
2020-09-17T22:50:02.1976770Z t/re/pat ....................................................... FAILED at test 955

wchristian commented 4 years ago

Very odd, on one of my smoke runs a github vm saw this failure:

t/op/alarm ..................................................... FAILED--expected 5 tests, saw 0

wchristian commented 4 years ago

And @jkeenan, yeah, that's the exact same error we're seeing here. Seems to be caused by github VMs, on account of not occurring with cygwin on iron. Also you wanna cherry-pick wchristian@2015363

wchristian commented 4 years ago

I've implemented a test that should fail every time under cygwin on github vms, and not take too long on other systems:

https://github.com/wchristian/perl5/commit/e8eafd916f8f5b1e7df46865be5198f0c4c58909

If that is at 100% fail we can mark it as TODO.

wchristian commented 4 years ago

PR made: https://github.com/Perl/perl5/pull/18149

wchristian commented 4 years ago

The pull request earlier seems to have successfully stopped re/pat.t from flapping, but instead fails consistently as a TODO.

This means the specific bug is now:

on Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes

wchristian commented 4 years ago

This one also needs the mswin32 label replaced with the distro-cygwin one.

Perl / perl5

Github Windows VMs, alarm() sometimes fails to fire at all during long-running regexes #18129