dabeaz / curio


Test failure /deadlock on running tests in parallel #264

Closed skgbanga closed 5 years ago

skgbanga commented 6 years ago

On noticing that the tests take a rather long time to run, I installed pytest-xdist to run them in parallel, and suddenly started seeing failures and deadlocks.

Running it in a single process:

/usr/bin/time pytest
...
...
======= 262 passed, 1 skipped, 4 warnings in 96.29 seconds 
5.63user 0.45system 1:36.69elapsed 6%CPU (0avgtext+0avgdata 120944maxresident)k
39136inputs+104outputs (73major+110586minor)pagefaults 0swaps
Running it in parallel:

/usr/bin/time pytest -n auto
...
[gw1] [ 83%] PASSED tests/test_workers.py::test_bad_cpu 
[gw6] [ 84%] PASSED tests/test_workers.py::test_worker_cancel[run_in_process] 
[gw2] [ 84%] PASSED tests/test_workers.py::test_blocking 

It just hangs on the above line.

Some of the failures in the tests were due to this:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/sandeep/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/sandeep/miniconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sandeep/workspace/curio/curio/monitor.py", line 117, in server
    sock.bind(self.address)
OSError: [Errno 98] Address already in use
dabeaz commented 6 years ago

Given that Curio is already doing a lot with event loops, threads, and processes all on its own, I'm not at all surprised that its test suite would fail in this kind of environment. I'm not sure there is an easy fix.

njsmith commented 6 years ago

"Address already in use" issues can generally be cleaned up by avoiding hard-coding addresses in tests, or even better letting the test bind an arbitrary open port and then passing that into the client. This is kinda tiresome, but certainly doable. The test_blocking thing would need to be debugged, of course. Or you could probably make the test suite faster by going through and removing calls to sleep, etc.; again, tiresome and requires diligence, but doable. You can pass --durations=10 to get pytest to list the slowest tests, which are a good place to start this kind of optimization.

In case it's any inspiration / competitive motivation, the trio test suite on my laptop currently takes 6.4 seconds to run 422 tests, single-threaded :-). (And with -n auto it passes, but takes 9.5 seconds... I guess the tests are so fast that the parallel scheduler overhead dominates.)

dabeaz commented 6 years ago

The clock stuff in Trio is pretty neat--that's something I should look at more sometime. I think the main issue for me is that the Curio test suite hasn't crossed my personal threshold of annoyance according to this chart https://xkcd.com/1205/

I'm definitely open to PRs or anything else that make it go faster though.

njsmith commented 6 years ago

> https://xkcd.com/1205/

Right, the reason I've prioritized test speed is because (a) it's part of my evil plan to make it fun and easy for other people to contribute to trio, (b) it's part of my evil plan to make life more pleasant for trio's users, on the theory that if I insist on making trio's test suite run fast then it will force us to invent the tools our users will need to make their test suites fast. So I guess this is a cute example of how the projects' different emphases lead to all kinds of little different decisions...

dabeaz commented 6 years ago

Fortunately, if people heed my advice on not using Curio (even though they should be using Curio), they shouldn't have any code that needs testing (even though they probably will). So, er, well, I'm not sure what I'm saying exactly except that as an open source project, it's not practical for me to work on every possible problem all at once. This is definitely an area where Curio could be improved and where PRs would be welcome though.

On the plus side, the extra time spent running the Curio test suite will be more than made up for by Curio's performance running echo-server benchmarks. So, there is still that at least! ;-)

skgbanga commented 6 years ago

I created a PR for this.

Final timings:

(curio) $ pytest -n auto
-- Docs: http://doc.pytest.org/en/latest/warnings.html
======= 262 passed, 1 skipped, 4 warnings in 14.55 seconds 

A reduction of 82 seconds per run. Not bad.

BUT I don't think this is the perfect fix. I ran the above code around 20 times, and once I did see a deadlock. :/ At that point, curio.monitor gave this:

curio > ps
Task   State        Cycles     Timeout Sleep   Task                                               
------ ------------ ---------- ------- ------- --------------------------------------------------
1      FUTURE_WAIT  1          None    None    Monitor.monitor_task                              
2      READ_WAIT    9          None    None    Kernel._run_coro.<locals>._kernel_task            
60     EVENT_WAIT   1          None    None    test_recv_timeout.<locals>.canceller     

I don't think the above ps output is very helpful given how much is going on, but I thought I would mention it anyway.

imrn commented 6 years ago

On 5/25/18, David Beazley notifications@github.com wrote:

> On the plus side, the extra time spent running the Curio test suite will be more than made up for by Curio's performance running echo-server benchmarks. So, there is still that at least! ;-)

Keeping curio simple is an effective way to protect against bugs. Even complexity in the test suite increases its footprint, and will also make it harder to contribute. I don't think all async projects should have a mindset similar to trio's. Fanciness is not a virtue.

dabeaz commented 5 years ago

I don't have any particular thoughts on running tests in parallel, but I have been bitten by tests running with a predefined port number (port 25000). So, I incorporated a pull request to fix that. If it fixes other things, all the better.