dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org

[CI] Warnings about used ports #4806

Open fjetter opened 3 years ago

fjetter commented 3 years ago

Lately we have been observing hundreds of warnings about the dashboard port being occupied. While I understand that we may occasionally have an overlap, I would expect every test to close the HTTP server properly so that these overlaps happen only infrequently. I can't help but wonder whether this is connected to our recent CI instabilities and/or CI performance problems.

Example https://github.com/dask/distributed/pull/4805/checks?check_run_id=2556648953

```
distributed/tests/test_multi_locks.py::test_multiple_locks
  /home/runner/work/distributed/distributed/distributed/node.py:160: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 46611 instead
    warnings.warn(

distributed/tests/test_multi_locks.py::test_num_locks
  /home/runner/work/distributed/distributed/distributed/node.py:160: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 46079 instead
    warnings.warn(

distributed/tests/test_nanny.py::test_lifetime
  /home/runner/work/distributed/distributed/distributed/node.py:160: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 37239 instead
    warnings.warn(

distributed/tests/test_nanny.py::test_nanny_closes_cleanly
```
jrbourbeau commented 3 years ago

Thanks for raising an issue about this @fjetter -- I agree it seems like there are a lot more of these warnings lately. Perhaps there's a new test which leaves a server running, and this impacts all the subsequent tests? Regardless of the cause, someone going through and resolving these warnings would be very welcome.

martindurant commented 3 years ago

The top of the warnings:

```
distributed/comm/tests/test_ws.py::test_http_and_comm_server[False-wss://-security3-8787]
  /home/runner/work/distributed/distributed/distributed/comm/ws.py:318: RuntimeWarning: Dashboard and Scheduler are using the same server on port 8787, defaulting to the Scheduler's ssl context. Your dashboard could become inaccessible
    warnings.warn(

distributed/dashboard/tests/test_scheduler_bokeh.py::test_NBytesHistogram
  /home/runner/work/distributed/distributed/distributed/dashboard/utils.py:57: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
    if source.data == data:

distributed/deploy/tests/test_local.py::LocalTest::test_context_manager
  /home/runner/work/distributed/distributed/distributed/node.py:160: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 42851 instead
    warnings.warn(

distributed/deploy/tests/test_local.py::LocalTest::test_no_workers
  /home/runner/work/distributed/distributed/distributed/node.py:160: UserWarning: Port 8787 is already in use.
  Perhaps you already have a cluster running?
  Hosting the HTTP server on port 33321 instead
    warnings.warn(
```

Perhaps the first in this list is the cause; or the first appearance in distributed/deploy/tests/test_local.py::LocalTest::test_context_manager gives us an idea of where the actual problem might be.

fjetter commented 3 years ago

A simple test to reproduce:

```python
import pytest

from distributed import Scheduler


def test_init_twice_no_warning():
    with pytest.warns(None) as records:
        for _ in range(2):
            Scheduler()
    assert not records
```

The warning is caused by https://github.com/dask/distributed/blob/0a14902ecc88938c30e48e22f0bc1478f47bba7d/distributed/scheduler.py#L3367. The scheduler's init already opens a socket listening on that port, regardless of whether the server is ever "started". That pops up in tests like https://github.com/dask/distributed/blob/0a14902ecc88938c30e48e22f0bc1478f47bba7d/distributed/tests/test_scheduler.py#L319-L322 which do not go through the proper scheduler lifecycle. It will not show up in the context-manager cleanup because the HTTP server, being a Tornado async HTTP service, does not actually spawn any further threads; it simply binds the socket, and that is something we do not check.

That has been around for ages, but the warning was only introduced a few weeks ago in https://github.com/dask/distributed/pull/4786

The proper fix is probably to move the start of the HTTP services into the scheduler's start method and to ensure they are properly stopped in close. That requires a bit of refactoring, which is why I haven't opened a PR yet. Happy for somebody else to pick this up from here since I am currently swamped with other issues.
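For illustration, here is a minimal, self-contained sketch of that lifecycle pattern. It is not distributed's actual code (the `MiniNode` class and its attributes are made up); it just shows a Tornado HTTP server being bound in `start()` and released in `close()`, so that an instance which is constructed but never started does not occupy the port.

```python
from tornado.httpserver import HTTPServer
from tornado.web import Application


class MiniNode:
    """Toy stand-in for a server node, illustrating the proposed lifecycle."""

    def __init__(self, http_port=8787):
        # __init__ only records configuration; no socket is bound here.
        self.http_port = http_port
        self.http_server = None

    async def start(self):
        # Bind the HTTP/dashboard socket only when the node is started.
        self.http_server = HTTPServer(Application([]))
        self.http_server.listen(self.http_port)

    async def close(self):
        # Release the port symmetrically on close.
        if self.http_server is not None:
            self.http_server.stop()
            self.http_server = None
```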

mrocklin commented 3 years ago

The improper/short-term solution would be to convert that test to be async and close the scheduler when we're done in a finally block.
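A rough sketch of what that short-term version could look like (illustrative only: the test name and body are placeholders, it uses `gen_test` from `distributed.utils_test`, and it assumes `await scheduler.close()` also shuts the HTTP server down):

```python
from distributed import Scheduler
from distributed.utils_test import gen_test


@gen_test()
async def test_scheduler_placeholder():  # placeholder name for the real test
    s = Scheduler()  # __init__ already binds the default dashboard port (8787)
    try:
        await s  # actually start the scheduler
        ...  # original test body goes here
    finally:
        await s.close()  # always release the port, even if the test fails
```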

We might also include a check for warnings in our testing infrastructure

martindurant commented 3 years ago

Could we have a fixture that watches for processes called "dask-*" and kills them before each test? Or maybe the opposite: a fixture that errors if such a process still exists after a test finishes.
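As a sketch of that second variant (hypothetical, not part of the test suite; it relies on psutil and a made-up fixture name), it could look roughly like this:

```python
import psutil
import pytest


@pytest.fixture(autouse=True)
def no_leftover_dask_processes():
    yield
    # Fail the test if any "dask-*" process (e.g. dask-worker) outlives it.
    leftovers = [
        p for p in psutil.process_iter(["name"])
        if (p.info["name"] or "").startswith("dask-")
    ]
    assert not leftovers, f"dask processes still running after test: {leftovers}"
```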

> We might also include a check for warnings in our testing infrastructure

There are many warnings from all sorts of origins, e.g., the ever-changing set of dependencies and their deprecations.

martindurant commented 3 years ago

(sorry, reading back, I see this is not necessarily due to other processes)

fjetter commented 3 years ago

> There are many warnings from all sorts of origins, e.g., the ever-changing set of dependencies and their deprecations.

That would require a bit more maintenance effort, but those warnings could all either be dealt with or ignored in our config. I'm not opposed to the "check for warnings" approach, since we have tons of warnings connected to our own code that nobody is actually looking into.

> The improper/short-term solution would be to convert that test to be async

Another improper solution, at least for the log spam, would be to simply ignore the warnings in CI. That can easily be done with a CLI argument (or config), e.g. `-Wignore:Port`, but I'd assume we would never fix it then :)
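For reference, pytest can also scope such a filter to individual tests via its `filterwarnings` marker; a sketch of the mechanism (the test name is hypothetical):

```python
import pytest


@pytest.mark.filterwarnings("ignore:Port 8787 is already in use")
def test_with_port_warning_silenced():
    ...  # the UserWarning about port 8787 would not be reported here
```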

martindurant commented 3 years ago

> Another improper solution, at least for the log spam, would be to simply ignore the warnings in CI. That can easily be done with a CLI argument (or config) `-Wignore:Port` but I'd assume we would never fix it then

I think it is a fair question: what feedback do we actually want from the CI? A long list of PASSED is not very useful either; we really only need the list of failed tests and (optionally) their tracebacks. Similarly, pytest can produce a warnings summary.

fjetter commented 3 years ago

To catch all tests which leave a dangling HTTP server, one can use a fixture like this:

```python
import pytest


# autouse: applies to every test automatically when placed in conftest.py
@pytest.fixture(autouse=True)
def ensure_http_server_closed():
    yield
    from distributed import Scheduler

    for s in Scheduler._instances:
        assert s.http_server._stopped
```

Previous version:

```python
import pytest
from tornado.ioloop import IOLoop

from distributed.comm import registry
from distributed.comm.addressing import parse_address
from distributed.comm.core import CommClosedError


@pytest.fixture(autouse=True)
def ensure_http_server_closed():
    yield

    loop = IOLoop.current()

    async def _():
        # Connecting to the dashboard port should fail if every test
        # shut its HTTP server down properly.
        with pytest.raises(CommClosedError, match="Connection refused"):
            address = "127.0.0.1:8787"
            scheme, loc = parse_address(address)
            backend = registry.get_backend(scheme)
            connector = backend.get_connector()
            comm = await connector.connect(address)
            comm.abort()

    loop.run_sync(_)
```

This is not very robust, which is why I don't want to put it into our actual CI, but in case this problem pops up again it might be of help to future us.