fjetter opened this issue 3 years ago
Thanks for raising an issue about this @fjetter -- I agree it seems like there are a lot more of these warnings lately. Perhaps there's a new test which leaves a server running and this impacts all the subsequent tests? Regardless of the cause, someone going through and resolving these warnings would be very welcome.
The top of the warnings:
distributed/comm/tests/test_ws.py::test_http_and_comm_server[False-wss://-security3-8787]
/home/runner/work/distributed/distributed/distributed/comm/ws.py:318: RuntimeWarning: Dashboard and Scheduler are using the same server on port 8787, defaulting to the Scheduler's ssl context. Your dashboard could become inaccessible
warnings.warn(
distributed/dashboard/tests/test_scheduler_bokeh.py::test_NBytesHistogram
/home/runner/work/distributed/distributed/distributed/dashboard/utils.py:57: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
if source.data == data:
distributed/deploy/tests/test_local.py::LocalTest::test_context_manager
/home/runner/work/distributed/distributed/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42851 instead
warnings.warn(
distributed/deploy/tests/test_local.py::LocalTest::test_no_workers
/home/runner/work/distributed/distributed/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 33321 instead
warnings.warn(
Perhaps the first in this list is the cause; or the first appearance in distributed/deploy/tests/test_local.py::LocalTest::test_context_manager gives us an idea of where the actual problem might be.
Simple test to reproduce:
import pytest

from distributed import Scheduler


def test_init_twice_no_warning():
    with pytest.warns(None) as records:
        for _ in range(2):
            Scheduler()
    assert not records
and is caused by https://github.com/dask/distributed/blob/0a14902ecc88938c30e48e22f0bc1478f47bba7d/distributed/scheduler.py#L3367
The scheduler's init is already opening a socket and listening on that port, regardless of whether the server is ever "started". That pops up in tests like https://github.com/dask/distributed/blob/0a14902ecc88938c30e48e22f0bc1478f47bba7d/distributed/tests/test_scheduler.py#L319-L322 which do not perform the proper scheduler lifecycle. It does not show up in the context manager cleanup because the HTTP server does not spawn any further threads (it is a tornado async HTTP service); it simply binds the socket, and that is something we do not check.
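To see that the socket is indeed bound without the scheduler ever being started, a quick check along these lines should do (an illustration only, using the same synchronous construction as the reproducer above):

import socket

from distributed import Scheduler

# Construct, but never start, a scheduler.
s = Scheduler(dashboard_address=":8787")

# The HTTP socket is nevertheless already bound and accepting
# TCP connections, even though start() was never called.
with socket.create_connection(("127.0.0.1", 8787), timeout=1):
    print("port 8787 is already in use")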
That early binding has been around for ages, but the warning itself was only introduced a few weeks ago in https://github.com/dask/distributed/pull/4786
The proper fix is probably to move the startup of the HTTP services into the scheduler's start method and to ensure they are properly stopped and closed in close. That requires a bit of refactoring, which is why I haven't opened a PR yet. Happy for somebody else to pick this up from here since I am currently swamped with other issues.
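A rough sketch of what that refactor could look like (illustrative only; the names and structure here are not the actual distributed internals):

from tornado.httpserver import HTTPServer
from tornado.web import Application


class SchedulerSketch:
    """Illustrative: bind the HTTP port in start(), not in __init__."""

    def __init__(self, http_port=8787):
        self.http_port = http_port
        self.http_server = None  # nothing is bound yet

    async def start(self):
        # Only now is the socket bound and listening.
        self.http_server = HTTPServer(Application([]))
        self.http_server.listen(self.http_port)

    async def close(self):
        # Stop listening so the port is freed for the next instance.
        if self.http_server is not None:
            self.http_server.stop()
            self.http_server = None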
The improper/short-term solution would be to convert that test to be async and close the scheduler when we're done in a finally block.
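Roughly like this (a sketch using pytest-asyncio; the real suite has its own async test helpers):

import pytest

from distributed import Scheduler


@pytest.mark.asyncio
async def test_scheduler_releases_ports():
    # port=0 / ":0" pick free ports so parallel tests don't collide
    s = Scheduler(port=0, dashboard_address=":0")
    try:
        await s  # go through the proper start lifecycle
        # ... the actual assertions go here ...
    finally:
        await s.close()  # also stops the HTTP server, freeing its port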
We might also include a check for warnings in our testing infrastructure
Could we have a fixture that watches for processes called "dask-*" and kills them before each test? Or maybe the opposite: a fixture that errors if such a process still exists after a test finishes.
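Something along these lines could work for the process variant (a sketch assuming psutil; the dask-* matching is naive):

import psutil
import pytest


@pytest.fixture(autouse=True)
def no_leaked_dask_processes():
    yield
    leaked = [
        p
        for p in psutil.process_iter(["name"])
        if (p.info["name"] or "").startswith("dask-")
    ]
    # Alternatively, p.kill() here to clean up before the next test.
    assert not leaked, f"dask processes still alive after test: {leaked}"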
> We might also include a check for warnings in our testing infrastructure
There are many warnings from all sorts of origins, e.g., the ever-changing set of dependencies and their deprecations.
(sorry, reading back, I see this is not necessarily due to other processes)
> There are many warnings from all sorts of origins, e.g., the ever-changing set of dependencies and their deprecations.
That would require a bit more maintenance effort, but these could all either be dealt with or be ignored in our config. I'm not opposed to the "check for warnings" approach, since we have tons of warnings connected to our own code that nobody is actually looking into.
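As an illustration, that triage could live centrally in a conftest.py (a sketch only; the concrete filter entries would need curating):

# conftest.py -- handle known warnings in one place instead of letting
# them scroll by in the CI logs; later entries take precedence.
def pytest_configure(config):
    # Known issue (this one): silence it until the proper fix lands.
    config.addinivalue_line(
        "filterwarnings", "ignore:Port 8787 is already in use:UserWarning"
    )
    # Example: a dependency deprecation we have decided to ignore for now.
    config.addinivalue_line(
        "filterwarnings", "ignore::DeprecationWarning:bokeh"
    )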
> The improper/short-term solution would be to convert that test to be async and close the scheduler when we're done in a finally block.
Another improper solution, at least for the log spam, would be to simply ignore the warnings in CI. That can easily be done with a CLI argument (or config), e.g. -Wignore:Port, but I'd assume we would never fix it then :)
> Another improper solution, at least for the log spam, would be to simply ignore the warnings in CI. That can easily be done with a CLI argument (or config) -Wignore:Port but I'd assume we would never fix it then
I think it is a fair question: what feedback do we actually want from the CI? A long list of PASSED is not very useful either; what we really want is the list of failed tests and (optionally) their tracebacks. Similarly, pytest can produce a warnings summary.
For catching all tests which leave a dangling HTTP server, one can use a fixture like this:
import pytest


@pytest.fixture(autouse=True)
def ensure_http_server_closed():
    yield
    from distributed import Scheduler

    # Every scheduler instance created during the test must have
    # stopped its HTTP server by now.
    for s in Scheduler._instances:
        assert s.http_server._stopped
Previous version:
import pytest
from distributed.comm.core import CommClosedError


@pytest.fixture(autouse=True)
def ensure_http_server_closed():
    from tornado.ioloop import IOLoop

    from distributed.comm import registry
    from distributed.comm.addressing import parse_address

    yield
    loop = IOLoop.current()

    async def _():
        # If every test closed its HTTP server, connecting to the default
        # dashboard port must fail with "Connection refused".
        with pytest.raises(CommClosedError, match="Connection refused"):
            address = "127.0.0.1:8787"
            scheme, loc = parse_address(address)
            backend = registry.get_backend(scheme)
            connector = backend.get_connector()
            comm = await connector.connect(address)
            comm.abort()

    loop.run_sync(_)
This is not very robust, which is why I don't want to put it into our actual CI, but in case this problem pops up again it might be of help to future-us.
Lately we are observing hundreds of warnings about the dashboard ports being occupied. While I understand that we may occasionally have an overlap, I would expect every test to close its HTTP server properly, such that these overlaps happen only infrequently. I can't help but wonder whether this connects to our recent CI instabilities and/or CI performance.
Example: https://github.com/dask/distributed/pull/4805/checks?check_run_id=2556648953