Closed windowshopr closed 2 years ago
@windowshopr 2022.2.0 is a relatively old version, and a lot has changed since then. Can you upgrade to the latest version of dask and distributed and try again? If you're still having the same problem (wouldn't be surprised if you are), can you re-post the traceback you're seeing with the latest version?
It is? It's the latest stable version that gets installed when I run pip install -U dask distributed, or when I do:
git clone https://github.com/dask/dask.git
cd dask
python -m pip install -e .
How do I get a different version? What command should I run?
Cancel that, I see 2022.7 in the changelogs. Sorry. Will post in a bit!
GOOD NEWS!!!! 😄 haha the update helped. It was a bit of a process, though, so I'll reproduce all my steps here for others in a similar boat in the future.
I was running Python 3.7.9, which is why the install wasn't grabbing the latest version of distributed. I've upgraded both machines to the latest Python, 3.10.5, and ran pip install -U dask[complete], then installed TPOT and the other dependencies again (after troubleshooting some errors) with:
pip install "setuptools<58"
pip install deap update_checker tqdm stopit xgboost "fsspec>=0.3.3"
pip install tpot
pip install dask_ml --user
pip install torch torchvision torchaudio
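Since the root cause here was an old interpreter holding pip back to an old release, it can help to confirm what each machine actually ended up with before starting the cluster. The following is a minimal sketch, not part of the original steps; the helper names installed_version and environment_report are made up for illustration, and it relies only on the standard library's importlib.metadata (available from Python 3.8).

```python
import sys


def installed_version(package):
    """Return the installed version string of `package`, or None if unavailable."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # importlib.metadata does not exist before Python 3.8.
        return None
    try:
        return version(package)
    except PackageNotFoundError:
        # The package is not installed in this environment.
        return None


def environment_report(packages=("dask", "distributed")):
    """Collect the Python version and the versions of the given packages."""
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        report[name] = installed_version(name)
    return report


# Run this on both machines and compare the two dictionaries.
print(environment_report())
```

If the two machines report different dask or distributed versions, it is probably worth aligning them before digging into timeouts.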
NOW, I followed the same steps as before, only this time, for testing, I did NOT set a worker-port on either machine, so the steps now look like this:
STEPS TO REPRODUCE
On Machine 1, run dask-scheduler to start the scheduler.
On Machine 1, start a worker: dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4
On Machine 2, start a worker: dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4
So effectively, Machine 1 connects to itself (the scheduler), and Machine 2 connects to machine 1's scheduler. When inspecting http://localhost:8787/status, I see both workers connected and ready to go.
When I run the sample code in VS Code, both machines spin up like normal (albeit much faster this time, not much waiting 👍), and I see outputs in the worker PowerShell windows from sklearn, so the first TPOT generation is running on both machines, which is good. And after a few minutes of the scheduler coordinating some results, I see progress in the tqdm output of TPOT:
WOOHOO!!!
So now I'm running a LAN cluster for my TPOT run!
Thanks @gjoseph92 for the direction! Closing issue now.
I opened issue #6731 there by accident; as that’s a slightly separate issue, I was advised to open it here. I took some time to make it easy to read and reproduce, with minimal reproducible code.
I'm having this issue currently. I can provide reproducible code and some details about my setup. This is an extension of my question asked on SO here; I figured out that question, but am now stuck on this Timed out issue.
PROBLEM DESCRIPTION
I'm trying to utilize a distributed cluster across 2 machines for a TPOT machine learning run. I do not think this is an issue with TPOT but rather with distributed. I'm able to set up my scheduler and have both machines connect to it, and the TPOT run starts, but after 2 minutes I get a timed out error on one of the workers, even though both machines are still processing something.
ENVIRONMENT
NETWORK MAP
Machine 1 (Main) - Local IP Address: 172.16.1.113
Machine 2 (Secondary) - Local IP Address: 172.16.1.82
On both machines, I have opened inbound/outbound rules for ports 8786, and 8789-8795 for communication purposes/to try and mitigate any dumb Windows firewall issues.
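To check whether those firewall rules actually let traffic through, a plain TCP connection attempt from the other machine is a quick test. This is a rough sketch using only Python's standard socket module; the function names port_is_open and check_ports are hypothetical, and the host and port values in the comment are simply the ones listed above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to whether a TCP connection to it succeeded."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Example (run from Machine 2 while the scheduler and worker are up on Machine 1):
#   check_ports("172.16.1.113", [8786, *range(8789, 8796)])
```

Note that a port also reports as closed when nothing is listening on it, so a False result for 8789-8795 only points at the firewall if a worker is known to be bound to that port at the time.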
STEPS TO REPRODUCE
On Machine 1, run dask-scheduler to start the scheduler.
On Machine 1, start a worker: dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4 --worker-port=8789
On Machine 2, start a worker: dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4 --worker-port=8789
So effectively, Machine 1 connects to itself (the scheduler), and Machine 2 connects to machine 1's scheduler. When inspecting http://localhost:8787/status, I see both workers connected and ready to go.
I then run the sample code (n_samples=5000):
After a few minutes, both machines get spun up and resources are being used. Here is from Machine 1 (main):
Now, after 2 minutes (as that is what I set in the distributed.yaml config file), on Machine 1's worker PowerShell window (NOT the scheduler window), I get the following traceback:
The timed out message comes back every 2 minutes from then on. Both workers continue working for a while (several minutes, doing the TPOT machine learning stuff), then they spin down, and when I check the TPOT output, it looks like nothing has happened that whole time...
So it just hangs, the scheduler is still going, both workers are still showing up in the dashboard, but the worker PowerShell window on Machine 1 just keeps repeating the timed out message and the TPOT run doesn't progress.
ACTIONS ALREADY TAKEN
There was a 60s timeout in the config file, which I increased to 120s. I COULD increase it more, however I'm not 100% sure this is the issue.
Tried setting the worker port to 8789 on both workers. Have also tried setting 1 worker's port to 8789 and the other to 8790 to mitigate same port issues.
I hope this has been detailed enough for reproducibility. I know I'm running TPOT here, but I think it's a distributed issue re: connections timing out in the cluster.
As you can see, I'm sort of new to using distributed with TPOT and running it from the command line/PowerShell. The only guide on it shows how to run a local cluster (on one machine), not multiple machines, and doesn't cover how to deal with connections timing out like this. I also referenced this page for these issues, however I've already changed the config file and am running all upgraded dependencies.
Would love some input on this!!! I want to crank up both machines to run my TPOT! Thanks!!!!