dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

2 machines, timed out after X seconds #6768

Closed windowshopr closed 2 years ago

windowshopr commented 2 years ago

I accidentally raised this in a separate issue, #6731, which is about a slightly different problem, so I was advised to open it here. I've taken some time to make it easy to read and reproduce with minimal code.

I'm currently having this issue. I can provide reproducible code and some details about my setup. This is an extension of my question asked on SO here; I've since figured that question out, but am now stuck on this timed-out issue.

PROBLEM DESCRIPTION

I'm trying to use a distributed cluster across 2 machines for a TPOT machine learning run. I don't think this is an issue with TPOT but rather with distributed. I'm able to set up my scheduler and have both machines connect to it, and the TPOT run starts, but after 2 minutes I get a timed-out error on one of the workers, even though both machines are still processing something.

ENVIRONMENT

2 x Windows 10 machines/workers
Python 3.7.9
TPOT==0.11.7
scikit-learn==1.0.2
dask==2022.2.0
distributed==2022.2.0
numpy==1.21.4
pandas==1.2.5

NETWORK MAP

Machine 1 (Main) - Local IP Address: 172.16.1.113
Machine 2 (Secondary) - Local IP Address: 172.16.1.82

On both machines, I have opened inbound/outbound firewall rules for ports 8786 and 8789-8795 to rule out any dumb Windows firewall issues.
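
For what it's worth, here is a minimal sketch (a hypothetical helper, not part of my actual steps) of a reachability check I can run from Machine 2 to confirm the scheduler port isn't being blocked by the firewall:

# Minimal sketch: from Machine 2, confirm Machine 1's scheduler port is reachable.
# The address/port are the ones used throughout this issue; adjust as needed.
import socket

with socket.create_connection(("172.16.1.113", 8786), timeout=5) as sock:
    print("Scheduler port reachable, local endpoint:", sock.getsockname())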

STEPS TO REPRODUCE

  1. Machine 1 - Open PowerShell window (as admin) and run the command dask-scheduler to start the scheduler.
  2. Machine 1 - Open second PowerShell window (as admin) and run the command dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4 --worker-port=8789
  3. Machine 2 - Open PowerShell window (as admin) and run the command dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4 --worker-port=8789

So effectively, Machine 1 connects to itself (the scheduler), and Machine 2 connects to Machine 1's scheduler. When inspecting http://localhost:8787/status, I see both workers connected and ready to go.

  4. Run this minimal reproducible code on Machine 1 in another PowerShell window or your IDE of choice. This code closely replicates my real use case, hence the shape and class-weight imbalance (don't worry about that for now; if the dataset is too big for your machine, downsize it, e.g. change to n_samples=5000):
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, TimeSeriesSplit
from tpot import TPOTClassifier

# ------------------------------------------------------------------------------------------------ #
#                                  START WORKER/CLIENT IN SCRIPT                                   #
# ------------------------------------------------------------------------------------------------ #
client = Client("tcp://172.16.1.113:8786")

# ------------------------------------------------------------------------------------------------ #
#                                    MAKE CLASSIFICATION DATASET                                   #
# ------------------------------------------------------------------------------------------------ #
X, y = make_classification(n_samples=10000,
                           n_features=538,
                           n_informative=200,
                           n_classes=3,
                           weights={0:0.996983388, 
                                    1:0.001515257,
                                    2:0.001501355,
                                    },
                           random_state=42,
                           )

# ------------------------------------------------------------------------------------------------ #
#                                         TRAIN TEST SPLIT                                         #
# ------------------------------------------------------------------------------------------------ #
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.15,
                                                    )
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# ------------------------------------------------------------------------------------------------ #
#                                    CREATE THE TPOT CLASSIFIER                                    #
# ------------------------------------------------------------------------------------------------ #
tpot = TPOTClassifier(generations=100, 
                     population_size=40,
                     offspring_size=None, 
                     mutation_rate=0.9,
                     crossover_rate=0.1,
                     scoring='balanced_accuracy',
                     cv=TimeSeriesSplit(n_splits=3), # Using time series split here
                     subsample=1.0, 
                    #  n_jobs=-1,
                     max_time_mins=None, 
                     max_eval_time_mins=10, # 5
                     random_state=None, 
                    #  config_dict=classifier_config_dict,
                     template=None,
                     warm_start=False,
                     memory=None,
                     use_dask=True,
                     periodic_checkpoint_folder=None,
                     early_stop=2,
                     verbosity=2,
                     disable_update_check=False)

results = tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_pipeline.py')

# Now check http://localhost:8787/status and resources on both worker machines
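
As a side note, here is a minimal sketch of how I could confirm from the script itself (rather than the dashboard) that both workers are registered; I'm assuming client.scheduler_info() is the right call for listing connected workers:

# Minimal sketch: list the workers the scheduler currently knows about.
info = client.scheduler_info()
for addr, worker in info["workers"].items():
    print(addr, "threads:", worker.get("nthreads"))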

After a few minutes, both machines spin up and resources are being used. Here is a screenshot from Machine 1 (main):

[screenshot: resource usage on Machine 1]

Now, after 2 minutes (the timeout I set in the distributed.yaml config file), Machine 1's worker PowerShell window (NOT the scheduler window) shows the following traceback:

[screenshot: timed-out traceback from Machine 1's worker window]

The timed-out message then comes back every 2 minutes. Both workers keep working for a while (several minutes, doing the TPOT machine learning stuff), then they spin down, and when I check the TPOT output it looks like nothing has happened that whole time...

[screenshot: TPOT progress output showing no completed pipelines]

So it just hangs: the scheduler is still running and both workers still show up in the dashboard, but the worker PowerShell window on Machine 1 keeps repeating the timed-out message and the TPOT run doesn't progress.

ACTIONS ALREADY TAKEN

  1. Increased the default 60s timeout in the config file to 120s. I COULD increase it further, but I'm not 100% sure this is the issue (see the sketch after this list).
  2. Specified worker port 8789 on both machines. Also tried setting one worker's port to 8789 and the other's to 8790 to rule out same-port issues.
  3. Decreased the dataset's size to:
    X, y = make_classification(n_samples=5000,
                               n_features=150,
                               n_informative=50,
                               n_classes=3,
                               weights={0: 0.996983388,
                                        1: 0.001515257,
                                        2: 0.001501355,
                                        },
                               random_state=42,
                               )
  4. Let the script run for an hour, only to come back to TPOT's progress still showing 0% and 0 pipelines tested, and both workers stalled, with a bunch of those timed-out errors in Machine 1's worker PowerShell window.
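
For reference, here is a minimal sketch of the timeout change, done from Python before creating the Client instead of editing distributed.yaml (I'm assuming the comm timeouts are the relevant settings here; the workers on each machine still read their own distributed.yaml):

# Minimal sketch: raise the comm timeouts programmatically for this client process.
import dask
from dask.distributed import Client

dask.config.set({
    "distributed.comm.timeouts.connect": "120s",  # time allowed to establish a connection
    "distributed.comm.timeouts.tcp": "120s",      # time an established connection may stay unresponsive
})
client = Client("tcp://172.16.1.113:8786")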

I hope this has been detailed enough for reproducibility. I know I'm running TPOT here, but I think it's a distributed issue re: connections timing out in the cluster.

As you can see, I'm fairly new to using distributed with TPOT and running it from the command line/PowerShell. The only guide I found shows how to run a local cluster (on one machine), not multiple machines, and doesn't cover connections timing out like this. I also referenced this page for these issues, but I've already changed the config file and am running all upgraded dependencies.

Would love some input on this!!! I want to crank up both machines to run my TPOT! Thanks!!!!

gjoseph92 commented 2 years ago

@windowshopr 2022.2.0 is a relatively old version, and a lot has changed since then. Can you upgrade to the latest version of dask and distributed and try again? If you're still having the same problem (wouldn't be surprised if you are), can you re-post the traceback you're seeing with the latest version?

windowshopr commented 2 years ago

It is? It's the latest stable version that gets installed when I run pip install -U dask distributed, or when I do:

git clone https://github.com/dask/dask.git
cd dask
python -m pip install -e .

How do I get a different version? What command should I run?

windowshopr commented 2 years ago

Cancel that, I see 2022.7 in the changelogs. Sorry. Will post in a bit!

windowshopr commented 2 years ago

GOOD NEWS!!!! 😄 haha the update helped. It was a bit of a process, so I'll reproduce all my steps here for others in a similar boat in the future.

I was running Python 3.7.9, which is why pip wasn't grabbing the latest version of distributed (newer dask/distributed releases no longer support Python 3.7). I upgraded both machines to Python 3.10.5, ran pip install -U dask[complete], and reinstalled TPOT and the other dependencies (after troubleshooting some errors) with:

pip install "setuptools<58"
pip install deap update_checker tqdm stopit xgboost "fsspec>=0.3.3"
pip install tpot
pip install dask_ml --user
pip install torch torchvision torchaudio
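
A quick sanity check (just a sketch) I can run on both machines to confirm the upgrade actually took and, once the scheduler and workers are back up, that they all agree on versions:

# Minimal sketch: print local versions, then ask the cluster to compare versions.
import sys
import dask
import distributed
from dask.distributed import Client

print(sys.version)
print("dask:", dask.__version__)
print("distributed:", distributed.__version__)

client = Client("tcp://172.16.1.113:8786")
client.get_versions(check=True)  # flags any version mismatch between scheduler, workers and client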

NOW, I followed the same steps as before, except this time (for testing) I did NOT set a worker port on either machine, so the steps look like this:

STEPS TO REPRODUCE

  1. Machine 1 - Open PowerShell window (as admin) and run the command dask-scheduler to start the scheduler.
  2. Machine 1 - Open second PowerShell window (as admin) and run the command dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4
  3. Machine 2 - Open PowerShell window (as admin) and run the command dask-worker tcp://172.16.1.113:8786 --nprocs 1 --nthreads 4

So effectively, Machine 1 connects to itself (the scheduler), and Machine 2 connects to Machine 1's scheduler. When inspecting http://localhost:8787/status, I see both workers connected and ready to go.

  4. Run the same minimal reproducible code from above on Machine 1 in another PowerShell window or your IDE of choice.

When I run the sample code in VS Code, both machines spin up like normal (albeit much faster this time, not much waiting 👍), and I see output from sklearn in the worker PowerShell windows, so the first TPOT generation is running on both machines, which is good. After a few minutes of the scheduler coordinating results, I see progress in TPOT's tqdm output:

[screenshot: TPOT tqdm progress bar showing pipelines completing]

WOOHOO!!!

So now I'm running a LAN cluster for my TPOT run!

Thanks @gjoseph92 for the direction! Closing issue now.