PaddlePaddle / PARL

A high-performance distributed training framework for Reinforcement Learning
https://parl.readthedocs.io/
Apache License 2.0

WRN No vacant cpu resources at the moment, will try 300 times later #432

Open rivellijp opened 4 years ago

rivellijp commented 4 years ago

I can't run any code with parl; I always get this error. This is how I start the cluster on my local machine (Windows 10):

xparl start --port 8010

# The Parl cluster is started at localhost:8010.
# A local worker with 8 CPUs is connected to the cluster.
# Starting the cluster monitor...
## If you want to check cluster status, please view:
http://192.168.1.99:61581
or call:
xparl status
## If you want to add more CPU resources, please call:
xparl connect --address 192.168.1.99:8010
## If you want to shutdown the cluster, please call:
xparl stop

And this is what I get with the status command:

xparl status

# Cluster localhost:8010 has 0 used cpus, 0 vacant cpus.
# If you want to check cluster status, please view: http://192.168.1.99:61721

zenghsh3 commented 4 years ago

Hi, thanks for your feedback. Can you provide more environment information?

rivellijp commented 4 years ago

Python 3.7.9, parl==1.3.2, running at the command line.

zenghsh3 commented 4 years ago

Hi, I cannot reproduce the error in the same environment (win10, python3.7.9 and parl==1.3.2). It looks like the worker fails to start normally. After running xparl start --port 8010, can you run the command xparl connect --address 192.168.1.99:8010 and tell us what error information it prints?

rivellijp commented 4 years ago

I thought it was something about win10. Since you couldn't reproduce the error, I cleaned up everything and reinstalled the same versions of python and parl, and now it's working. Thanks!

# Cluster localhost:8010 has 0 used cpus, 8 vacant cpus.

TomorrowIsAnOtherDay commented 4 years ago

Glad to hear that. Feel free to reopen the issue if you have other problems:)

rivellijp commented 4 years ago

I have the issue again, but now I have narrowed it down a little more. With a clean install of python + parl only, I can start, get status and stop many times with no issue:

# Cluster localhost:8010 has 0 used cpus, 8 vacant cpus.

But then, after installing pytorch (tried 1.6.0 and 1.7.0):

# Cluster localhost:8010 has 0 used cpus, 0 vacant cpus.

After uninstalling pytorch, parl works again:

# Cluster localhost:8010 has 0 used cpus, 8 vacant cpus.

Somehow pytorch is messing up parl. Any ideas?
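
The check repeated in this comment (reading the vacant CPU count from the xparl status summary after each install step) could be scripted roughly as below. This is only a sketch: the regular expression is inferred from the status lines quoted in this thread, and the name parse_status is made up for illustration; it is not part of parl.

```python
import re

# Pattern inferred from the summary line quoted in this thread (an assumption,
# not a documented output format).
STATUS_RE = re.compile(r"has (\d+) used cpus, (\d+) vacant cpus")


def parse_status(output):
    """Return (used, vacant) CPU counts from status text, or None if no summary line is found."""
    match = STATUS_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# The two summary lines reported above, with and without pytorch installed.
broken = "# Cluster localhost:8010 has 0 used cpus, 0 vacant cpus."
working = "# Cluster localhost:8010 has 0 used cpus, 8 vacant cpus."

print(parse_status(broken))
print(parse_status(working))
```

A vacant count of 0 right after starting the cluster, as in the first line, means no worker CPUs were registered at all.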

zenghsh3 commented 4 years ago

Hi, I still cannot reproduce the error (I installed torch==1.7.0). Could you run the command xparl connect --address 192.168.1.99:8010 and tell us what happens?

R-Ceph commented 3 years ago

Hi, I ran into the same problem when running the alphago project in benchmark. Environment: Python 3.7.9, parl==1.3.2, torch==1.7.0 (tried both the CPU and GPU versions), running at the command line on Ubuntu 18.04.

xparl status

[09-10 15:53:24 MainThread @logger.py:224] Argv: /home/hxu/anaconda3/envs/parl/bin/xparl connect --address 192.168.70.105:8010
/home/hxu/anaconda3/envs/parl/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)
/home/hxu/anaconda3/envs/parl/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)

(parl) hxu@hxu:~/netease/PARL/benchmark/torch/AlphaZero$ xparl connect --address 192.168.70.105:8010

[09-10 15:53:24 MainThread @logger.py:224] Argv: /home/hxu/anaconda3/envs/parl/bin/xparl connect --address 192.168.70.105:8010
/home/hxu/anaconda3/envs/parl/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)
/home/hxu/anaconda3/envs/parl/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)

python main.py # in AlphaGo Dirs

[09-10 15:53:34 MainThread @remote_decorator.py:178] WRN No vacant cpu resources at the moment, will try 300 times later.
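
The "numpy.ufunc size changed" RuntimeWarning in the logs above usually means that some compiled extension was built against a different NumPy version than the one installed. Whether it is related to the missing CPUs is not established in this thread. As a rough way to see which import emits it, one could record the warnings raised while importing each package. The helper name import_and_collect_warnings is made up for this sketch, and it uses only the standard library.

```python
import importlib
import warnings


def import_and_collect_warnings(module_name):
    """Import a module and return the warnings emitted during that import.

    A module that is already imported is served from the cache and emits
    nothing, so this is best run in a fresh interpreter.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        importlib.import_module(module_name)
    return [f"{w.category.__name__}: {w.message}" for w in caught]
```

For example, calling import_and_collect_warnings("numpy"), then the same for "torch" and "parl", each in a new Python session, would show which package triggers the warning in a given environment.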