adypd97 opened this issue 3 years ago (status: Open)
Hi @adypd97, thank you for your interest in HandyRL!
First of all, after the training server is launched, you need to run the workers on the worker VMs: python main.py --worker
(you should write the server address in the worker config, i.e. worker_args). This command connects the workers to the server. After the server detects the worker connections, the learning process starts.
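For reference, the worker-side config might look roughly like this (a sketch based on the worker_args keys visible in the output later in this thread; the IP address is a placeholder, not a real value from the thread):

```yaml
# config.yaml on each worker VM (sketch; replace the placeholder address)
worker_args:
    server_address: '203.0.113.10'  # public IP of the learner VM (placeholder)
    num_parallel: 32
```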
We previously illustrated an overview of the distributed architecture for the Google Research Football competition. I hope this helps you.
Thanks
Hi @ikki407!
Thanks for the link to the documentation! Very helpful!
To the main issue: yes, I ran 2 worker VMs following the steps you mention (I also entered the public IP of the server VM (the learner) in the worker_args parameter for both workers). Following that, I got the OUTPUT mentioned in my initial comment. It seems the learner is not able to detect the workers.
As further evidence, I added a simple print statement to ./handyrl/train.py, in the following function (starting at line 404):
def run(self):
    print('waiting training')
    while not self.shutdown_flag:
        if len(self.episodes) < self.args['minimum_episodes']:
            print('here')  # <<< the added print statement
            time.sleep(1)
            continue
        if self.steps == 0:
            self.batcher.run()
            print('started training')
        model = self.train()
        self.report_update(model, self.steps)
    print('finished training')
And in the output I get the following:

OUTPUT:
xyz@vm1:~/HandyRL$ python3 main.py --train-server
{'env_args': {'env': 'HungryGeese'}, 'train_args': {'turn_based_training': False, 'observation': False, 'gamma': 0.8, 'forward_steps': 32, 'compress_steps': 4, 'entropy_regularization': 0.002, 'entropy_regularization_decay': 0.3, 'update_episodes': 500, 'batch_size': 400, 'minimum_episodes': 1000, 'maximum_episodes': 200000, 'epochs': -1, 'num_batchers': 7, 'eval_rate': 0.1, 'worker': {'num_parallel': 32}, 'lambda': 0.7, 'max_self_play_epoch': 1000, 'policy_target': 'TD', 'value_target': 'TD', 'eval': {'opponent': ['modelbase'], 'weights_path': 'None'}, 'seed': 0, 'restart_epoch': 0}, 'worker_args': {'server_address': '
', 'num_parallel': 32}}
Loading environment football failed: No module named 'gfootball'
started batcher 0
started batcher 1
started batcher 2
started batcher 3
started batcher 4
started batcher 5
waiting training
started entry server 9999
started batcher 6
started worker server 9998
started server
here
here
here...
I hope you find this helpful in assisting me. In any case thanks once again!
From your outputs, it seems that the server is not connecting to the workers.
Next steps to debug...
What does the worker process/VM output look like? If the workers are still running without any errors, there may be a problem I haven't seen before.
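As one concrete debugging step (my suggestion, not something from HandyRL itself), it may be worth verifying from a worker VM that the learner's ports are reachable at all: the server output above shows an entry server on port 9999 and a worker server on port 9998, and on GCP an ingress firewall rule is typically required before external traffic can reach non-default ports. A minimal TCP reachability check:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example usage from a worker VM (placeholder address, not from this thread):
# for port in (9998, 9999):
#     print(port, can_connect('203.0.113.10', port))
```

If this returns False for either port from a worker VM, the problem is network reachability (firewall or wrong address) rather than HandyRL itself.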
Hello HandyRL Team!
First off, thanks for making such a useful repository for RL! I love it!
I am trying to understand how the distributed architecture of HandyRL works, but due to the lack of documentation it has so far been difficult to understand how it's implemented.
I'll give an example (following the Large Scale Training document in the repo):
I have 3 VMs running on GCP (1 as the server (the learner) and 2 others as workers). In the config.yaml file I entered the external IP of the learner (the document says it's valid to enter the external IP too) in the worker_args parameter for both workers (as per the instructions in the document) and tried to run it. However, I don't see anything happen. In the following output the server appears to just keep sleeping and does nothing.

OUTPUT:
I was hoping you could provide some guidance as to how I can proceed. In any case, documentation or a brief but complete background on the distributed architecture would also be appreciated, so I can debug the problem on my own.
Thank you!