-
## Autoscaling Trainer job on PaddleCloud
### Background
A Paddle training job contains several trainer instances (Kubernetes Job), several parameter server instances (Kubernetes ReplicaSet) and a …
-
When I submit a training task, the job pods always crash with `No such file or directory`. I then found that the file paths do not match.
![image](https://user-images.githubusercontent.com/3349368…
-
1. k8s-aws/README.md refers to k8s/*.{sh,py,yaml}, but doesn't have links.
1. /home/admin/efs ==> /home/core/efs
1. drinkcode/paddle:k8s-job ==> paddle-dev/paddle:k8s
1. /home/jobpath ==> /efs
1. D…
-
When I try to convert a Paddle trainer_config.conf to a binary conf, I get this error:
```
Traceback (most recent call last):
File "/home/zc04/paddle/paddle_local_env/python27-gcc482/lib/python2.…
-
The Paddle trainers are scheduled by a Kubernetes Job. When any Pod fails, Kubernetes starts a new Pod, so if the uploaded `train.py` exits with a non-zero code, there will be more and more Pods with a…
-
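The current paddle-cloud website exits immediately if it cannot connect to the backend MySQL, and the pod is not restarted.
1. Add a retry mechanism.
2. Split paddle-cloud and MySQL into two services.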
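
A minimal sketch of what such a retry mechanism could look like, assuming exponential backoff is acceptable. The `retry` helper and its `connect` argument are hypothetical names for illustration only, not part of paddle-cloud; `connect` stands for any zero-argument callable that opens the MySQL connection and raises an exception on failure.

```python
import time


def retry(connect, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    `connect` is a hypothetical zero-argument callable that opens the
    database connection and raises an exception when it cannot connect.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception:
            if attempt == attempts:
                # Out of attempts: re-raise so the process exits non-zero
                # and the failure stays visible to the orchestrator.
                raise
            sleep(delay)
            # Exponential backoff, capped at max_delay seconds.
            delay = min(delay * 2, max_delay)
```

With a wrapper like this the website would survive a short MySQL outage at startup, while a permanent failure would still surface as an error after the last attempt instead of being swallowed.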
-
I think the third trainer got started because the previous trainer crashed. But it should probably reuse the ID of the crashed trainer.
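
One way to get that behaviour, sketched under the assumption that trainer IDs are small integers handed out by a single coordinator. `TrainerIDPool` is a hypothetical helper for illustration, not an existing PaddlePaddle class.

```python
import heapq


class TrainerIDPool:
    """Hands out the smallest free trainer ID (hypothetical helper)."""

    def __init__(self):
        self._next = 0       # next ID that has never been used
        self._released = []  # min-heap of IDs freed by crashed trainers

    def acquire(self):
        # Prefer an ID freed by a crashed trainer over a brand-new one.
        if self._released:
            return heapq.heappop(self._released)
        trainer_id = self._next
        self._next += 1
        return trainer_id

    def release(self, trainer_id):
        # Called when a trainer is known to have crashed or exited.
        heapq.heappush(self._released, trainer_id)
```

In this sketch, if trainers 0 and 1 are running and trainer 1 crashes, calling `release(1)` followed by `acquire()` gives the replacement trainer ID 1 rather than 2.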
-
Following our recent discussion on PaddlePaddle Cloud, we are going to have more and more Go code. So where should we put the code? Some possibilities:
1. To create a directory `paddle/go` in the `paddl…
-
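A Kubernetes Pod cannot directly use the PaddlePaddle production Docker image, because several scripts do not exist in PaddlePaddle: https://github.com/PaddlePaddle/Job/tree/develop/tools. I suggest creating a `./docker` directory to build the image that runs on Cloud, which could be named `paddlecloud-job:0.10.0`…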
-
submit.sh
```paddlecloud submit xxxxx ../my_package```
I put `submit.sh` into the `my_package` directory, and the file upload job failed.
After I moved `submit.sh` to the parent directory, files can be up…