Stopping training gives the error 'training stopped with status error', and the model is not saved in the local profile

ARCC-RACE / deepracer-for-dummies

A quick way to get up and running with a local DeepRacer training environment


Issue #35

Closed: KushagraMakharia closed this issue 5 years ago

KushagraMakharia commented 5 years ago

Can anyone please tell me how I can save the model that I have left training? Is training supposed to end on its own, or are we supposed to stop it? If the former, how do I configure it?

Michael-Equi commented 5 years ago

You should be able to stop training and then save the model using the save profile option in the menu bar. Once you enter a unique name, it saves the model, checkpoints, and the various settings you had at the time. To reload that model, select load model and type in the same name that you entered. To use it, make sure that use pretrained is on (read the log, and use the pretrained button if necessary).

Michael-Equi commented 5 years ago

An error when stopping training could mean that training did not start correctly. Try running `docker ps`; you should see 4 containers running.
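The container check can be scripted. The following is a minimal sketch, not part of the repository: the name fragments in `EXPECTED_FRAGMENTS` are assumptions based on the four components named in this thread (robomaker, minio, rl coach, sagemaker), and the function names are hypothetical. Adjust the fragments to whatever your own `docker ps` output shows.

```python
import subprocess

# Assumed name fragments for the four containers. The thread only names the
# components, so the exact image or container names on your machine may differ.
EXPECTED_FRAGMENTS = ("robomaker", "minio", "rl_coach", "sagemaker")


def missing_containers(ps_output, expected=EXPECTED_FRAGMENTS):
    """Return the expected fragments that match no line of the docker ps output."""
    lines = [line.lower() for line in ps_output.splitlines() if line.strip()]
    return [frag for frag in expected if not any(frag in line for line in lines)]


def running_containers():
    """Return the image and name of each running container, one per line."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Image}} {{.Names}}"],
        capture_output=True,
        text=True,
        check=True,  # raise if the docker CLI reports an error
    )
    return result.stdout
```

Calling `missing_containers(running_containers())` returns an empty list when every expected fragment matches a running container, and otherwise lists the components that did not start.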

KushagraMakharia commented 5 years ago

When I run `docker ps`, I see only two containers: robomaker and minio.

Michael-Equi commented 5 years ago

It looks like rl coach and sagemaker are not starting. Can you try running the aschu/rl_coach image, if it exists (run `docker images` to check)? Check the output; my guess is that something is failing before it can spawn sagemaker.
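Checking for the image can be done the same way. This is a small sketch with hypothetical helper names; the only name taken from the thread is the repository `aschu/rl_coach`.

```python
import subprocess


def list_repositories():
    """Return the repository name of every local image, one per line."""
    result = subprocess.run(
        ["docker", "images", "--format", "{{.Repository}}"],
        capture_output=True,
        text=True,
        check=True,  # raise if the docker CLI reports an error
    )
    return result.stdout


def image_present(images_output, repository):
    """Return True if the repository appears in the docker images output."""
    repos = {line.strip() for line in images_output.splitlines() if line.strip()}
    return repository in repos
```

For example, `image_present(list_repositories(), "aschu/rl_coach")` tells you whether the image has been pulled or built locally.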

KushagraMakharia commented 5 years ago

The image aschu/rl_coach is present, and running it gives this error: EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:9000/bucket/rl-deepracer-sagemaker/source/sourcedir.tar.gz"

KushagraMakharia commented 5 years ago

The link seems to be working, but it redirects to a login page: http://127.0.0.1:9000/minio/login
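A URL that opens in a browser on the host can still be unreachable from inside a container, because 127.0.0.1 there refers to the container's own loopback interface unless the container uses host networking. The thread does not confirm that this was the cause here. The sketch below only tests whether a TCP connection to the endpoint's host and port succeeds, so it is most informative when run from the same place as the failing code. The function name is hypothetical.

```python
import socket
from urllib.parse import urlparse


def endpoint_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    # Fall back to the scheme's default port when the URL does not give one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `endpoint_reachable("http://127.0.0.1:9000")` checks the address from the error message above. A `False` result means nothing accepted the connection from where the script ran; a `True` result says nothing about authentication or whether the bucket exists.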

KushagraMakharia commented 5 years ago

After making a few changes in rl_deepracer_coach_robomaker.py it finally started! Thanks for the guidance Michael, really appreciated! :)