autorope / donkeycar

Open source hardware and software platform to build a small scale self driving car.
http://www.donkeycar.com
MIT License

train done -Always no result #1104

Closed Ahrovan closed 1 year ago

Ahrovan commented 1 year ago

using donkey v4.4.dev6 ...

I have not been able to train and run a model for several days. I suspected my data, so I finally repeated the training with your data.

[Screenshots: sim-tub2-min, sim-tub-min]

Ezward commented 1 year ago

@Ahrovan Given the 'user/steering'->'user/angle' bug in Issue https://github.com/autorope/donkeycar/issues/1103 that was just fixed, can you check out the latest main and re-create your mycar folder (or whatever folder you use for the donkeycar application)? From the root of your donkeycar repo:

git checkout main
donkey createcar --overwrite --path=<path-to-your-mycar>

and give it another try.

Ezward commented 1 year ago

PS: I gathered data, trained and ran autopilot using the new code. Remember that the steering data is labelled as 'user/angle' so if you changed that to 'user/steering' then you should change it back.
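One way to confirm which steering label a tub actually contains is to count the keys in its records. The sketch below is a hedged illustration, not part of donkeycar: it assumes the tub stores one JSON record per line in files ending in `.catalog`, and the function name `count_record_keys` is made up for this example.

```python
import json
from collections import Counter
from pathlib import Path


def count_record_keys(tub_dir):
    """Count how often each key appears across all records in a tub.

    Assumption: records are stored one JSON object per line in files
    whose names end in ".catalog" inside tub_dir.
    """
    counts = Counter()
    for catalog in sorted(Path(tub_dir).glob("*.catalog")):
        with open(catalog, "r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                counts.update(json.loads(line).keys())
    return counts


if __name__ == "__main__":
    keys = count_record_keys("./sim_warehouse_manual")
    # Training expects 'user/angle'; a non-zero 'user/steering' count
    # would suggest the records were renamed.
    print("user/angle:", keys["user/angle"])
    print("user/steering:", keys["user/steering"])
```

If 'user/steering' shows a non-zero count and 'user/angle' shows zero, the records were renamed and should be restored before retraining.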

Ahrovan commented 1 year ago

I extracted sim_warehouse_manual.tar.gz again. Not solved; it still fails.

[Screenshot: sim-tub]

Ahrovan commented 1 year ago

@Ezward We reinstalled the whole project from scratch to make sure there is no problem.

TensorFlow version 2.3.1 | donkey v4.4.dev6 | windows | conda 4.10.3 | Python 3.7.16 :: Anaconda, Inc.

myCar>donkey train --tub ./sim_warehouse_manual --model ./models/mypilot.h5

INFO:donkeycar.parts.interpreter:Convert model .........myNew\models\mypilot.h5 to TFLite ......myNew\models\mypilot.tflite

[Screenshot: final-min]

The problem is very simple, but the failure repeats every time. Please look into it as soon as possible.

Ezward commented 1 year ago

@Ahrovan Are you training a 'linear' model but running it as a 'categorical' model? That is what I see in the screen captures above, and it would be a problem. If you want to run a categorical model then you need to explicitly train it as such. If you want a linear model then be sure to choose linear when you run the autopilot, not categorical.

Ahrovan commented 1 year ago

@Ezward No, both are linear. The problem is only related to the recording. I know, but it needs to be the same.

Ahrovan commented 1 year ago

@Ezward @DocGarbanzo

Python 3.7.16 | donkey v4.4.dev6 | tensorflow '2.2.0' | windows | linear model | Created New Conda Env today

[Screenshots: 1, mypilot]

Ezward commented 1 year ago

@Ahrovan That does look wrong. Can you try running tubplot with your model and data to see what it outputs? See https://docs.donkeycar.com/utility/donkey/#plot-predictions

Ezward commented 1 year ago

@Ahrovan Please also check your data again; I believe you renamed 'user/angle' to 'user/steering'; if you did you need to restore it to 'user/angle' and retrain.

Ahrovan commented 1 year ago

I extracted sim_warehouse_manual.tar.gz without making any changes.
[Screenshot: image] 'user/angle' exists. @Ezward Today I checked with branch 4.4.0 in a new conda env. It failed with the same result.

Ahrovan commented 1 year ago

Here is the tubplot output now:

[Screenshot: Figure_1]

Ahrovan commented 1 year ago

@Ezward The problem is related to training with the GPU. If training is done on the CPU, the problem goes away.
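For anyone who wants to reproduce the CPU-versus-GPU comparison, a common approach is to hide the GPU from CUDA for a single training run by setting the `CUDA_VISIBLE_DEVICES` environment variable to `-1`. The sketch below is a hedged illustration only: the helper names `cpu_only_env` and `train_on_cpu` are made up for this example, and the command is the `donkey train` invocation quoted earlier in this thread.

```python
import os
import subprocess


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # "-1" makes CUDA report no visible devices, so TensorFlow uses the CPU.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


def train_on_cpu(tub="./sim_warehouse_manual", model="./models/mypilot.h5"):
    """Run the training command from this thread with the GPU hidden.

    Assumes the `donkey` command is on PATH and that it is run from the
    car folder, as in the original comment.
    """
    cmd = ["donkey", "train", "--tub", tub, "--model", model]
    return subprocess.run(cmd, env=cpu_only_env(), check=False)
```

Calling `train_on_cpu()` from the car folder and then training again without the override gives two models that can be compared with tubplot.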

Ezward commented 1 year ago

@DocGarbanzo can you take a look at this?

Ezward commented 1 year ago

@Ahrovan Can you try to train with an earlier version of donkeycar so we can see if it is a recently introduced bug? Maybe check out tag 4.3.6.2 or 4.4.0.

Ahrovan commented 1 year ago

Report - donkey v4.4.dev6 worked on Jetson NX - Trained with GPU

Ezward commented 1 year ago

to summarize

Is that correct?

On the failing system can you train using the command line and copy the console output here?

Ahrovan commented 1 year ago

This shows that the model has no problem. Only the UI display has a problem. I have tested this model on the robot, and the robot runs without problems.

[Screenshot: ui-with gpu + without gui-min]

Ezward commented 1 year ago

Except that you also did a tubplot and it also showed constant throttle and steering, so it doesn't seem like it is just a UI issue.

Ahrovan commented 1 year ago

[Attachments: 3234, log.txt]

Ezward commented 1 year ago

I don't understand what is happening in the prior comment; can you add an explanation?

Ezward commented 1 year ago

Maybe you are saying one of the modes is always getting that bad record NaN error? And a white screen. Is that the same thing that was going on before? It looks different.

Ahrovan commented 1 year ago

Yes, the white screen is related to the model trained with the GPU; it always gets the bad record NaN error. As for the rest, I don't know.
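The two symptoms discussed so far (constant steering and throttle in tubplot, and NaN errors in the UI) can be told apart with a small check on a list of predictions. This is a minimal sketch under stated assumptions: `diagnose_predictions` is a hypothetical helper, and it expects predictions that have already been collected as `(angle, throttle)` pairs; it does not call donkeycar.

```python
import math


def diagnose_predictions(predictions, tolerance=1e-6):
    """Flag NaN values and constant outputs in (angle, throttle) pairs."""
    angles = [float(angle) for angle, _ in predictions]
    throttles = [float(throttle) for _, throttle in predictions]

    def is_constant(series):
        # Ignore NaN values; a series counts as constant when at least two
        # finite values exist and they all lie within the tolerance.
        finite = [v for v in series if not math.isnan(v)]
        return len(finite) > 1 and max(finite) - min(finite) <= tolerance

    return {
        "has_nan": any(math.isnan(v) for v in angles + throttles),
        "constant_angle": is_constant(angles),
        "constant_throttle": is_constant(throttles),
    }


if __name__ == "__main__":
    print(diagnose_predictions([(0.1, 0.5), (0.1, 0.5), (0.1, 0.5)]))
```

A result with `has_nan` set points at the model output itself rather than at the UI, while constant values without NaN match what the tubplot showed.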

Ezward commented 1 year ago

@DocGarbanzo have you seen this issue before in the DonkeyUI? Do you know what this might be?

Ahrovan commented 1 year ago

Problem solved with cuDNN 8.1, CUDA 11.2, and TensorFlow 2.5 (CUDA >= 11). Now I am working on compatibility issues on the host and Jetson platforms. @Ezward
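Since the fix came down to matching TensorFlow with the CUDA and cuDNN versions, it can help to print what the installed TensorFlow build expects. The following is a rough sketch, assuming TensorFlow 2.3 or newer (where `tf.sysconfig.get_build_info()` is available); the function name `tf_gpu_report` is made up for this example, and the build-info keys may be absent on CPU-only builds.

```python
def tf_gpu_report():
    """Summarize the TensorFlow version, its CUDA/cuDNN build, and visible GPUs.

    Returns None when TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # get_build_info() is assumed to exist (TensorFlow 2.3+); fall back to
    # an empty dict on older versions.
    get_build_info = getattr(tf.sysconfig, "get_build_info", None)
    build = dict(get_build_info()) if get_build_info else {}

    return {
        "tensorflow": tf.__version__,
        "cuda_build": build.get("cuda_version"),    # None on CPU-only builds
        "cudnn_build": build.get("cudnn_version"),  # None on CPU-only builds
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    print(tf_gpu_report())
```

Comparing this output on the host and on the Jetson shows whether both environments use a TensorFlow build compiled against the CUDA and cuDNN versions that are actually installed.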

Ezward commented 1 year ago

Great. I am glad you have a solution. Closing this issue.