autorope / donkeycar

Open source hardware and software platform to build a small scale self driving car.
http://www.donkeycar.com
MIT License

Training command Errors (TensorFlow/Python Incompatibility)? #1181

Closed · generic-beat-detector closed this issue 3 weeks ago

generic-beat-detector commented 1 month ago

Hello!

FWIW, this is truly a wonderful project!

Unfortunately, with my limited skills I can't even seem to get `donkey train --tub data` to work on Ubuntu 22.04 x86-64. The command also fails on an RPi 4B running bookworm, but somehow works on the robocarstore RPi 4B pre-built image @v5.0-dev3?

For the PC installs, I followed (variations of) the instructions here, there, and there.

I'm using the same exact dataset in all scenarios:

$ ls 
calibrate.py  config.py  data  logs  manage.py  models  myconfig.py  train.py

$ ls data/
catalog_0.catalog  catalog_0.catalog_manifest  images  manifest.json

$ ls data/images/
0_cam_image_array_.jpg   15_cam_image_array_.jpg  5_cam_image_array_.jpg
10_cam_image_array_.jpg  16_cam_image_array_.jpg  6_cam_image_array_.jpg
11_cam_image_array_.jpg  1_cam_image_array_.jpg   7_cam_image_array_.jpg
12_cam_image_array_.jpg  2_cam_image_array_.jpg   8_cam_image_array_.jpg
13_cam_image_array_.jpg  3_cam_image_array_.jpg   9_cam_image_array_.jpg
14_cam_image_array_.jpg  4_cam_image_array_.jpg
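
For reference, here is a small, hypothetical sanity check (my own throwaway script, written on the assumption that each line of a catalog file is one JSON record, which is what the files look like on disk) that counts images and records and shows which keys a record carries:

import json
from pathlib import Path

tub = Path("data")

# Count the images on disk.
images = sorted((tub / "images").glob("*.jpg"))
print(f"{len(images)} images")

# Count the catalog records and show which keys the first one carries.
records = []
for catalog in sorted(tub.glob("catalog_*.catalog")):
    for line in catalog.read_text().splitlines():
        if line.strip():
            records.append(json.loads(line))
print(f"{len(records)} records")
if records:
    print("record keys:", sorted(records[0].keys()))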

Ubuntu 22.04, x86-64 (w/ RTX 3070)

$  donkey --version
using donkey v5.0.0 ...

$ python --version
Python 3.11.2

$ donkey --version
using donkey v5.1.0 ...

$ donkey train --tub data
[...]
INFO:donkeycar.pipeline.types:Loading tubs from paths ['data']
INFO:donkeycar.pipeline.training:Records # Training 13
INFO:donkeycar.pipeline.training:Records # Validation 4
INFO:donkeycar.parts.tub_v2:Closing tub data
[...]
INFO:donkeycar.parts.keras:////////// Starting training //////////
Epoch 1/100
2024-05-29 01:17:03.203388: W tensorflow/core/framework/op_kernel.cc:1827] INVALID_ARGUMENT: ValueError: Key image is not in available keys.
Traceback (most recent call last):

File "/home/pi/projects/donkeycar/env/lib/python3.11/site-packages/tensorflow/python/ops/script_ops.py", line 270, in call ret = func(*args) ^^^^^^^^^^^

* However, on the robocarstore pre-built-image [@v5.0-dev3](https://github.com/robocarstore/donkeycar-images), I can run `donkey train --tub data` on the RPi 4B without any problems (Python 3.9.2)

$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye

$ python --version
Python 3.9.2

$ donkey --version
using donkey v5.0.dev3 ...

$ donkey train --tub data
[...]
INFO:donkeycar.pipeline.types:Loading tubs from paths ['data']
INFO:donkeycar.pipeline.training:Records # Training 13
INFO:donkeycar.pipeline.training:Records # Validation 4
INFO:donkeycar.parts.tub_v2:Closing tub data
INFO:donkeycar.parts.image_transformations:Creating ImageTransformations []
INFO:donkeycar.parts.image_transformations:Creating ImageTransformations []
INFO:donkeycar.parts.image_transformations:Creating ImageTransformations []
INFO:donkeycar.parts.image_transformations:Creating ImageTransformations []
INFO:donkeycar.pipeline.training:Train with image caching: True
INFO:donkeycar.parts.keras:////////// Starting training //////////
Epoch 1/100
1/1 [==============================] - ETA: 0s - loss: 0.4888 - n_outputs0_loss: 0.0301 - n_outputs1_loss: 0.4587
Epoch 1: val_loss improved from inf to 0.14395, saving model to /home/pi/mycar/models/pilot_24-05-29_0.savedmodel
[...]
1/1 [==============================] - 16s 16s/step - loss: 0.4888 - n_outputs0_loss: 0.0301 - n_outputs1_loss: 0.4587 - val_loss: 0.1440 - val_n_outputs0_loss: 0.0115 - val_n_outputs1_loss: 0.1324
Epoch 2/100
1/1 [==============================] - ETA: 0s - loss: 0.2921 - n_outputs0_loss: 0.0200 - n_outputs1_loss: 0.2721
Epoch 2: val_loss did not improve from 0.14395
1/1 [==============================] - 6s 6s/step - loss: 0.2921 - n_outputs0_loss: 0.0200 - n_outputs1_loss: 0.2721 - val_loss: 0.2925 - val_n_outputs0_loss: 0.0241 - val_n_outputs1_loss: 0.2683



What's going on here?

Regards.
DocGarbanzo commented 1 month ago

Can you please install the latest release, 5.1.0? Also, you have far too little data: try with around 1000 records rather than 14. My suspicion is that there is a problem when there isn't even a single full-sized batch in either the training or the validation set.
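
For illustration, the batch arithmetic looks roughly like this (treating BATCH_SIZE = 128 and TRAIN_TEST_SPLIT = 0.8 as assumed template defaults, not guaranteed values):

def full_batches(n_records, split=0.8, batch_size=128):
    # Split the records the same way the trainer reports them, then
    # count how many complete batches each set can form.
    n_train = int(n_records * split)
    n_val = n_records - n_train
    return n_train // batch_size, n_val // batch_size

print(full_batches(17))    # (0, 0) -> no full batch in either set
print(full_batches(1000))  # (6, 1) -> at least one full batch in each set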

generic-beat-detector commented 1 month ago

@DocGarbanzo

Yes sir, following your recommendation to install donkeycar v5.1.0 (which requires Python >=3.11 and <=3.12), the training -- apparently -- succeeds, even with my 17-image test dataset:

... a few errors, but the training process completed (early, due to "no improvement in validation loss" -- my bogus dataset ;), and everything looks A-okay. I'll have to test with a real dataset of course, but at least the (compatibility) issues seem to have been fixed!
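
Side note, in case anyone wants training to cut out earlier or later: the early-stop behaviour should be tunable from myconfig.py. A minimal sketch, assuming the stock template key names:

# myconfig.py overrides (key names assumed from the template config)
USE_EARLY_STOP = True      # stop once val_loss stops improving
EARLY_STOP_PATIENCE = 5    # epochs without improvement before stopping
MIN_DELTA = 0.0005         # smallest val_loss change counted as an improvement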

Thank you sir.

generic-beat-detector commented 3 weeks ago

@DocGarbanzo,

Hi! So far, so good. I've just trained a model on a

$ ls -l data/images/ | wc -l
13960

image dataset, and it's quite lovely. The autopilot has completed several runs like a champ!

I could swear I previously ran into an issue (seemingly Python v3.11 related) with donkey ui (a "recursion depth exceeded" type error), but I mysteriously cannot reproduce it. In any case, it is not a priority right now. I will let you know of any problems in another thread. Thanks once again.

DocGarbanzo commented 3 weeks ago

Ok, great. Thanks for confirming. The TF key error is still a bit concerning; we'll keep an eye on it in case it ever shows up again.