generic-beat-detector commented 6 months ago

Hello!

FWIW, this is truly a wonderful project!

Unfortunately, my limited skills can't even ### seem to get donkey train --tub data to work on Ubuntu 22.04 x86-64. The command also fails on RPi 4B bookworm but somehow works on the robocarstore RPi 4B pre-built-image @v5.0-dev3?

For the PC installs, I followed the (variations of) instructions here, there, and there

I'm using the same exact dataset in all scenarios:

$ ls 
calibrate.py  config.py  data  logs  manage.py  models  myconfig.py  train.py

$ ls data/
catalog_0.catalog  catalog_0.catalog_manifest  images  manifest.json

$ ls data/images/
0_cam_image_array_.jpg   15_cam_image_array_.jpg  5_cam_image_array_.jpg
10_cam_image_array_.jpg  16_cam_image_array_.jpg  6_cam_image_array_.jpg
11_cam_image_array_.jpg  1_cam_image_array_.jpg   7_cam_image_array_.jpg
12_cam_image_array_.jpg  2_cam_image_array_.jpg   8_cam_image_array_.jpg
13_cam_image_array_.jpg  3_cam_image_array_.jpg   9_cam_image_array_.jpg
14_cam_image_array_.jpg  4_cam_image_array_.jpg

Ubuntu 22.04, x86-64 (w/ RTX 3070)

$  donkey --version
using donkey v5.0.0 ...

Python 3.10.12

$ donkey train --tub data
[...]
INFO:donkeycar.pipeline.training:Records # Training 13
INFO:donkeycar.pipeline.training:Records # Validation 4
[...]
INFO:donkeycar.parts.keras:////////// Starting training //////////
Epoch 1/100
2024-05-28 23:08:26.725746: W tensorflow/core/framework/op_kernel.cc:1733] INVALID_ARGUMENT: ValueError: Key image is not in available keys.
Traceback (most recent call last):

File "~/miniconda3/envs/donkey/lib/python3.10/site-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
ret = func(*args)
[...]
ValueError: Key image is not in available keys.
[...]
 [[{{node PyFunc}}]]
 [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_1722]
2024-05-29 01:23:29.995776: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
 [[{{node PyFunc}}]]

Python-3.9.19

$ donkey train --tub data
[...]
INFO:donkeycar.pipeline.types:Loading tubs from paths ['data']
INFO:donkeycar.pipeline.training:Records # Training 13
INFO:donkeycar.pipeline.training:Records # Validation 4
INFO:donkeycar.parts.tub_v2:Closing tub data
[...]
INFO:donkeycar.parts.keras:////////// Starting training //////////
Epoch 1/100
2024-05-29 00:38:05.846618: W tensorflow/core/framework/op_kernel.cc:1733] INVALID_ARGUMENT: ValueError: Key image is not in available keys.
Traceback (most recent call last):

File "~/tmp-donkey/miniconda3/envs/donkey/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
ret = func(*args)
[...]
ValueError: Key image is not in available keys.
[...]
 [[{{node PyFunc}}]]
 [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_1722]
2024-05-29 01:23:29.995776: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
 [[{{node PyFunc}}]]

RPi 4B

Bookworm, Python 3.11.2


$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 12 (bookworm)
Release:    12
Codename:   bookworm

$ python --version Python 3.11.2

$ donkey --version using donkey v5.1.0 ...

$ donkey train --tub data [...] INFO:donkeycar.pipeline.types:Loading tubs from paths ['data'] INFO:donkeycar.pipeline.training:Records # Training 13 INFO:donkeycar.pipeline.training:Records # Validation 4 INFO:donkeycar.parts.tub_v2:Closing tub data [...] INFO:donkeycar.parts.keras:////////// Starting training ////////// Epoch 1/100 2024-05-29 01:17:03.203388: W tensorflow/core/framework/op_kernel.cc:1827] INVALID_ARGUMENT: ValueError: Key image is not in available keys. Traceback (most recent call last):

File "/home/pi/projects/donkeycar/env/lib/python3.11/site-packages/tensorflow/python/ops/script_ops.py", line 270, in call ret = func(*args) ^^^^^^^^^^^

* However, on the robocarstore pre-built-image [@v5.0-dev3](https://github.com/robocarstore/donkeycar-images), I can run `donkey train --tub data` on the RPi 4B without any problems (Python 3.9.2)

$ lsb_release -a No LSB modules are available. Distributor ID: Debian Description: Debian GNU/Linux 11 (bullseye) Release: 11 Codename: bullseye

$ python --version Python 3.9.2

$ donkey --version using donkey v5.0.dev3 ...

$ donkey train --tub data [...] INFO:donkeycar.pipeline.types:Loading tubs from paths ['data'] INFO:donkeycar.pipeline.training:Records # Training 13 INFO:donkeycar.pipeline.training:Records # Validation 4 INFO:donkeycar.parts.tub_v2:Closing tub data INFO:donkeycar.parts.image_transformations:Creating ImageTransformations [] INFO:donkeycar.parts.image_transformations:Creating ImageTransformations [] INFO:donkeycar.parts.image_transformations:Creating ImageTransformations [] INFO:donkeycar.parts.image_transformations:Creating ImageTransformations [] INFO:donkeycar.pipeline.training:Train with image caching: True INFO:donkeycar.parts.keras:////////// Starting training ////////// Epoch 1/100 1/1 [==============================] - ETA: 0s - loss: 0.4888 - n_outputs0_loss: 0.0301 - n_outputs1_loss: 0.4587 Epoch 1: val_loss improved from inf to 0.14395, saving model to /home/pi/mycar/models/pilot_24-05-29_0.savedmodel [...] 1/1 [==============================] - 16s 16s/step - loss: 0.4888 - n_outputs0_loss: 0.0301 - n_outputs1_loss: 0.4587 - val_loss: 0.1440 - val_n_outputs0_loss: 0.0115 - val_n_outputs1_loss: 0.1324 Epoch 2/100 1/1 [==============================] - ETA: 0s - loss: 0.2921 - n_outputs0_loss: 0.0200 - n_outputs1_loss: 0.2721 Epoch 2: val_loss did not improve from 0.14395 1/1 [==============================] - 6s 6s/step - loss: 0.2921 - n_outputs0_loss: 0.0200 - n_outputs1_loss: 0.2721 - val_loss: 0.2925 - val_n_outputs0_loss: 0.0241 - val_n_outputs1_loss: 0.2683



What's going on here?

Regards.

DocGarbanzo commented 5 months ago

Can you please install the latest release, 5.1.0? Also, you have far too few data, try with around 1000 records and not 14. My suspicion is that there is a problem if you don't even have a single full sized batch in neither train nor validation set.

generic-beat-detector commented 5 months ago

@DocGarbanzo

Yes sir, going with your recommendation to install donkeycar v5.1.0 (which requires python >=3.11 and <=3.12), the training -- apparently -- succeeds (even with my 17 image test dataset):

Installed miniconda virtual environment

$ wget https://repo.anaconda.com/miniconda/Miniconda3-py39_23.3.1-0-Linux-x86_64.sh
$ bash ./Miniconda3-py39_23.3.1-0-Linux-x86_64.sh

$ eval "$(/home/USER/dev-donkey/miniconda3/bin/conda shell.bash hook)"
$ conda create -n donkey python=3.11
$ conda activate donkey
$ python --version
Python 3.11.9

Downloded Donkey Car 5.1.0 to $PWD

$ unzip donkeycar-5.1.0.zip
$ cd donkeycar-5.1.0/
$ pip install -e .[pc]
$ cd ..
$ donkey createcar --path $PWD/mycar
$ cd mycar/
$ ls
calibrate.py  config.py  data  logs  manage.py  models  myconfig.py  train.py

Copied the very same 17 image dataset to data, then

$ donkey train --tub data

using donkey v5.1.0 ...
[...]
INFO:donkeycar.pipeline.types:Loading tubs from paths ['data']
INFO:donkeycar.pipeline.training:Records # Training 13
INFO:donkeycar.pipeline.training:Records # Validation 4
[...]
INFO:donkeycar.parts.keras:////////// Starting training //////////
Epoch 1/100
2024-06-05 00:48:34.244384: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:961] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape inlinear/dropout/dropout/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
2024-06-05 00:48:34.517973: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:454] Loaded cuDNN version 8906
2024-06-05 00:48:35.751589: I external/local_xla/xla/service/service.cc:168] XLA service 0x7d6ab00145b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-06-05 00:48:35.751618: I external/local_xla/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3070, Compute Capability 8.6
2024-06-05 00:48:35.755479: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1717537715.808544  643092 device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
1/1 [==============================] - ETA: 0s - loss: 0.4434 - n_outputs0_loss: 0.0248 - n_outputs1_loss: 0.4187
Epoch 1: val_loss improved from inf to 0.46197, saving model to /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel
INFO:tensorflow:Assets written to: /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel/assets
1/1 [==============================] - 6s 6s/step - loss: 0.4434 - n_outputs0_loss: 0.0248 - n_outputs1_loss: 0.4187 - val_loss: 0.4620 - val_n_outputs0_loss: 0.0269 - val_n_outputs1_loss: 0.4351
Epoch 2/100
1/1 [==============================] - ETA: 0s - loss: 0.3613 - n_outputs0_loss: 0.0248 - n_outputs1_loss: 0.3364
Epoch 2: val_loss improved from 0.46197 to 0.31188, saving model to /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel
INFO:tensorflow:Assets written to: /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel/assets
1/1 [==============================] - 1s 794ms/step - loss: 0.3613 - n_outputs0_loss: 0.0248 - n_outputs1_loss: 0.3364 - val_loss: 0.3119 - val_n_outputs0_loss: 0.0225 - val_n_outputs1_loss: 0.2893
Epoch 3/100
1/1 [==============================] - ETA: 0s - loss: 0.2427 - n_outputs0_loss: 0.0228 - n_outputs1_loss: 0.22002024-06-05 00:48:40.575992: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: KeyError: 109
Traceback (most recent call last):

    File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
      ret = func(*args)
            ^^^^^^^^^^^

    File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
      return func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^

    File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/data/ops/from_generator_op.py", line 290, in finalize_py_func
      generator_state.iterator_completed(iterator_id)

    File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 870, in iterator_completed
      del self._iterators[self._normalize_id(iterator_id)]
          ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

KeyError: 109

2024-06-05 00:48:40.576057: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: UNKNOWN: KeyError: 109
Traceback (most recent call last):

    File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
      ret = func(*args)
            ^^^^^^^^^^^

    File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
      return func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^

    File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/data/ops/from_generator_op.py", line 290, in finalize_py_func
      generator_state.iterator_completed(iterator_id)

    File "/home/USER/dev-donkey/miniconda3/envs/donkey/lib/python3.11/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 870, in iterator_completed
      del self._iterators[self._normalize_id(iterator_id)]
          ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

KeyError: 109

     [[{{node PyFunc}}]]

Epoch 3: val_loss improved from 0.31188 to 0.08692, saving model to /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel
INFO:tensorflow:Assets written to: /home/USER/dev-donkey/mycar/models/pilot_24-06-05_0.savedmodel/assets
1/1 [==============================] - 1s 790ms/step - loss: 0.2427 - n_outputs0_loss: 0.0228 - n_outputs1_loss: 0.2200 - val_loss: 0.0869 - val_n_outputs0_loss: 0.0067 - val_n_outputs1_loss: 0.0802
Epoch 4/100
1/1 [==============================] - ETA: 0s - loss: 0.1607 - n_outputs0_loss: 0.0319 - n_outputs1_loss: 0.1288
Epoch 4: val_loss did not improve from 0.08692
1/1 [==============================] - 0s 145ms/step - loss: 0.1607 - n_outputs0_loss: 0.0319 - n_outputs1_loss: 0.1288 - val_loss: 0.0976 - val_n_outputs0_loss: 0.0024 - val_n_outputs1_loss: 0.0952
Epoch 5/100
1/1 [==============================] - ETA: 0s - loss: 0.1222 - n_outputs0_loss: 0.0187 - n_outputs1_loss: 0.1035
Epoch 5: val_loss did not improve from 0.08692
1/1 [==============================] - 0s 160ms/step - loss: 0.1222 - n_outputs0_loss: 0.0187 - n_outputs1_loss: 0.1035 - val_loss: 0.1443 - val_n_outputs0_loss: 0.0027 - val_n_outputs1_loss: 0.1415
Epoch 6/100
1/1 [==============================] - ETA: 0s - loss: 0.1274 - n_outputs0_loss: 0.0132 - n_outputs1_loss: 0.1142
Epoch 6: val_loss did not improve from 0.08692
1/1 [==============================] - 0s 140ms/step - loss: 0.1274 - n_outputs0_loss: 0.0132 - n_outputs1_loss: 0.1142 - val_loss: 0.1452 - val_n_outputs0_loss: 0.0022 - val_n_outputs1_loss: 0.1431
Epoch 7/100
1/1 [==============================] - ETA: 0s - loss: 0.1173 - n_outputs0_loss: 0.0111 - n_outputs1_loss: 0.1062
Epoch 7: val_loss did not improve from 0.08692
1/1 [==============================] - 0s 139ms/step - loss: 0.1173 - n_outputs0_loss: 0.0111 - n_outputs1_loss: 0.1062 - val_loss: 0.1188 - val_n_outputs0_loss: 0.0017 - val_n_outputs1_loss: 0.1170
Epoch 8/100
1/1 [==============================] - ETA: 0s - loss: 0.1270 - n_outputs0_loss: 0.0182 - n_outputs1_loss: 0.1089
Epoch 8: val_loss did not improve from 0.08692
1/1 [==============================] - 0s 140ms/step - loss: 0.1270 - n_outputs0_loss: 0.0182 - n_outputs1_loss: 0.1089 - val_loss: 0.0971 - val_n_outputs0_loss: 0.0020 - val_n_outputs1_loss: 0.0951
INFO:donkeycar.parts.keras:////////// Finished training in: 0:00:08.441838 //////////
[...]
2024-06-05 00:48:45.071498: I tensorflow/cc/saved_model/loader.cc:316] SavedModel load for tags { serve }; Status: success: OK. Took 66754 microseconds.
[...]
INFO:donkeycar.parts.interpreter:TFLite conversion done.
INFO:donkeycar.pipeline.database:Writing database file: /home/USER/dev-donkey/mycar/models/database.json

... a few errors but the training process completed (early due to "no improvment in validation loss" -- my bogus dataset ;), and everthing looks A-okay. I'll have to test with a real dataset of course, but at least the (compatibility) issues seem to have been fixed!

Thank you sir.

generic-beat-detector commented 5 months ago

@DocGarbanzo,

Hi! So far, so good. I've just trained a model on a

$ ls -l data/images/ | wc -l
13960

image dataset, and it's quite lovely. The autopilot has completed several runs like a champ!

I could swear I previously run into an issue (seemingly Python v3.11 related) with donkey ui (a "recursion depth exceeded" type error) but I mysteriously cannot reproduce it. In any case, it is not a priority right now. I will let you know of any problems in another thread. Thanks once again.

DocGarbanzo commented 5 months ago

@DocGarbanzo,

Hi!

So far, so good. I've just trained a model on a
$ ls -l data/images/ | wc -l

13960
image dataset, and it's quite lovely. The autopilot has completed several runs like a champ!

I could swear I previously run into an issue (seemingly Python v3.11 related) with donkey ui (a "recursion depth exceeded" type error) but I mysteriously cannot reproduce it. In any case, it is not a priority right now. I will let you know of any problems in another thread. Thanks once again.

Ok, great. Thanks for confirming. The TF key error still is a bit concerning. We'll have an eye on it if that ever shows up again.

autorope / donkeycar

Training command Errors (TensorFlow/Python Incompatibility)? #1181

Ubuntu 22.04, x86-64 (w/ RTX 3070)

RPi 4B