autorope / donkeycar

Open source hardware and software platform to build a small scale self driving car.
http://www.donkeycar.com
MIT License

Failed to allocate memory #1093

Closed Ahrovan closed 1 year ago

Ahrovan commented 1 year ago

donkey train --tub ./data --model ./models/myModel.h5

2 root error(s)

(0) Resource exhausted: Failed to allocate memory for the batch of component 0 [[node IteratorGetNext (defined at /projects/donkeycar/donkeycar/parts/keras.py:183) ]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. [[IteratorGetNext/_6]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: Failed to allocate memory for the batch of component 0 [[node IteratorGetNext (defined at /projects/donkeycar/donkeycar/parts/keras.py:183) ]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Hardware/Software Details:

Ezward commented 1 year ago

That branch is 6 years old. You should try the main branch. We are also close to merging a new branch that upgrades to JetPack 4.6 and TensorFlow 2.9.

Ahrovan commented 1 year ago

Thank you, @Ezward. How do I try the new branch? Do I have to install JetPack 4.6?

Ahrovan commented 1 year ago

@Ezward I still get "Failed to allocate memory" on the main branch:

using donkey v4.4.4-main ...
INFO:donkeycar.config:loading config file: ./config.py
INFO:donkeycar.config:loading personal config over-rides from ./myconfig.py
2023-02-11 19:44:42.958070: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
WARNING:donkeycar.pipeline.database:No model database found at /home/ahrovan/Projects/myCar/models/database.json
INFO:donkeycar.utils:get_model_by_type: model type is: linear
2023-02-11 19:44:52.458462: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-02-11 19:44:52.499566: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:52.499729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1742] Found device 0 with properties: pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3 coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2023-02-11 19:44:52.499810: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2023-02-11 19:44:52.663623: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-02-11 19:44:52.736950: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-02-11 19:44:52.843827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-02-11 19:44:52.975454: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-02-11 19:44:53.048683: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2023-02-11 19:44:53.052396: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-02-11 19:44:53.052787: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.053117: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.053198: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1884] Adding visible gpu devices: 0
2023-02-11 19:44:53.084356: W tensorflow/core/platform/profile_utils/cpu_utils.cc:108] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
2023-02-11 19:44:53.085005: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3eb00150 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-02-11 19:44:53.085078: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-02-11 19:44:53.222570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.223062: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3eb7a360 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-11 19:44:53.223133: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA Tegra X1, Compute Capability 5.3
2023-02-11 19:44:53.248679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.248849: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1742] Found device 0 with properties: pciBusID: 0000:00:00.0 name: NVIDIA Tegra X1 computeCapability: 5.3 coreClock: 0.9216GHz coreCount: 1 deviceMemorySize: 3.87GiB deviceMemoryBandwidth: 194.55MiB/s
2023-02-11 19:44:53.248996: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2023-02-11 19:44:53.249223: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-02-11 19:44:53.249356: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-02-11 19:44:53.249468: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-02-11 19:44:53.249573: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2023-02-11 19:44:53.249676: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2023-02-11 19:44:53.249780: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-02-11 19:44:53.250062: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.250353: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:53.250446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1884] Adding visible gpu devices: 0
2023-02-11 19:44:53.250572: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2023-02-11 19:44:57.660757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1283] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-02-11 19:44:57.660846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1289] 0
2023-02-11 19:44:57.660895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1302] 0: N
2023-02-11 19:44:57.661470: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:57.661845: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1046] ARM64 does not support NUMA - returning NUMA node zero
2023-02-11 19:44:57.662002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1428] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1080 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0, compute capability: 5.3)
INFO:donkeycar.parts.keras:Created KerasLinear with interpreter: KerasInterpreter
Model: "linear"


Layer (type) Output Shape Param # Connected to

img_in (InputLayer) [(None, 600, 800, 3) 0


conv2d_1 (Conv2D) (None, 298, 398, 24) 1824 img_in[0][0]


dropout (Dropout) (None, 298, 398, 24) 0 conv2d_1[0][0]


conv2d_2 (Conv2D) (None, 147, 197, 32) 19232 dropout[0][0]


dropout_1 (Dropout) (None, 147, 197, 32) 0 conv2d_2[0][0]


conv2d_3 (Conv2D) (None, 72, 97, 64) 51264 dropout_1[0][0]


dropout_2 (Dropout) (None, 72, 97, 64) 0 conv2d_3[0][0]


conv2d_4 (Conv2D) (None, 70, 95, 64) 36928 dropout_2[0][0]


dropout_3 (Dropout) (None, 70, 95, 64) 0 conv2d_4[0][0]


conv2d_5 (Conv2D) (None, 68, 93, 64) 36928 dropout_3[0][0]


dropout_4 (Dropout) (None, 68, 93, 64) 0 conv2d_5[0][0]


flattened (Flatten) (None, 404736) 0 dropout_4[0][0]


dense_1 (Dense) (None, 100) 40473700 flattened[0][0]


dropout_5 (Dropout) (None, 100) 0 dense_1[0][0]


dense_2 (Dense) (None, 50) 5050 dropout_5[0][0]


dropout_6 (Dropout) (None, 50) 0 dense_2[0][0]


n_outputs0 (Dense) (None, 1) 51 dropout_6[0][0]


n_outputs1 (Dense) (None, 1) 51 dropout_6[0][0]

Total params: 40,625,028
Trainable params: 40,625,028
Non-trainable params: 0


None
Using catalog /home/ahrovan/Projects/myCar/data/catalog_0.catalog
INFO:donkeycar.pipeline.types:Loading tubs from paths ['./data']
Records # Training 641
Records # Validation 161
Epoch 1/100
2023-02-11 19:45:06.600075: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2023-02-11 19:45:26.891143: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 1474560000 exceeds 10% of free system memory.
Traceback (most recent call last):
  File "/home/ahrovan/env/bin/donkey", line 33, in <module>
    sys.exit(load_entry_point('donkeycar', 'console_scripts', 'donkey')())
  File "/home/ahrovan/Projects/donkeycar/donkeycar/management/base.py", line 626, in execute_from_command_line
    c.run(args[2:])
  File "/home/ahrovan/Projects/donkeycar/donkeycar/management/base.py", line 563, in run
    args.comment)
  File "/home/ahrovan/Projects/donkeycar/donkeycar/pipeline/training.py", line 158, in train
    show_plot=cfg.SHOW_PLOT)
  File "/home/ahrovan/Projects/donkeycar/donkeycar/parts/keras.py", line 183, in train
    use_multiprocessing=False)
  File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/home/ahrovan/env/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: Failed to allocate memory for the batch of component 0 [[node IteratorGetNext (defined at /Projects/donkeycar/donkeycar/parts/keras.py:183) ]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: Failed to allocate memory for the batch of component 0 [[node IteratorGetNext (defined at /Projects/donkeycar/donkeycar/parts/keras.py:183) ]] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[IteratorGetNext/_6]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations. 0 derived errors ignored. [Op:__inference_train_function_1794]

Function call stack: train_function -> train_function

Ezward commented 1 year ago

What machine are you running training on? Maybe you are trying to train on a Jetson Nano? I have heard of others who did this, but it is not a supported training machine. I would try reducing the batch size by changing the BATCH_SIZE configuration; try 16.
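For reference, a minimal myconfig.py override along those lines might look like this (BATCH_SIZE is the standard donkeycar training setting; the default value differs between versions):

# myconfig.py -- personal overrides loaded on top of config.py
# A smaller batch needs less memory per training step on the 4 GB Nano.
BATCH_SIZE = 16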

Ahrovan commented 1 year ago

@Ezward Yes, I am trying to train on a Jetson Nano B01. I changed BATCH_SIZE to 1, but it still failed.

DocGarbanzo commented 1 year ago

@Ahrovan - you seem to be using an image size of 600x800. The linear model that you are trying to train is geared towards an image size of 120x160, and has about 500k parameters at that size. For 600x800 you would need a model with more compression, i.e. more layers or larger strides. You can see that your model now has about 40,000,000 parameters (look at the flattened layer in your summary: you come out with a ~400k-dimensional vector and then go into a 100-unit dense layer, which by itself accounts for roughly 40M parameters). This model is far too big to fit into the RAM of the Nano. So either set the image size to the standard 120x160 or modify the model architecture.
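As a quick sanity check of those numbers, plain arithmetic on the shapes printed in the summary above (not donkeycar code) reproduces the dense_1 figure:

# Shapes taken from the model summary above.
flatten_dim = 68 * 93 * 64   # conv2d_5 output (None, 68, 93, 64) -> 404,736
dense_units = 100            # dense_1
print(flatten_dim * dense_units + dense_units)   # 40,473,700, matching dense_1's Param #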

Ahrovan commented 1 year ago

@DocGarbanzo Thank you

Ezward commented 1 year ago

Note that the image should be 120 pixels high and 160 pixels wide. @Ahrovan have you retried with the new image size?
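For completeness, a sketch of the corresponding myconfig.py overrides (IMAGE_H/IMAGE_W/IMAGE_DEPTH are the standard donkeycar camera settings; data already recorded at 600x800 would need to be re-collected or resized to match):

# myconfig.py -- the default resolution the stock linear model is designed for
IMAGE_W = 160    # width in pixels
IMAGE_H = 120    # height in pixels
IMAGE_DEPTH = 3  # RGB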

Ezward commented 1 year ago

@Ahrovan I am going to close this; if you have more info please do add a comment to the closed issue.