fizyr / keras-retinanet

Keras implementation of RetinaNet object detection.
Apache License 2.0

Getting an mAP of 0.000 using retinanet csv #1473

Closed andrewboyes closed 3 years ago

andrewboyes commented 4 years ago

I am currently using fizyr/keras-retinanet to train a model that detects 3 classes. When I train the model, I get an average precision of 0.0000 on all my classes. In some rounds of training, I got slightly higher precisions, e.g. 0.0007.

I have looked at these threads, but their solutions do not seem to work for me: https://github.com/fizyr/keras-retinanet/issues/647 and https://github.com/fizyr/keras-retinanet/issues/1351

That is, I added the --image-max-side argument to my training command and set it to 2560 pixels. The images I am working with are 1920x2560 pixels. The training set is 916 images and the validation set is 258 images.

The full command that I use to train the model is:

python train.py \
    --weights old_snapshots/resnet50_coco_best_v2.h5 \
    --backbone resnet50 \
    --batch-size 1 \
    --image-max-side 2560 \
    --epochs 50 \
    --steps 200 \
    --lr 1e-8 \
    --snapshot-path new_snapshots \
    --tensorboard-dir logs \
    --random-transform \
    csv \
    train.csv \
    classes.csv \
    --val-annotations validation.csv

I have also tried running the above command without initializing the weights from the COCO snapshot. This produces the same result. I have copied the train.py file into my parent directory (and changed the imports to absolute paths).

I had to include this extra piece of code in train.py so that training was not stopped by the GPU running out of memory:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front.
devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)

Here is a sample from my train.csv file:

dataset/202009/2020-09-18_20-26-16-480016.jpg,645,1178,819,1366,object1
dataset/202009/2020-09-18_20-26-16-480016.jpg,669,1306,1015,1486,object2
dataset/202009/2020-09-14_07-13-59-258711.jpg,,,,,
dataset/202009/2020-09-14_18-58-25-411295.jpg,,,,,
dataset/202009/2020-09-21_20-43-20-525886.jpg,1154,1214,1501,1429,object2
dataset/202009/2020-09-21_20-43-20-525886.jpg,1509,1176,1707,1396,object1
dataset/202009/2020-09-14_19-32-17-116910.jpg,,,,,

Here is my classes.csv file:

object1,0
object2,1
object3,2
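
As a quick sanity check on the annotation format, something like the following could flag rows with degenerate boxes or unknown class names (a minimal sketch, not part of keras-retinanet; it assumes the path,x1,y1,x2,y2,class column order shown above):

import csv

# Class names defined in classes.csv (first column).
with open('classes.csv') as f:
    classes = {row[0] for row in csv.reader(f) if row}

# Flag annotation rows that the CSV generator is likely to reject:
# boxes with x2 <= x1 or y2 <= y1, and class labels not in classes.csv.
with open('train.csv') as f:
    for i, row in enumerate(csv.reader(f), start=1):
        path, x1, y1, x2, y2, label = row
        if label == '':  # background-only image, no box expected
            continue
        if int(x2) <= int(x1) or int(y2) <= int(y1):
            print(f'line {i}: degenerate box {x1},{y1},{x2},{y2}')
        if label not in classes:
            print(f'line {i}: unknown class "{label}"')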

My installation setup is: Windows 10, TensorFlow 2.3.1, CUDA Toolkit 11.0, cuDNN v7.6.3.

The precision does not change over multiple epochs. Here is a sample of the output:

2020-10-06 13:15:11.249841: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
2020-10-06 13:15:13.280542: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library nvcuda.dll
2020-10-06 13:15:13.326726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
...
2020-10-06 13:15:13.419482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Creating model, this may take a second...
2020-10-06 13:15:14.118835: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-10-06 13:15:14.142741: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x21262d9de70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.151023: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-06 13:15:14.157396: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro T2000 computeCapability: 7.5
coreClock: 1.785GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 119.24GiB/s
2020-10-06 13:15:14.169198: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
...
2020-10-06 13:15:14.208255: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:14.214516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-10-06 13:15:14.783905: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-06 13:15:14.791223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0
2020-10-06 13:15:14.797282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N
2020-10-06 13:15:14.801179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2905 MB memory) -> physical GPU (device: 0, name:
Quadro T2000, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-10-06 13:15:14.818767: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2120ce3da40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-10-06 13:15:14.825738: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Quadro T2000, Compute Capability 7.5
Model: "retinanet"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            [(None, None, None,  0
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, None, None, 6 9408        input_1[0][0]
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, None, None, 6 256         conv1[0][0]
__________________________________________________________________________________________________
conv1_relu (Activation)         (None, None, None, 6 0           bn_conv1[0][0]
__________________________________________________________________________________________________
pool1 (MaxPooling2D)            (None, None, None, 6 0           conv1_relu[0][0]
__________________________________________________________________________________________________
res2a_branch2a (Conv2D)         (None, None, None, 6 4096        pool1[0][0]
__________________________________________________________________________________________________
bn2a_branch2a (BatchNormalizati (None, None, None, 6 256         res2a_branch2a[0][0]
__________________________________________________________________________________________________
res2a_branch2a_relu (Activation (None, None, None, 6 0           bn2a_branch2a[0][0]
__________________________________________________________________________________________________
...

P4_merged (Add)                 (None, None, None, 2 0           P5_upsampled[0][0]
                                                                 C4_reduced[0][0]
__________________________________________________________________________________________________
P4_upsampled (UpsampleLike)     (None, None, None, 2 0           P4_merged[0][0]
                                                                 res3d_relu[0][0]
__________________________________________________________________________________________________
C3_reduced (Conv2D)             (None, None, None, 2 131328      res3d_relu[0][0]
__________________________________________________________________________________________________
P6 (Conv2D)                     (None, None, None, 2 4718848     res5c_relu[0][0]
__________________________________________________________________________________________________
P3_merged (Add)                 (None, None, None, 2 0           P4_upsampled[0][0]
                                                                 C3_reduced[0][0]
__________________________________________________________________________________________________
C6_relu (Activation)            (None, None, None, 2 0           P6[0][0]
__________________________________________________________________________________________________
P3 (Conv2D)                     (None, None, None, 2 590080      P3_merged[0][0]
__________________________________________________________________________________________________
P4 (Conv2D)                     (None, None, None, 2 590080      P4_merged[0][0]
__________________________________________________________________________________________________
P5 (Conv2D)                     (None, None, None, 2 590080      C5_reduced[0][0]
__________________________________________________________________________________________________
P7 (Conv2D)                     (None, None, None, 2 590080      C6_relu[0][0]
__________________________________________________________________________________________________
regression_submodel (Functional (None, None, 4)      2443300     P3[0][0]
                                                                 P4[0][0]
                                                                 P5[0][0]
                                                                 P6[0][0]
                                                                 P7[0][0]
__________________________________________________________________________________________________
classification_submodel (Functi (None, None, 3)      2422555     P3[0][0]
                                                                 P4[0][0]
                                                                 P5[0][0]
                                                                 P6[0][0]
                                                                 P7[0][0]
__________________________________________________________________________________________________
regression (Concatenate)        (None, None, 4)      0           regression_submodel[0][0]
                                                                 regression_submodel[1][0]
                                                                 regression_submodel[2][0]
                                                                 regression_submodel[3][0]
                                                                 regression_submodel[4][0]
__________________________________________________________________________________________________
classification (Concatenate)    (None, None, 3)      0           classification_submodel[0][0]
                                                                 classification_submodel[1][0]
                                                                 classification_submodel[2][0]
                                                                 classification_submodel[3][0]
                                                                 classification_submodel[4][0]
==================================================================================================
Total params: 36,424,447
Trainable params: 36,318,207
Non-trainable params: 106,240
__________________________________________________________________________________________________
None
WARNING:tensorflow:`batch_size` is no longer needed in the `TensorBoard` Callback and will be ignored in TensorFlow 2.0.
2020-10-06 13:15:17.712389: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
2020-10-06 13:15:17.721698: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1391] Profiler found 1 GPUs
2020-10-06 13:15:17.749155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cupti64_101.dll
2020-10-06 13:15:17.855545: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
WARNING:tensorflow:From train_latest_fizyr.py:541: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.
Epoch 1/2
2020-10-06 13:15:25.776950: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudnn64_7.dll
2020-10-06 13:15:28.004983: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
2020-10-06 13:15:28.121843: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
2020-10-06 13:15:29.193580: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.209383: I tensorflow/stream_executor/cuda/cuda_driver.cc:775] failed to allocate 858.70M (900412160 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-10-06 13:15:29.337869: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.363332: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-10-06 13:15:29.464090: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
...
2020-10-06 13:15:30.261915: W tensorflow/core/common_runtime/bfc_allocator.cc:246] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.16GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
  1/200 [..............................] - ETA: 0s - loss: 3.9458 - regression_loss: 2.8127 - classification_loss: 1.13312020-10-06 13:15:31.922292: I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session started.
WARNING:tensorflow:From C:\XXXXX\venv38\lib\site-packages\tensorflow\python\ops\summary_ops_v2.py:1277: stop (from tensorflow.python.eager.profiler) is deprecated and will be removed after 2020-07-01.
Instructions for updating:
use `tf.profiler.experimental.stop` instead.
2020-10-06 13:15:32.542621: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
2020-10-06 13:15:32.580191: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:223]  GpuTracer has collected 3193 callback api events and 3193 activity events.
2020-10-06 13:15:32.695250: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.734889: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for trace.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.trace.json.gz
2020-10-06 13:15:32.857585: I tensorflow/core/profiler/rpc/client/save_profile.cc:176] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32
2020-10-06 13:15:32.874147: I tensorflow/core/profiler/rpc/client/save_profile.cc:182] Dumped gzipped tool data for memory_profile.json.gz to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.memory_profile.json.gz
2020-10-06 13:15:32.901109: I tensorflow/python/profiler/internal/profiler_wrapper.cc:111] Creating directory: logs\train\plugins\profile\2020_10_06_11_15_32Dumped tool data for xplane.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.xplane.pb
Dumped tool data for overview_page.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.overview_page.pb
Dumped tool data for input_pipeline.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to logs\train\plugins\profile\2020_10_06_11_15_32\XXXX.kernel_stats.pb

  2/200 [..............................] - ETA: 1:41 - loss: 3.8811 - regression_loss: 2.7477 - classification_loss: 1.1334WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0590s vs `on_train_batch_end` time: 0.9618s). Check your callbacks.
Running network: 100% (165 of 165) |#########################################################################################################################################| Elapsed Time: 0:00:42 Time:  0:00:42
Parsing annotations: 100% (165 of 165) |#####################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
100 instances of class object1 with average precision: 0.0000
97 instances of class object2 with average precision: 0.0000
15 instances of class object3 with average precision: 0.0000
mAP: 0.0000

If anyone has suggestions on what to try to increase the precision, or on how to troubleshoot why the network isn't finding any objects, please let me know.

andrewboyes commented 4 years ago

After a few epochs it sometimes produces this error:

Epoch 00006: ReduceLROnPlateau reducing learning rate to 1.000000013351432e-11.
200/200 [==============================] - 267s 1s/step - loss: 2.8008 - regression_loss: 1.9139 - classification_loss: 0.8868
2020-10-07 12:23:17.121434: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
         [[{{node PyFunc}}]]
netilovefm1 commented 4 years ago

If you get that error, try resuming from the checkpoint saved at that epoch, not from the first weight file.
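
That is, point train.py's --snapshot argument at the epoch snapshot instead of using --weights, for example (the snapshot filename is illustrative and depends on your --snapshot-path and epoch; check that your version of train.py has this argument):

python train.py \
    --snapshot new_snapshots/resnet50_csv_06.h5 \
    --backbone resnet50 \
    --batch-size 1 \
    --image-max-side 2560 \
    csv \
    train.csv \
    classes.csv \
    --val-annotations validation.csv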

andrewboyes commented 4 years ago

I have tried a previous version of fizyr/keras-retinanet with TensorFlow 1.14 and I have the same problem of low mAP (0.00); however, I no longer get the failed precondition error. What could I try to increase the mAP? How many epochs would one expect before seeing an improvement in mAP? I am looking for light objects in a dark trunk. My dataset is roughly 1500 images, with about 300 instances of each of the 3 object classes. Which weight file should one use to initialize the network? During training, the learning rate is automatically reduced to below 1e-20 without any improvement in mAP.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale due to the lack of recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

rizwanriaz-se commented 1 year ago

Has anyone solved this yet? I am also having this issue and none of the provided solutions work :(

sangeun-jo commented 11 months ago

I had the same problem, but it was solved after I set the batch-size option to the class count. (I had 3 classes, so I set batch-size to 3.)
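
Applied to the command from the original post, that change would look something like this (only --batch-size differs; the remaining flags from the original command are omitted here for brevity):

python train.py \
    --weights old_snapshots/resnet50_coco_best_v2.h5 \
    --backbone resnet50 \
    --batch-size 3 \
    --image-max-side 2560 \
    csv \
    train.csv \
    classes.csv \
    --val-annotations validation.csv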