aurora95 / Keras-FCN

Keras/TensorFlow implementation of Fully Convolutional Networks for Semantic Segmentation (unfinished)
MIT License

slow training progress #17

Closed ahundt closed 7 years ago

ahundt commented 7 years ago

I'm running with the current master and I'm not seeing the performance described in https://github.com/aurora95/Keras-FCN/issues/4; perhaps something is wrong with my converted dataset?


11127/11127 [==============================] - 7896s - loss: 1.3891 - sparse_accuracy_ignoring_last_label: 0.6165
lr: 0.009964
Epoch 2/250
11127/11127 [==============================] - 7972s - loss: 1.0751 - sparse_accuracy_ignoring_last_label: 0.6326
lr: 0.009928
Epoch 3/250
11127/11127 [==============================] - 7937s - loss: 1.0529 - sparse_accuracy_ignoring_last_label: 0.6385
lr: 0.009892
Epoch 4/250
11127/11127 [==============================] - 7878s - loss: 1.0487 - sparse_accuracy_ignoring_last_label: 0.6407
lr: 0.009856
Epoch 5/250
11127/11127 [==============================] - 7915s - loss: 1.0411 - sparse_accuracy_ignoring_last_label: 0.6434
lr: 0.009820
Epoch 6/250
11127/11127 [==============================] - 7849s - loss: 1.0374 - sparse_accuracy_ignoring_last_label: 0.6447
lr: 0.009784
Epoch 7/250
11127/11127 [==============================] - 7843s - loss: 1.0358 - sparse_accuracy_ignoring_last_label: 0.6448
lr: 0.009748
Epoch 8/250
 6808/11127 [=================>............] - ETA: 3041s - loss: 1.0342 - sparse_accuracy_ignoring_last_label: 0.6447

Also, training is taking a lot longer than I expected, at around 2 hours per epoch. Is that typical with the full 11k images from Pascal VOC plus the Berkeley dataset? I'm running on a GTX 1080 with a batch size of 16, and the files are stored on an HDD rather than an SSD, though in theory Linux caches this sort of thing and the whole dataset could fit in my 48 GB of system RAM.
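For scale, the epoch times in the log above work out to roughly 1.4 images per second (assuming the progress bar is counting samples, as Keras 1 does, so 11127 is the number of training images per epoch):

```python
# Rough throughput implied by the log above; assumes the progress bar
# counts samples (Keras 1 behavior), so 11127 is images per epoch.
samples_per_epoch = 11127
seconds_per_epoch = 7896  # wall time of the first logged epoch
throughput = samples_per_epoch / float(seconds_per_epoch)
print("%.2f images/sec" % throughput)  # roughly 1.41 images/sec
```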

ahundt commented 7 years ago

For reference, that means these parameters (with SGD as the optimizer):

    model_name = 'AtrousFCN_Resnet50_16s'
    batch_size = 16
    batchnorm_momentum = 0.95
    epochs = 250
    lr_base = 0.01 * (float(batch_size) / 16)
    lr_power = 0.9
    resume_training = False
    if model_name == 'AtrousFCN_Resnet50_16s':  # '==', not 'is', for string comparison
        weight_decay = 0.0001/2
    else:
        weight_decay = 1e-4
    classes = 21
    target_size = (320, 320)
    dataset = 'VOC2012_BERKELEY'
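The per-epoch `lr:` values in the log are consistent with polynomial decay of `lr_base` by `lr_power` over `epochs`. A minimal sketch of that schedule (the function name is mine, not from the repo):

```python
# Sketch of the polynomial learning-rate decay implied by the logged
# lr values; reproduces e.g. "lr: 0.009964" after epoch 1.
def poly_decay_lr(epoch, lr_base=0.01, lr_power=0.9, epochs=250):
    # lr = lr_base * (1 - epoch / epochs) ** lr_power
    return lr_base * ((1.0 - float(epoch) / epochs) ** lr_power)

print(round(poly_decay_lr(1), 6))  # 0.009964, matching the log
```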
aurora95 commented 7 years ago

To be honest, I didn't test the new code...

But my old code, running on an old Titan X, takes about 600-700s per epoch for AtrousFCN_Resnet50_16s with batch_size=16 and target_size=(320, 320) on the 11k dataset. Also, the accuracy after the first epoch should be around 0.78. So something must be seriously wrong on your end...

ahundt commented 7 years ago

Oh man, so silly... the branch on my laptop was different from the one on my training workstation, sorry about that. The numbers look like what you describe now that I've checked out the right branch and enabled SGD.

 1398/11127 [==>...........................] - ETA: 6660s - loss: 1.0014 - sparse_accuracy_ignoring_last_label: 0.7451

For some reason epochs are still very slow for me, but that's probably specific to my machine.