PavlosMelissinos / enet-keras

A keras implementation of ENet (abandoned for the foreseeable future)
MIT License

Bad results - Investigate reason #11

Open PavlosMelissinos opened 7 years ago

PavlosMelissinos commented 7 years ago
| Metric | IoU | Area | maxDets | Result |
| --- | --- | --- | --- | --- |
| Average Precision | 0.50:0.95 | all | 100 | 0.001 |
| Average Precision | 0.50 | all | 100 | 0.004 |
| Average Precision | 0.75 | all | 100 | 0.000 |
| Average Precision | 0.50:0.95 | small | 100 | 0.000 |
| Average Precision | 0.50:0.95 | medium | 100 | 0.000 |
| Average Precision | 0.50:0.95 | large | 100 | 0.004 |
| Average Recall | 0.50:0.95 | all | 1 | 0.005 |
| Average Recall | 0.50:0.95 | all | 10 | 0.005 |
| Average Recall | 0.50:0.95 | all | 100 | 0.005 |
| Average Recall | 0.50:0.95 | small | 100 | 0.000 |
| Average Recall | 0.50:0.95 | medium | 100 | 0.001 |
| Average Recall | 0.50:0.95 | large | 100 | 0.019 |

This is using the official mscoco script.

Setup: the full image is the input, and each pixel is classified using a one-hot vector of size 81 (indices 0 to 80 inclusive) whose entries correspond to the actual category ids in MS-COCO. More specifically, index 0 is background, ..., index 12 corresponds to class id 13 (stop sign), ..., and index 80 is in fact class 90 (toothbrush). The output is the full image, not a crop. A script then separates the pixels of each detected object. No classes were used in the evalCOCO.py script (useCats = False).
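For reference, the index-to-id mapping described above can be sketched as follows. This is only an illustration, not the code in the repository: the list of unused ids and the names `INDEX_TO_ID`, `ID_TO_INDEX`, `one_hot` and `encode_pixel` are assumptions, and in practice the ids would be read from the annotation file.

```python
# COCO category ids run from 1 to 90 with ten unused values.
# The gap list below is an assumption hard-coded for illustration.
UNUSED_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}
CATEGORY_IDS = [i for i in range(1, 91) if i not in UNUSED_IDS]

# Index 0 is reserved for background; indices 1..80 follow sorted id order.
INDEX_TO_ID = [0] + CATEGORY_IDS
ID_TO_INDEX = {cat_id: idx for idx, cat_id in enumerate(INDEX_TO_ID)}


def one_hot(index, num_classes=81):
    """Return a one-hot list of length num_classes for a class index."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec


def encode_pixel(cat_id):
    """Map a COCO category id (0 = background) to its one-hot vector."""
    return one_hot(ID_TO_INDEX[cat_id])
```

With this mapping, `INDEX_TO_ID[12]` is 13 (stop sign) and `INDEX_TO_ID[80]` is 90 (toothbrush), matching the description.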

These are really bad scores, and at the moment I have no idea why it's like that. I'll push the changes soon.

Which script do you use for evaluation, @ahundt? If you have a working version, maybe I should just replace mine with it. Does it work for mscoco?

ahundt commented 7 years ago

Sadly so far my mscoco results in Keras-FCN were no good. The loss function went negative with the full image segmentation and one_hot encoding, then I ran out of time to investigate the details.
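As a side note, one generic way a cross-entropy loss can go negative is when the targets fall outside [0, 1], for example when raw category ids are fed in where one-hot values are expected. The sketch below only illustrates that effect; it is not a diagnosis of Keras-FCN, and `binary_crossentropy` here is a hypothetical helper, not the Keras function.

```python
import math


def binary_crossentropy(target, prob, eps=1e-7):
    """Per-element binary cross-entropy with clipped probabilities."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


# A valid target in [0, 1] always gives a non-negative loss.
valid = binary_crossentropy(1.0, 0.9)

# A raw category id used as a target (an invalid input) can give a negative loss.
invalid = binary_crossentropy(13.0, 0.9)
```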

PavlosMelissinos commented 7 years ago

That's too bad... FCNSS doesn't report performance on MS-COCO iirc but in the PSPNet paper, some IoU results are mentioned on page 8 (Table 6) and they seem to be pretty decent.

ghost commented 7 years ago

Hi guys, So, have you been able to reproduce the ENet results from the paper? Best, D.

PavlosMelissinos commented 7 years ago

Not really, sorry. It does converge but not well enough.

In the paper, the encoder is pretrained on ImageNet and the full pipeline is then fine-tuned on Cityscapes, CamVid and Sun RGB-D. However, I haven't set them up yet so I've only trained the network on MS-COCO (which often gives awful results). I'd like to finish the project at some point but I've had to move on to other stuff so at the moment I don't have the resources to do it properly, unfortunately. :(

ghost commented 7 years ago

No worries, I'll pick it up from here and see what the problem is. Not sure I'll be allowed to share code, though.

ahundt commented 7 years ago

There has been a bugfix in densenet that solved some problems so it might work better now!

https://github.com/farizrahman4u/keras-contrib/blob/master/keras_contrib/applications/densenet.py

jmtatsch commented 7 years ago

@ahundt Can you elaborate on how the densenet fix may be applicable to enet-keras? It seems as if the main gradient flow and the pooling indices are connected properly, or am I missing something?

ahundt commented 7 years ago

@jmtatsch Sorry, my post is totally irrelevant. I must have mixed up tabs in my browser or something.

ghost commented 7 years ago

Hi guys, so it seems that I have probably managed to successfully retrain ENet on our own dataset by loading pretrained weights from torch and using adadelta (adagrad didn't work as well for me). You can load the weights from the torch model with torchfile. One more minor thing, which shouldn't make much difference, is that I added batch norm after the initial layer, following the paper, which I think was not there in the code. Anyway, I need to investigate the per-class accuracy a bit more and will get back to you. Best, D.

jmtatsch commented 7 years ago

@dkorkino The PReLU also seems to be missing compared to https://github.com/e-lab/ENet-training/blob/master/train/models/encoder.lua#L86. Could you maybe publish the converted weights?

PavlosMelissinos commented 7 years ago

You're both right, @ghost and @jmtatsch. I also noticed a division bug in MaxPoolingWithArgmax2D that resulted in unwanted behavior on Python 3, and another bug in the data generator. All three should be fixed now, but let me know about any problems you might encounter.
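For anyone curious about this class of bug: on Python 3, `/` always returns a float, so shape arithmetic written for Python 2 silently stops producing integers. The snippet below is a generic sketch of the pattern, not the actual change made to MaxPoolingWithArgmax2D; `pooled_shape` is a hypothetical helper.

```python
def pooled_shape(height, width, pool=2):
    """Output spatial size of a non-overlapping pooling window."""
    # `//` keeps the result an int on both Python 2 and Python 3.
    # With `/`, Python 3 returns floats (e.g. 128.0), which breaks
    # code that later uses the value as a shape or an index.
    return height // pool, width // pool


h, w = pooled_shape(256, 256)
```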

Thanks a lot for the feedback 👍.

Sorry for taking this long to tackle the issue but I'd been on vacation until yesterday.

ColdCodeCool commented 7 years ago

@dkorkino @jmtatsch I am also looking forward to the release of the converted weights.

PavlosMelissinos commented 7 years ago

Does anyone have any idea why it takes so long to train?

I'm getting something like 25K seconds per epoch on MS-COCO (~80K samples) on a K40 for input dimensions of 256x256.

That amounts to ~0.3 s per sample, i.e. roughly 3 samples per second for a full training step, so let's say about 10 fps for just the forward pass. That's much slower than the reported performance (135.4 fps for 640x360 on a Titan X).

I used to think it might be due to preprocessing but it actually only takes a fraction of that time. Any thoughts?
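The arithmetic behind those numbers, as a quick sketch. The one-third share of the forward pass is an assumption made purely for illustration (it is what turns ~3 samples/s into "about 10 fps"), and the comparison ignores the different input resolution and GPU.

```python
epoch_seconds = 25_000      # reported time per epoch
samples_per_epoch = 80_000  # approximate number of MS-COCO samples

seconds_per_sample = epoch_seconds / samples_per_epoch
training_rate = 1.0 / seconds_per_sample  # samples per second, full step

# Assumption for illustration only: the forward pass takes roughly
# one third of a training step.
forward_fraction = 1.0 / 3.0
forward_fps = training_rate / forward_fraction

reported_fps = 135.4  # figure quoted from the paper (640x360, Titan X)
slowdown = reported_fps / forward_fps
```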

ahundt commented 7 years ago

Keras spends a lot of time with an empty gpu. There are collectively quite a few reasons, some of which are discussed in https://github.com/fchollet/keras/issues/6928. Putting things into a tfrecord, using #6928 and using the TF staging areas could help.

Alternately, there are some ways to do it with tensorflow proper, but there aren't great public examples aside from https://www.tensorflow.org/performance/performance_models, which is a bit convoluted.

PavlosMelissinos commented 7 years ago

That's a bummer to the extent it's true; I'd rather it was 100% my own mistake. There's definitely room for improvement in my implementation (training hasn't finished yet, but judging by the progression of the loss, I don't expect the results to be much better than the current ones). However, speed is an issue that hinders prototyping and evaluation, especially when this network takes more than 10x as long to train as it should, and I'm not sure what I could do to fix it.

I've monitored the utilization of the GPU and it's not that low though, maybe that's not always such a big deal?

I'll check out the available solutions when I find some time, thanks @ahundt .

ahundt commented 7 years ago

It will definitely vary a bunch by use case and your physical hardware. For example, if you've got a Titan X but no super fast SSD, I don't think it will be feasible to train at 135 fps. Wouldn't that figure most likely be with 8x Titan X devices?

PavlosMelissinos commented 7 years ago

Hmm, I don't think they used multiple gpus to get that number, because the authors report 10x-20x better performance than segnet and my results with segnet are comparable to the ones they mention.

PavlosMelissinos commented 7 years ago

@jmtatsch @ColdCodeCool @ghost @ahundt Good news, everyone! I've pushed a new commit that adds weight transfer capabilities. All you have to do is:

  1. Download the trained model and put it in the models/pretrained directory within the enet-keras project.

  2. Run from_torch.py to retrieve the actual weights and put them in a pickle file.

  3. Train/finetune/predict as usual (the model will read the file if it exists; otherwise you'll get an "ENet has found no compatible pretrained weights! Skipping weight transfer..." message).
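The fallback behavior in step 3 can be pictured roughly as below. This is a minimal sketch and not the repository's actual code: the function name `load_pretrained_weights` and any path passed to it are hypothetical; only the printed message is taken from the description above.

```python
import os
import pickle


def load_pretrained_weights(path):
    """Return the unpickled weights at `path`, or None if the file is absent."""
    if not os.path.isfile(path):
        print('ENet has found no compatible pretrained weights! '
              'Skipping weight transfer...')
        return None
    with open(path, 'rb') as handle:
        return pickle.load(handle)
```

A caller would then transfer the weights only when the return value is not `None` and otherwise train from scratch.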

Any questions/comments/criticism are welcome as always :)

PavlosMelissinos commented 7 years ago

Haven't tried to train the network yet but I'll let you know how it goes when I do.

ahundt commented 6 years ago

@PavlosMelissinos Hey I was looking through your latest version, and perhaps I misunderstood what I read, but have you considered changing your loss function when training from scratch?

Something like these may be necessary for segmentation: https://github.com/theduynguyen/Keras-FCN/blob/master/loss_func.py

PavlosMelissinos commented 6 years ago

The main problem is that it doesn't work well enough even with the pretrained weights.

However, crossentropy without bg seems interesting and it might be what I need, thanks. I'll check it out!
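One way to read "crossentropy without bg" is a per-pixel cross-entropy averaged only over pixels whose true class is not background. The pure-Python sketch below follows that reading; it is an assumption about the idea, not the implementation in the linked loss_func.py, and `crossentropy_without_background` is a hypothetical name.

```python
import math


def crossentropy_without_background(labels, probs, background=0, eps=1e-7):
    """Mean cross-entropy over pixels whose true class is not background.

    labels: list of integer class indices, one per pixel.
    probs:  list of per-pixel probability lists (each sums to 1).
    """
    losses = [
        -math.log(max(pixel_probs[label], eps))
        for label, pixel_probs in zip(labels, probs)
        if label != background
    ]
    # If every pixel is background there is nothing to average.
    return sum(losses) / len(losses) if losses else 0.0
```

Masking background this way keeps the many easy background pixels from dominating the average, at the cost of giving the network no direct signal on them.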

ahundt commented 6 years ago

I added some segmentation metrics and losses: https://github.com/keras-team/keras-contrib/pull/197