fizyr / keras-retinanet

Keras implementation of RetinaNet object detection.
Apache License 2.0

Travis tests no longer pass. #1123

Open de-vri-es opened 5 years ago

de-vri-es commented 5 years ago

The Travis tests no longer pass, as can be seen here: https://travis-ci.org/fizyr/keras-retinanet/builds/592460276.

At a glance, it seems to be a problem with an incompatible version of tensorflow, but I didn't dig very deep.

hgaiser commented 5 years ago

Tensorflow recently updated to 2.0, probably related :)

de-vri-es commented 5 years ago

Sounds plausible ^^

anuar12 commented 5 years ago

Any chance necessary changes will be made for TF2.0 support?

hgaiser commented 5 years ago

Yes I expect this week or next week I'll push an update to fix compatibility with tensorflow 2.0.

de-vri-es commented 5 years ago

If I try to run the tests locally on a machine without CUDA the tests also hang forever. In my case, all 8GB of RAM was exhausted and the whole computer froze.

Perhaps we have a memory leak with TF2 somewhere, or perhaps TF2 or its TF1 compatibility layer is doing something very inefficiently. I didn't try to pinpoint it to anything specific, and I don't think I'll have time for that anytime soon.

hgaiser commented 5 years ago

all 8GB of RAM

You need an upgrade :)

I was testing only tests/models/test_mobilenet.py and it was using a lot of memory, but it did finish. Interestingly, adding the following:

import tensorflow as tf
tf.compat.v1.disable_v2_behavior()

Seems to make everything "normal". I don't want to add that code everywhere, but so far it's the only solution I have. I don't have much time to look into this issue further, but I will see if I can find out the underlying issue. In the meantime, @de-vri-es could you verify that disabling v2 behavior fixes the tests on your system too?
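
For reference, one way to avoid sprinkling that call everywhere would be a pytest conftest.py at the repository root — a minimal sketch, assuming the flag is acceptable for the whole suite:

# conftest.py (sketch) -- imported by pytest before any test module,
# so v2 behavior is disabled once for the entire suite.
import tensorflow as tf

tf.compat.v1.disable_v2_behavior()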

hgaiser commented 5 years ago

I feel like this could be related: https://github.com/tensorflow/tensorflow/issues/32052. It sounds like there are some memory issues with TF2, and the Travis build most likely gets killed because it uses too much memory. Maybe as a temporary fix we can disable the regression tests (the ones that actually train on a bit of data) and only test specific functions for now?
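
A sketch of what temporarily disabling a regression test could look like, using a plain pytest skip marker (the reason text is illustrative):

import pytest

@pytest.mark.skip(reason='temporarily disabled: TF2 memory usage kills the Travis build')
def test_coco():
    ...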

de-vri-es commented 5 years ago

all 8GB of RAM

You need an upgrade :)

It was actually 16 GB, though maybe I still should :)

Anyway, to set a baseline I tried running the tests again without any modifications, and python segfaulted in some multiprocessing operation :dancer:

tests/test_losses.py .                                                                                          [  1%]
tests/backend/test_common.py ..                                                                                 [  5%]
tests/bin/test_train.py Fatal Python error: Segmentation fault

Thread 0x00007ffa7d5fb700 (most recent call first):
  File "/usr/lib/python3.7/multiprocessinzsh: segmentation fault (core dumped)  pytest

Ran it again, no segfault, but memory usage crept up in tests/models/test_densenet.py until I killed it when it reached 14 GB. Then I added the change you suggested to that test, and the same test (also densenet) passed using "only" 4 GB of RAM.

However, then I copied the change into keras_retinanet/models/retinanet.py to get it everywhere, and the train.py tests started failing:

______________________________________________________ test_coco ______________________________________________________

    def test_coco():
        # ignore warnings in this test
        warnings.simplefilter('ignore')

        # run training / evaluation
        keras_retinanet.bin.train.main([
            '--epochs=1',
            '--steps=1',
            '--no-weights',
            '--no-snapshots',
            'coco',
>           'tests/test-data/coco',
        ])

tests/bin/test_train.py:44: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
keras_retinanet/bin/train.py:516: in main
    validation_data=validation_generator
/usr/lib/python3.7/site-packages/keras/legacy/interfaces.py:91: in wrapper
    return func(*args, **kwargs)
/usr/lib/python3.7/site-packages/keras/engine/training.py:1732: in fit_generator
    initial_epoch=initial_epoch)
/usr/lib/python3.7/site-packages/keras/engine/training_generator.py:260: in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
/usr/lib/python3.7/site-packages/keras/callbacks/callbacks.py:152: in on_epoch_end
    callback.on_epoch_end(epoch, logs)
/usr/lib/python3.7/site-packages/keras/callbacks/callbacks.py:1036: in on_epoch_end
    logs['lr'] = K.get_value(self.model.optimizer.lr)
/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:2927: in get_value
    return x.numpy()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <tf.Variable 'Adam/learning_rate:0' shape=() dtype=float32>

    def numpy(self):
      if context.executing_eagerly():
        return self.read_value().numpy()
      raise NotImplementedError(
>         "numpy() is only available when eager execution is enabled.")
E     NotImplementedError: numpy() is only available when eager execution is enabled.

/usr/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:579: NotImplementedError

So it looks like we can't just do this for everything blindly.

de-vri-es commented 5 years ago

Hmm, for some reason I'm now getting that error even without modifications, so it may not be related. Aside from that error, the test suite passes with the suggested change added only to the densenet and mobilenet tests.

de-vri-es commented 5 years ago

Ah, I see: pytest isn't running each test in a clean, isolated environment. It had already imported the densenet and mobilenet tests, which call tensorflow.compat.v1.disable_v2_behavior(), and it is that call that makes the train.py test fail.

I glanced at pytest --help but I didn't immediately see an option to fork a clean python interpreter for each test.
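
To make the leak concrete, a toy sketch of two test files collected in the same interpreter (file names hypothetical):

# tests/test_a.py -- module-level call runs at import/collection time
import tensorflow as tf

tf.compat.v1.disable_v2_behavior()

# tests/test_b.py -- collected by the same interpreter, so the flag set
# above is already in effect here, even though this file never set it
import tensorflow as tf

def test_eager_is_on():
    assert tf.executing_eagerly()  # fails: v2 behavior was disabled globally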

hgaiser commented 5 years ago

Actually, I think we had that a few commits ago with pytest-xdist. At first I thought it was causing issues, so I removed it to test in Travis, but then that removal got merged a bit prematurely.

hgaiser commented 5 years ago

I put the forking of tests back; it stops memory usage from accumulating, but it still doesn't work on Travis (it does run fine locally, for the record).

Also, on two comparable systems, one running tensorflow 2 and the other tensorflow 1.14, the times for test_mobilenet.py are quite different: 100s for tensorflow 2 versus 46s for tensorflow 1.14 =\

Disabling eager execution in tensorflow reduces the time to 70s; disabling v2 behavior gives approximately the same time (47s) as 1.14.
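
For anyone reproducing those timings, the two switches being compared are both in tf.compat.v1 and must run before any model is built; a minimal sketch:

import tensorflow as tf

# ~70s above: only eager execution off, other TF2 behavior kept.
tf.compat.v1.disable_eager_execution()

# ~47s above: full TF1 compatibility mode (implies eager off as well).
# tf.compat.v1.disable_v2_behavior()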

Borda commented 5 years ago

So what is the advantage of using TF 2.x?

hgaiser commented 4 years ago

So what is the advantage of using TF 2.x?

It's going to be the version of tensorflow many people will use, so we have to support it. The main advantages of TF 2 are eager execution (which is being annoying at the moment) and the move to tf.keras for the API.

Borda commented 4 years ago

According to the experiments in #1137, it is a kind of vicious circle: disabling v2 behaviour also disables eager execution, which is needed for https://github.com/fizyr/keras-retinanet/issues/1123#issuecomment-541875513, while enabling eager execution blows the memory and time limits... so we would need to disable v2 behaviour and replace the .numpy() calls...
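
A minimal sketch of what replacing the .numpy() call could look like — a hypothetical helper that works whether eager execution is on or off (not the actual Keras fix):

import tensorflow as tf

def get_variable_value(var):
    """Read a tf.Variable in both eager and graph mode."""
    if tf.executing_eagerly():
        return var.numpy()
    # Graph mode (e.g. after disable_v2_behavior): evaluate via the session.
    return tf.compat.v1.keras.backend.get_session().run(var)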

hgaiser commented 4 years ago

Yep you're right :)

I'm hoping https://github.com/keras-team/keras/pull/13476 will get merged soon; then we can just continue with disabling v2 behavior. That seems to me the easiest way forward.

Borda commented 4 years ago

Well, the time of being merged and the time of being released can be quite different... 8-)

hgaiser commented 4 years ago

Well, the time of being merged and the time of being released can be quite different... 8-)

Yes, but merging that now would break usage with tensorflow 2 :) At least now it should work with tensorflow 1.14/1.15.

Borda commented 4 years ago

"Should" is not very convincing. I would merge #1137 even if it's not fully fixed, since it has tests for TF 1.x and 2.x, so it is clear what passes and what does not...

hgaiser commented 4 years ago

There's no point in merging it yet without having the fix in Keras to go with it.

Borda commented 4 years ago

Let me clarify... what you expect from the Keras fix is that disable_v2_behavior will work fine with .numpy(), right? So you do not expect any other change on the side of this repo... and the mentioned PR, after dropping enable_eager_execution, should be the final solution, right? Does it also mean that until Keras is fixed you will not merge any other PR, because there is no working CI to check that everything runs correctly?

ThanhNhann commented 4 years ago

Excuse me, when I read .travis.yml I see the line travis_wait coverage run --source keras_retinanet -m py.test keras_retinanet tests --doctest-modules --forked --flake8. When I run it, it says travis_wait: command not found, and I searched for travis_wait but couldn't find it anywhere. Can you explain this to me? Thanks!

Borda commented 4 years ago

travis_wait is a Travis-only command, which does not work elsewhere... in the end it does not have much effect, so you can omit it completely :] There is a work-around in this PR - https://github.com/fizyr/keras-retinanet/pull/1137
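
Concretely, dropping the travis_wait prefix from the quoted line leaves a command that runs anywhere:

coverage run --source keras_retinanet -m py.test keras_retinanet tests --doctest-modules --forked --flake8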

ThanhNhann commented 4 years ago

Thanks for the quick reply @Borda. When I run train.py, I hit this problem:

File "keras_retinanet/bin/train.py", line 528, in <module>
  main()
File "keras_retinanet/bin/train.py", line 521, in main
  validation_steps = args.steps_for_validation,
AttributeError: 'Namespace' object has no attribute 'steps_for_validation'

I don't see steps_for_validation defined anywhere else in train.py; can you explain this? Thanks

Borda commented 4 years ago

You are probably missing the parameter steps_for_validation when you call the training script... So there may be a bug, since it should either be a mandatory parameter or have a default value...
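
For illustration, the missing definition would be an argparse entry along these lines (the flag name mirrors the attribute in the traceback; the default is an assumption):

import argparse

parser = argparse.ArgumentParser(description='Sketch of the missing train.py argument.')
parser.add_argument(
    '--steps-for-validation',
    help='Number of validation steps per epoch.',
    type=int,
    default=None,  # assumed default, so an omitted flag no longer crashes
)

# argparse maps dashes to underscores, yielding args.steps_for_validation
args = parser.parse_args(['--steps-for-validation', '10'])
assert args.steps_for_validation == 10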

ThanhNhann commented 4 years ago

When I search for the parameter steps_for_validation in train.py, it is only used, never defined. Do you think this is an omission?

pendex900x commented 4 years ago

same error here

/usr/local/lib/python3.6/dist-packages/keras/callbacks/tensorboard_v2.py:92: UserWarning: The TensorBoard callback batch_size argument (for histogram computation) is deprecated with TensorFlow 2.0. It will be ignored.
  warnings.warn('The TensorBoard callback batch_size argument '
Traceback (most recent call last):
  File "/content/keras-retinanet/keras_retinanet/bin/train.py", line 528, in <module>
    main()
  File "/content/keras-retinanet/keras_retinanet/bin/train.py", line 521, in main
    validation_steps = args.steps_for_validation,
AttributeError: 'Namespace' object has no attribute 'steps_for_validation'

hgaiser commented 4 years ago

I just reverted some commits, can you try again?

Note: You still need https://github.com/keras-team/keras/pull/13476 to pass the tests and in case you are using tensorflow 2.1, you also need https://github.com/tensorflow/tensorflow/pull/34870 (or remove Tensorboard).

SalahAdDin commented 4 years ago

What about the following warning? warnings.warn('The TensorBoard callback batch_size argument '