cvlab-epfl / tf-lift

Tensorflow port of LIFT (ECCV 2016), with training code.

Troubles while testing/training #16

Closed DaddyWesker closed 5 years ago

DaddyWesker commented 6 years ago

Hello! I've run into some problems while testing and training the descriptors.

Trying to train, I get the following problem: https://pp.userapi.com/c824600/v824600054/12566e/u62K0nXm3s0.jpg

Trying to test using the pre-trained models from the readme's link: https://pp.userapi.com/c824600/v824600054/125682/ZlHsuOpSbtM.jpg (yes, that file exists neither in modules nor in the pre-trained model)

So, am I doing something wrong, or is it just some sort of compatibility issue? I'm using Win 10 x64. Thanks in advance!

etrulls commented 6 years ago

Hi Daddy,

For the first problem: we assume that the user provides some files generated from an SfM reconstruction (or something similar). It's not properly documented, but it's easy enough to understand from the code. The file in question is just a histogram of the SIFT scales from the training data. You'll encounter more of these.

For the second problem: that's from an old variant of the descriptor that we don't use anymore. I just commented out the lines and pushed.

Best, E.

DaddyWesker commented 6 years ago

So, I believe the second problem is fixed now?

About the first problem: where can I find these "SfM reconstruction generated files"?

Thanks in advance

etrulls commented 6 years ago

Problem 2: Yes.

Problem 1: It's not as direct as that. Basically, we use SfM to obtain a list of patches (SIFT keypoints) that correspond across different images, plus some additional information such as the scale histogram you saw, and probably one or two more things I can't remember right now. datasets/eccv2016/eccv.py then processes this information into dumps which are used for training. Most of it is self-explanatory, but it's provided without proper documentation for now.
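
For illustration, here's a minimal sketch of how such a scale histogram could be produced; the file name, HDF5 keys, and bin count are assumptions for illustration only, not the repo's actual format:

    # Hypothetical sketch: collect SIFT keypoint scales over the training
    # images and store them as a histogram for the pipeline to load.
    import glob

    import cv2
    import h5py
    import numpy as np

    sift = cv2.SIFT_create()
    scales = []
    for fname in glob.glob("train_images/*.jpg"):
        img = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
        keypoints = sift.detect(img, None)
        scales += [kp.size for kp in keypoints]  # kp.size ~ keypoint scale

    hist, bin_edges = np.histogram(scales, bins=100)
    with h5py.File("scale-hist.h5", "w") as f:  # file name is a guess
        f["histogram_counts"] = hist
        f["histogram_bins"] = bin_edges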

Best, E.

DaddyWesker commented 6 years ago

Okay, so do you have these files, and can you send them? Or are they too big to upload? Or maybe there is some sort of instruction somewhere? Or is that still "to be done" right now?

I'll try to run the pre-trained models and see what happens now, thanks.

DaddyWesker commented 6 years ago

About testing: now I'm getting the following problem: https://pp.userapi.com/c834201/v834201267/124c79/gOK03l_xJsg.jpg (is it okay that I'm sending you pics instead of code?)

kmyi commented 6 years ago

Which version of TensorFlow are you using?

DaddyWesker commented 6 years ago

1.5.0

kmyi commented 6 years ago

I have no clear idea then. I highly doubt it, but could it really be a Windows thing? Can you give me the exact command line to reproduce this?

DaddyWesker commented 6 years ago

I tried the following: python main.py --task=test --subtask=kp --logdir=logs/test --test_img_file=DSC_0275.jpg --test_out_file=image1_kp.txt (actually, the same thing as in the readme but with another input file). Then I read the "pre-trained models" part again and realized maybe I need to add --use_batch_norm=False --mean_std_type=dataset to the end of this line, but that gave me another error:

https://pp.userapi.com/c845221/v845221218/3c490/-tiXNUb5EUg.jpg

I guess that means the dataset hasn't loaded. Where do I need to put it? My folder looks like this right now: https://pp.userapi.com/c845221/v845221218/3c4bc/Bcu0N6l9dS8.jpg

So I've just unrar'ed both folders into the main folder. Is that right?

etrulls commented 6 years ago

We instantiate the dataset class as a convenience to load up the mean/std from the log files, IIRC, but there's no explicit dataset for testing. Your command works for me. The only weird thing I see, looking at the screenshots, is that your logdir should be e.g. release-no-aug, which is the folder that contains the models. But maybe you moved it there, as there isn't a logs/test in your second screenshot? I don't know.

DaddyWesker commented 6 years ago

So, it has to be like this? python main.py --task=test --subtask=kp --logdir=release-no-aug --test_img_file=DSC_0275.jpg --test_out_file=image1_kp.txt --use_batch_norm=False --mean_std_type=dataset

Using this line, I'm getting the following: https://pp.userapi.com/c840332/v840332519/7f36f/fT0iR6YItYw.jpg

etrulls commented 6 years ago

Yeah, that command looks OK. We saw that error yesterday and we have no idea what it means, as IIRC that's before loading the weights? Could it be a Windows problem with naming variables?

You should get something like this:

network/lift/kp/conv-ghh-1/weights:0 (float32_ref 25x25x1x16) [10000, bytes: 40000]
network/lift/kp/conv-ghh-1/biases:0 (float32_ref 16) [16, bytes: 64]
network/lift/desc/conv-act-pool-norm-1/weights:0 (float32_ref 7x7x1x32) [1568, bytes: 6272]
network/lift/desc/conv-act-pool-norm-1/biases:0 (float32_ref 32) [32, bytes: 128]
network/lift/desc/conv-act-pool-norm-2/weights:0 (float32_ref 6x6x32x64) [73728, bytes: 294912]
network/lift/desc/conv-act-pool-norm-2/biases:0 (float32_ref 64) [64, bytes: 256]
network/lift/desc/conv-act-pool-3/weights:0 (float32_ref 5x5x64x128) [204800, bytes: 819200]
network/lift/desc/conv-act-pool-3/biases:0 (float32_ref 128) [128, bytes: 512]
Total size of variables: 290336
Total bytes of variables: 1161344

DaddyWesker commented 6 years ago

Well, I'm planning to try running your code on Ubuntu, probably on Friday. We'll see if that helps.

qasimmehboob commented 6 years ago

I am getting the same error on Ubuntu 16.04

[screenshot]

etrulls commented 6 years ago

I just downloaded the repo from scratch, ran the same command with tensorflow 1.4 (which is still what I'm using for this project) and 1.5, and it works without a hitch for both of them.

It seems that it's crashing when creating the TF saver, here: https://github.com/cvlab-epfl/tf-lift/blob/5530e47dcbb13f1d5642477834c8105c26957f25/tester.py#L87-L88, but I can't see why. Can you put a breakpoint in there and try to figure out what's going on?
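
For example, a rough sketch of what could be checked at that breakpoint (reusing the names from the tester.py lines linked above):

    # Drop into pdb just before the Saver is created (tester.py, ~line 87)
    # and look for duplicated variable names in the parameter list.
    import pdb; pdb.set_trace()
    names = [v.name for v in self.network.allparams[_key]]
    duplicates = {n for n in names if names.count(n) > 1}
    print(duplicates)  # non-empty means tf.train.Saver will raise the error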

BorisMansencal commented 6 years ago

I am on Ubuntu 16.04.4 with tensorflow 1.6.0, CUDA 9.0 and CUDNN 7.0.5, and I have the same error:

Traceback (most recent call last):
  File "tf-lift/main.py", line 108, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "tf-lift/main.py", line 77, in main
    task = Tester(config, rng)
  File "/home/mansenca/tf-lift/tester.py", line 88, in __init__
    self.saver[_key] = tf.train.Saver(self.network.allparams[_key])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1293, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1302, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1339, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 774, in _build_internal
    saveables = self._ValidateAndSliceInputs(names_to_saveables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 627, in _ValidateAndSliceInputs
    names_to_saveables = BaseSaverBuilder.OpListToDict(names_to_saveables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 592, in OpListToDict
    name)
ValueError: At least two variables have the same name: network/lift/kp/conv-ghh-1/biases

In tester.py, line 88, with _key="joint", self.network.allparams[_key] is:

[<tf.Variable 'network/lift/kp/conv-ghh-1/weights:0' shape=(25, 25, 1, 16) dtype=float32_ref>,
 <tf.Variable 'network/lift/kp/conv-ghh-1/biases:0' shape=(16,) dtype=float32_ref>,
 <tf.Variable 'network/lift/kp/conv-ghh-1/weights:0' shape=(25, 25, 1, 16) dtype=float32_ref>,
 <tf.Variable 'network/lift/kp/conv-ghh-1/biases:0' shape=(16,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-1/weights:0' shape=(7, 7, 1, 32) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-1/biases:0' shape=(32,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-1/batch_normalization/gamma:0' shape=(32,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-1/batch_normalization/beta:0' shape=(32,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-1/batch_normalization/moving_mean:0' shape=(32,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-1/batch_normalization/moving_variance:0' shape=(32,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-2/weights:0' shape=(6, 6, 32, 64) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-2/biases:0' shape=(64,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-2/batch_normalization/gamma:0' shape=(64,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-2/batch_normalization/beta:0' shape=(64,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-2/batch_normalization/moving_mean:0' shape=(64,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-norm-2/batch_normalization/moving_variance:0' shape=(64,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-3/weights:0' shape=(5, 5, 64, 128) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-3/biases:0' shape=(128,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-3/batch_normalization/gamma:0' shape=(128,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-3/batch_normalization/beta:0' shape=(128,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-3/batch_normalization/moving_mean:0' shape=(128,) dtype=float32_ref>,
 <tf.Variable 'network/lift/desc/conv-act-pool-3/batch_normalization/moving_variance:0' shape=(128,) dtype=float32_ref>]

TensorFlow 1.6 seems to complain because the variables network/lift/kp/conv-ghh-1/biases and network/lift/kp/conv-ghh-1/weights are present twice?

etrulls commented 6 years ago

Yes, that's exactly what's going on, but I don't see that on my end, and I'm not sure how it could happen...

BorisMansencal commented 6 years ago

Did you try with tensorflow 1.6.0?

Chenhhui commented 6 years ago

I got the same error "ValueError: At least two variables have the same name: network/lift/kp/conv-ghh-1/biases" on Ubuntu 16.04, CUDA 9.0, tensorflow 1.8.0. Then I changed tensorflow 1.8.0 to tensorflow 1.5.0 and still get the same error. Thank you for your help!

BorisMansencal commented 6 years ago

FYI, on Ubuntu 16.04.4 with tensorflow 1.4.0, CUDA 8.0 and CUDNN 6.0.21, it seems to work correctly.

etrulls commented 6 years ago

Mmm! We are indeed using CUDA 8, as we haven't installed 9 yet. Could it be a bug with certain combinations of TF/CUDA?

nrupatunga commented 6 years ago

Any idea why I get this error when I run the following command: python main.py --task=test --subtask=ori --logdir=logs/test --test_img_file=image1.jpg --test_out_file=image1_ori.txt --test_kp_file=image1_kp.txt



Traceback (most recent call last):
  File "main.py", line 102, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 73, in main
    task.run()
  File "/home/whodat/tf-lift/tester.py", line 109, in run
    restore_res = restore_network(self, subtask)
  File "/home/whodat/tf-lift/utils/dump.py", line 150, in restore_network
    is_loaded += load_network(supervisor, subtask, predir)
  File "/home/whodat/tf-lift/utils/dump.py", line 169, in load_network
    latest_checkpoint
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1666, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key network/lift/ori/conv-act-pool-1/batch_normalization/moving_mean not found in checkpoint
     [[Node: save_1/RestoreV2_2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2_2/tensor_names, save_1/RestoreV2_2/shape_and_slices)]]
     [[Node: save_1/RestoreV2_2/_51 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_114_save_1/RestoreV2_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

Caused by op 'save_1/RestoreV2_2', defined at:
  File "main.py", line 102, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 70, in main
    task = Tester(config, rng)
  File "/home/whodat/tf-lift/tester.py", line 88, in __init__
    self.saver[_key] = tf.train.Saver(self.network.allparams[_key])
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1218, in __init__
    self.build()
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1227, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1263, in _build
    build_save=build_save, build_restore=build_restore)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 751, in _build_internal
    restore_sequentially, reshape)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 427, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 267, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/ops/gen_io_ops.py", line 1021, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/whodat/tf-lift/tf-lift/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key network/lift/ori/conv-act-pool-1/batch_normalization/moving_mean not found in checkpoint
     [[Node: save_1/RestoreV2_2 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save_1/Const_0_0, save_1/RestoreV2_2/tensor_names, save_1/RestoreV2_2/shape_and_slices)]]
     [[Node: save_1/RestoreV2_2/_51 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_114_save_1/RestoreV2_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

nrupatunga commented 6 years ago

The above error is solved by adding the options --use_batch_norm=False --mean_std_type=dataset:

python main.py --task=test --subtask=ori --logdir=logs/test --test_img_file=image1.jpg --test_out_file=image1_ori.txt --test_kp_file=image1_kp.txt --use_batch_norm=False --mean_std_type=dataset

luczeng commented 6 years ago

Got the same issue. Is the only solution to go back to tensorflow 1.4? (Which means going back to CUDA 8, I guess.)

kmyi commented 5 years ago

Yes, for now the solution is to stick to TF 1.4. We don't have time to update this repo to support newer versions.

wenwenzju commented 5 years ago

I think I've solved the problem "ValueError: At least two variables have the same name: network/lift/kp/conv-ghh-1/biases". In lift.py, at lines 483 and 492, the kp module is built twice (maybe a bug? I don't know why). So in _build_module, self.allparams["kp"] is appended to self.allparams["joint"] twice, which causes the above problem. The following is my solution, which I use with tf 1.8.0:

    for modu in self.allparams[module]:
        if modu not in self.allparams["joint"]:
            self.allparams["joint"].append(modu)

kmyi commented 5 years ago

@wenwenzju Sounds like it could indeed be a solution. Could you make a Pull Request for that maybe?

punisher220 commented 4 years ago

Excuse me, @wenwenzju, I have a small question: at which line should I add the code:

    for modu in self.allparams[module]:
        if modu not in self.allparams["joint"]:
            self.allparams["joint"].append(modu)

I tried some locations, but the error still occurred. Thank you.

json87 commented 4 years ago

@kmyid I have revised these lines according to the tips mentioned by @wenwenzju, and solved the bug for tf 1.14 with CUDA 10.1 on Windows.

network/lift.py, line 648

            if is_first:
                self.params[module] = tf.get_collection(
                    tf.GraphKeys.TRAINABLE_VARIABLES, scope=sc.name)
                self.allparams[module] = tf.get_collection(
                    tf.GraphKeys.GLOBAL_VARIABLES, scope=sc.name)
                # Also append to the global list

                # Edit by San Jiang (jiangsan@cug.edu.cn)
                for modu in self.params[module]:
                    if modu not in self.params["joint"]:
                        self.params["joint"].append(modu)
                for modu in self.allparams[module]:
                    if modu not in self.allparams["joint"]:
                        self.allparams["joint"].append(modu)

                # Mark that it is initialized
                is_first = False

punisher220 commented 4 years ago

@json87 Thanks for your help. But I made the same modification with tf 1.15 and CUDA 10.0 on Windows 10, and it still failed.

json87 commented 4 years ago

@punisher220 I have passed the first step with the above-mentioned modification: python main.py --task=test --subtask=kp --logdir=logs/test --test_img_file=image1.jpg --test_out_file=image1_kp.txt

The result is shown below: [screenshot]

However, I still cannot pass the second step: python main.py --task=test --subtask=ori --logdir=logs/test --test_img_file=image1.jpg --test_out_file=image1_ori.txt --test_kp_file=image1_kp.txt

punisher220 commented 4 years ago

@json87 When I succeeded in testing with tf 1.4 and CUDA 8.0 on Ubuntu 18.04, I used the following commands:

python main.py --task=test --subtask=kp --logdir=release-no-aug --test_img_file=Test/ucsb1.jpg --test_out_file=Test/ucsb1_kp.txt --use_batch_norm=False

python main.py --task=test --subtask=ori --logdir=release-no-aug --test_img_file=Test/ucsb1.jpg --test_out_file=Test/ucsb1_ori.txt --test_kp_file=Test/ucsb1_kp.txt --use_batch_norm=False

python main.py --task=test --subtask=desc --logdir=release-no-aug --test_img_file=Test/ucsb1.jpg --test_out_file=Test/ucsb1_desc.h5 --test_kp_file=Test/ucsb1_ori.txt --use_batch_norm=False

At present, I am trying to get the training and testing parts working on another device with Windows 10, tf 1.15, and CUDA 10.0, but I am stuck.

punisher220 commented 4 years ago

@json87 I used your modification and succeeded in training the desc part with Windows 10, tf 1.13.1, and CUDA 10.0. But I have not tried training the ori and kp parts yet, and I also have not tried the test part.

punisher220 commented 4 years ago

@json87 I have run into a problem when training the ORI subtask, after finishing the DESC part with your modifications, under Windows 10, tf 1.13.1, and CUDA 10.0. It is as follows:

Subtask = ori: 0%| | 0/100000000 [00:00<?, ?it/s]
2020-08-07 08:35:51.485198: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library cublas64_100.dll locally
2020-08-07 08:35:51.661068: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-08-07 08:35:51.661270: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-08-07 08:35:51.672628: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-08-07 08:35:51.672777: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-08-07 08:35:51.684399: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-08-07 08:35:51.684580: E tensorflow/stream_executor/cuda/cuda_blas.cc:510] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-08-07 08:35:51.684777: W tensorflow/stream_executor/stream.cc:2130] attempting to perform BLAS operation using StreamExecutor without BLAS support
2020-08-07 08:35:51.684778: W tensorflow/stream_executor/stream.cc:2130] attempting to perform BLAS operation using StreamExecutor without BLAS support

Traceback (most recent call last):
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[128,2,3], b.shape=[128,3,4096], m=2, n=4096, k=3, batch_size=128
     [[{{node network/lift/crop_1/SpatialTransformer/_transform/MatMul}}]]
     [[{{node loss/desc-pair/Sqrt}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 98, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 70, in main
    task.run()
  File "D:\Host\Available_Code\trainer.py", line 142, in run
    cur_loss = self.network.forward(subtask, cur_data)
  File "D:\Host\Available_Code\networks\lift.py", line 182, in forward
    res = self.sess.run(fetch, feed_dict=feed_dict)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas xGEMMBatched launch failed : a.shape=[128,2,3], b.shape=[128,3,4096], m=2, n=4096, k=3, batch_size=128
     [[node network/lift/crop_1/SpatialTransformer/_transform/MatMul (defined at D:\Host\Available_Code\modules\spatial_transformer.py:176) ]]
     [[node loss/desc-pair/Sqrt (defined at D:\Host\Available_Code\losses.py:112) ]]

Caused by op 'network/lift/crop_1/SpatialTransformer/_transform/MatMul', defined at:
  File "main.py", line 98, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
    _sys.exit(main(argv))
  File "main.py", line 59, in main
    task = Trainer(config, rng)
  File "D:\Host\Available_Code\trainer.py", line 72, in __init__
    self.network = Network(self.sess, self.config, self.dataset)
  File "D:\Host\Available_Code\networks\lift.py", line 138, in __init__
    self._build_network()
  File "D:\Host\Available_Code\networks\lift.py", line 523, in _build_network
    float(get_patch_size(self.config)),
  File "D:\Host\Available_Code\networks\lift.py", line 708, in _build_st
    out_size=(out_size, out_size),
  File "D:\Host\Available_Code\modules\spatial_transformer.py", line 191, in transformer
    output = _transform(theta, U, out_size)
  File "D:\Host\Available_Code\modules\spatial_transformer.py", line 176, in _transform
    T_g = tf.matmul(theta, grid)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2417, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 1483, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
    op_def=op_def)
  File "C:\Users\XXX\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[128,2,3], b.shape=[128,3,4096], m=2, n=4096, k=3, batch_size=128
     [[node network/lift/crop_1/SpatialTransformer/_transform/MatMul (defined at D:\Host\Available_Code\modules\spatial_transformer.py:176) ]]
     [[node loss/desc-pair/Sqrt (defined at D:\Host\Available_Code\losses.py:112) ]]

It seems the remaining errors lie in the ORI subtask part, but it is complicated to get it running.

punisher220 commented 4 years ago

Well, finally, I found out that the problem means there is not enough available GPU memory, because other code was running and taking up the GPU. (I could also decrease the batch size so the run consumes less GPU memory, but I have not tried that yet.)
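
For reference, a common TF 1.x mitigation for this kind of cuBLAS allocation failure is to let TensorFlow allocate GPU memory on demand; this is a generic sketch, not an option tf-lift itself exposes:

    # Grow GPU memory on demand instead of pre-allocating all of it, which
    # helps when another process already occupies part of the GPU.
    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)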

I tested again when the device was free, and @json87 your modifications work fine through the whole process, including training and testing. There are some warnings, but the results seem fine. Thanks for your advice! FYI, it works with CUDA 10.0, cuDNN 7.6.0, and tf 1.13.1 on Windows 10.