ehw-fit / nascaps

A Framework for Neural Architecture Search to Optimize the Accuracy and Hardware Efficiency of Convolutional Capsule Networks
MIT License

Training of generated model (from the generated gene) gives tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed #4

Open amew0 opened 1 year ago

amew0 commented 1 year ago

I have been going through the NASCaps implementation to understand how the algorithm searches for architectures, and I followed the README.md to run "main.py" with the arguments it lists. I ran into the issue described below. Once a gene is created and the corresponding CapsNet model is built, training the model to evaluate the population (evaluate_population > wrap_train_test > train) fails with the following error:

File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
  File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/ak11263/nascaps/nsga/main.py", line 893, in train
    callbacks=[timeout_call, log, checkpoint, lr_decay])
  File "/home/ak11263/nascaps/nsga/main.py", line 652, in wrap_train_test
    runid, _ = train(model=model, data=((x_train_current, y_train), (x_test_current, y_test)), args=args)
  File "/home/ak11263/nascaps/nsga/main.py", line 525, in evaluate_population
    p["runid"], train_acc = wrap_train_test(p["gene"])
  File "/home/ak11263/nascaps/nsga/main.py", line 711, in run_NSGA2
    evaluate_population(parent)
  File "/home/ak11263/nascaps/nsga/main.py", line 1065, in <module>
    rets = run_NSGA2(metrics=["accuracy_drop", "energy", "memory", "latency"], inshape=inshape, p_size=args.population, q_size=args.offsprings, generations=args.generations)
  File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ak11263/miniconda3/envs/tf-1.13-gpu/lib/python3.7/runpy.py", line 193, in _run_module_as_main (Current frame)
    "__main__", mod_spec)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(100, 160), b.shape=(160, 784), m=100, n=784, k=160
     [[{{node decoder/dense_1/MatMul}}]]
     [[{{node loss/decoder_loss/Mean_3}}]]
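
The shapes reported in the error are consistent with each other, which suggests the failure is not a dimension mismatch. A minimal sketch of that check in plain Python follows; the helper name `gemm_dims` and the float32 element size are assumptions made for illustration and are not part of the NASCaps code.

```python
def gemm_dims(a_shape, b_shape):
    """Return (m, n, k) for a @ b, or raise if the inner dimensions differ."""
    m, k = a_shape
    k2, n = b_shape
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    return m, n, k


# Shapes copied from the error message.
a_shape, b_shape = (100, 160), (160, 784)
m, n, k = gemm_dims(a_shape, b_shape)

# Assumption: float32 tensors, i.e. 4 bytes per element.
BYTES_PER_ELEMENT = 4
total_bytes = (m * k + k * n + m * n) * BYTES_PER_ELEMENT

print(m, n, k)       # matches m=100, n=784, k=160 from the error
print(total_bytes)   # combined size of both inputs and the output
```

Under that float32 assumption the two inputs and the output together take less than 1 MB, so this single multiplication is small in itself.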

After disabling (commenting out) the training and testing of the generated model and replacing them with a dummy model that produces a random test_acc, the program runs successfully.
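
For reference, a dummy replacement of this kind could look roughly like the sketch below. The function name `dummy_wrap_train_test`, the run id format, and the example genes are hypothetical; only the `(runid, accuracy)` return pair mirrors the call `wrap_train_test(p["gene"])` visible in the traceback.

```python
import random
import uuid


def dummy_wrap_train_test(gene, rng=random):
    """Stand-in for training: return a run id and a random accuracy in [0, 1)."""
    runid = uuid.uuid4().hex  # hypothetical run identifier format
    test_acc = rng.random()   # random value instead of a measured accuracy
    return runid, test_acc


# Hypothetical population entries; real genes come from the search itself.
population = [{"gene": [1, 2, 3]}, {"gene": [4, 5, 6]}]
for p in population:
    p["runid"], p["test_acc"] = dummy_wrap_train_test(p["gene"])
```

Because no model is built or trained here, a successful run with such a stub only shows that the search loop works; it says nothing about the GPU code path.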

Searching online, I found suggestions that the use of TensorFlow v1 is causing the issue (I also get plenty of deprecation warnings).

I have also started migrating the project to TensorFlow 2, although not very successfully.

I would be grateful for any suggestions.

mrazekv commented 1 year ago

There was an issue in TensorFlow and in a different implementation of CapsNets (and the matmul function). You should use TensorFlow 1.13 (as specified in environment.yml) or rewrite the capsule layers for the newest version.
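
One quick way to confirm that the running interpreter actually uses the expected release is to compare version strings. The sketch below is only an illustration; the helper `matches_release` is a hypothetical name, and the "1.13" default is taken from the version mentioned above.

```python
def matches_release(installed, required="1.13"):
    """True if `installed` (e.g. "1.13.1") belongs to the `required` major.minor release."""
    return installed.split(".")[:2] == required.split(".")[:2]


print(matches_release("1.13.1"))  # a 1.13.x patch release
print(matches_release("2.4.0"))   # a different major version
```

Inside the conda environment, the value to pass would be `tf.__version__` after `import tensorflow as tf`.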

amew0 commented 1 year ago

Yes, I first installed all the dependencies following the README.md file, so TensorFlow v1.13 is installed.