PacktPublishing / Mastering-Computer-Vision-with-TensorFlow-2.0

Mastering Computer Vision with TensorFlow 2.0, published by Packt
MIT License

Chapter 04 - InvalidArgumentError: logits and labels must be broadcastable: logits_size=[10,3] labels_size=[10,5] #14

Closed. duffjay closed this issue 3 years ago.

duffjay commented 3 years ago

I purchased the book on Amazon on Dec 8, 2020 Amazon.com order number: D01-5264632-9657808

I'm running on AWS SageMaker, kernel = conda_amazonei_tensorflow2_p36, tensorflow-gpu 2.0.3, on a p3 instance type (GPU has 16 GB). This issue looks the same as #7, but I haven't gotten that far in the book; this is Chapter 4.

I tried both the custom model and the VGG16 model; I get the same error running this code:

history = model.fit(train_generator,
                    epochs=NUM_EPOCHS,
                    steps_per_epoch=num_train_images // batchsize,
                    validation_data=val_generator,
                    validation_steps=num_val_images // batchsize)

I got this error:

InvalidArgumentError: logits and labels must be broadcastable: logits_size=[10,3] labels_size=[10,5] [[node loss/dense_2_loss/softmax_cross_entropy_with_logits (defined at /home/ec2-user/anaconda3/envs/amazonei_tensorflow2_p36/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_distributed_function_1831]
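Decoding the shapes in the message: logits_size=[10,3] is what the model produces (batch of 10, final Dense layer with 3 units), while labels_size=[10,5] is what the generator is feeding it (one-hot labels over 5 classes). A minimal NumPy sketch of the mismatch (the names and sizes here are illustrative stand-ins, not taken from the notebook):

```python
import numpy as np

batch_size = 10
logits = np.zeros((batch_size, 3))                       # final Dense layer has 3 units
labels = np.eye(5)[np.random.randint(0, 5, batch_size)]  # one-hot labels over 5 classes

# softmax cross-entropy needs the class axes to line up; here they don't
print(logits.shape, labels.shape)          # (10, 3) (10, 5)
print(logits.shape[1] == labels.shape[1])  # False -> TF raises InvalidArgumentError
```

The batch dimensions match (10 vs 10), which is why TF complains specifically about broadcastability of the class axis rather than the batch size.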

Function call stack: distributed_function

duffjay commented 3 years ago

I checked my input values to make sure it's reading the input data correctly. I inserted this code before the history = model.fit(...) line:

print(type(train_generator))
print(NUM_EPOCHS)
print(num_train_images)
print(batchsize)
print(type(val_generator))
print(num_val_images)
print(batchsize)

output:

<class 'keras_preprocessing.image.directory_iterator.DirectoryIterator'>
10
2700
10
<class 'keras_preprocessing.image.directory_iterator.DirectoryIterator'>
300
10

duffjay commented 3 years ago

I also checked the train_generator:

print(train_generator.__len__())
print(train_generator.__getitem__(3))

PARTIAL output:

403
(array([[[[ 63.68357  ,  77.843575 ,  84.94257  ],
          [ 60.5767   ,  74.7367   ,  81.8357   ],
          [ 61.6838   ,  75.8438   ,  82.9428   ],
          ...,
          [ -0.9137573,  14.246239 ,  42.345238 ],
          [  0.2521057,  15.412117 ,  43.511116 ],
          [  1.4179764,  16.57798  ,  44.67698  ]],

The val_generator output looked comparable.

So, I'm assuming my inputs are all valid.
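In hindsight, the pixel values above only confirm that images are being read; the decisive check is the width of the label arrays, which is what the error message is complaining about. A hedged sketch with NumPy stand-ins (a real DirectoryIterator batch is an (images, labels) tuple; the image size here is illustrative):

```python
import numpy as np

# stand-in for one batch from train_generator.__getitem__(3)
images = np.zeros((10, 128, 128, 3))             # looks fine, as in the output above
labels = np.eye(5)[np.random.randint(0, 5, 10)]  # one-hot; width = classes found on disk
batch = (images, labels)

# this is the number that has to equal the units of the model's final Dense layer
print(batch[1].shape[1])
```

Printing batch[1].shape[1] (here 5) next to the model's output size would have exposed the mismatch before training.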

duffjay commented 3 years ago

I also tried tensorflow-gpu versions 2.1 and 2.4. With 2.4, the error is slightly different:

~/anaconda3/envs/tensorflow2_p36/lib/python3.6/site-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

InvalidArgumentError:  logits and labels must be broadcastable: logits_size=[10,3] labels_size=[10,5]
     [[node categorical_crossentropy/softmax_cross_entropy_with_logits (defined at <ipython-input-16-6d9053c0f087>:1) ]] [Op:__inference_train_function_1491]

Function call stack:
train_function
duffjay commented 3 years ago

Please let me know your thoughts on this error. I want to leave a positive review on Amazon for you, but at the moment the examples don't work. Using AWS SageMaker is a good way to have a known, consistent environment.

KrishkarPackt commented 3 years ago

Hello - I am sorry to hear you are getting this error. When I ran this code around April 2020, it was fine; you can also see the screen video for all the code under the readme - this will be very helpful. I ran the Jupyter notebook version on my local PC, not in Amazon SageMaker. The code ran fine back in April (as I mentioned before); however, when I ran it now, it gave an error on the line x = GlobalAveragePooling2D()(x). All this was doing was taking the output of shape (None, 1000) and applying global average pooling - this is not required now - so I just commented out the line and the code ran fine. Can you try this in SageMaker? Also, in the readme there is an Errors and Additional Description section - I will post this there along with the code output from the latest run, and put the new code in GitHub.

KrishkarPackt commented 3 years ago

Hello, I accidentally ran the code with include_top = True (which is not correct); that is why I had to comment out the global average pooling. The code should be run with include_top = False (as indicated already in the GitHub repo), and then you do not have to comment anything out - it ran fine. The only change I had to make was replacing keras with tensorflow.keras. Can you make this change and try again? If it still gives you an error, please send me the code you are running so I can run it on my PC to check.
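For context on why include_top matters here: with include_top=True, VGG16 ends in a 2-D (None, 1000) prediction tensor, so global average pooling has nothing spatial to collapse; with include_top=False, it ends in a 4-D feature map that the pooling flattens before the new Dense head. A NumPy sketch of what GlobalAveragePooling2D does to the include_top=False output (shapes assume 224x224 inputs and a batch of 10):

```python
import numpy as np

# VGG16(include_top=False) on 224x224 images ends in a (7, 7, 512) feature map
features = np.random.rand(10, 7, 7, 512)

# GlobalAveragePooling2D is equivalent to a mean over the two spatial axes
pooled = features.mean(axis=(1, 2))
print(pooled.shape)   # (10, 512) -> ready to feed Dense(num_classes)
```

This is why the pooling line belongs in the include_top=False version only: applied to the (None, 1000) output of the full model, there are no spatial axes left to average.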

KrishkarPackt commented 3 years ago

I know you posted 11 days ago, but I was only notified yesterday. I wanted to make sure you are using three classes, as the example in the book describes - this was the issue in #7. If you use a different number of classes, you have to define it correctly under the class list. For a faster response, please connect with me on LinkedIn.

duffjay commented 3 years ago

Attached is my notebook. There are no material changes from your code - I just added some comments as I was following along, and I changed the path to my images. SageMaker doesn't require any coding changes. This is what I ran:

I changed keras to tensorflow.keras as you (wisely) suggested.

chapt04_classification_VGG.zip

Thanks for your help. Great book - it's definitely a worthwhile book, nice job.

KrishkarPackt commented 3 years ago

Chapter4_classification_visualization_custom_model&VGG_JD.ipynb.zip screenshot_jd

Hi, I ran your code on my laptop with an NVIDIA GPU (6 GB) - this is what I used to run all my programs - and it compiled and ran; please see the screenshot and the program. Instead of the SageMaker path, I used a local directory, with TensorFlow 2.2. I think this is not a code issue - it is a data-processing issue. Can you please check the following in your data files: 1) in your train directory, make sure there are three folders with an equal number of images in each; 2) verify there is no blank image - meaning the file name is there but the image cannot be opened; 3) verify the same in the test directory.
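Checks 1) through 3) can be scripted rather than done by eye. A hedged sketch (the helper name is mine, and the layout is assumed to be one subdirectory per class; a thorough audit would also try to decode each image, e.g. with PIL, rather than just checking for zero-byte files):

```python
import os

def audit_split(split_dir, expected_classes):
    """Count class folders and images under split_dir, and flag empty files."""
    classes = sorted(d for d in os.listdir(split_dir)
                     if os.path.isdir(os.path.join(split_dir, d)))
    report = {}
    for cls in classes:
        cls_dir = os.path.join(split_dir, cls)
        files = os.listdir(cls_dir)
        # zero-byte files: "file name is there but image cannot be opened"
        empty = [f for f in files
                 if os.path.getsize(os.path.join(cls_dir, f)) == 0]
        report[cls] = (len(files), empty)
    ok = len(classes) == expected_classes
    return ok, report
```

Something like audit_split('furniture-images/img/train', 3) should return ok=True with three equally sized classes and no flagged files if the data matches what the notebook expects.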

I have attached your code that ran in my PC - I just added path for my file. I have also attached screenshot showing it ran 10 epochs.

duffjay commented 3 years ago

Success! Thanks for your help. You are right, it is a data problem. I'm sure I was careless at some point and missed some important details. If you sell 1,000,000 copies of your book, someone else will probably have the same issue. Here is what I found when comparing your notebook output:
https://github.com/PacktPublishing/Mastering-Computer-Vision-with-TensorFlow-2.0/blob/master/Chapter04/Chapter4_classification_visualization_custom_model%26VGG.ipynb
and my output.

a) Make sure the image size defined at the top of the notebook corresponds to the model choice (custom / VGG) later in the notebook. I was careless on that detail.
b) I was not able to compile the model on my laptop (Dell Inspiron i7 w/ 1060, 6 GB) - not enough memory.
c) It ran fine on AWS SageMaker, TF 2.1.3 (2020.12.30).
d) The real problem, as you correctly identified, was the data. I probably missed something here, but let me point this out in case someone else makes the same mistake.

Download the zip file from Kaggle

ls furniture-images/img/train/bed | wc
    900     900   11700
ls img/train/bed | wc
    900     900   11700

It seemed like the data was there twice. I went with furniture-images/img and deleted the redundant directory.

Here's the important part.

ls furniture-images/img/train
bed  chair  sofa  swivelchair  table

You'll see five (5) classes - your notebook had 3.
In your notebook, your train_generator:
Found 1352 images belonging to 3 classes.

AFTER DELETING two (2) classes (that is, the directories swivelchair & table), my notebook output was: Found 2700 images belonging to 3 classes.

(Delete the two classes/directories from val as well.) So the dataset changed: 3 classes vs 5 classes, 1352 images vs 2700 images (the image count wasn't a problem). Then everything worked fine. Thanks for your help.
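For anyone else hitting this, the pruning above can be done in one pass from the shell. The paths mirror the Kaggle layout, but the sketch below builds a mock tree in a temp directory so it is safe to run as-is; substitute the real furniture-images path when applying it for real:

```shell
root=$(mktemp -d)                 # stand-in for the real furniture-images download
for split in train val; do
  for cls in bed chair sofa swivelchair table; do
    mkdir -p "$root/img/$split/$cls"
  done
done

# keep only the three classes the notebook expects
for split in train val; do
  rm -rf "$root/img/$split/swivelchair" "$root/img/$split/table"
done

ls "$root/img/train"              # bed  chair  sofa
```

Alternatively, Keras's flow_from_directory accepts a classes=['bed', 'chair', 'sofa'] argument, which restricts the generator to those subdirectories without deleting anything from disk.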

duffjay commented 3 years ago

There is no test data either. https://www.kaggle.com/akkithetechie/furniture-detector

duffjay commented 3 years ago

So now the original error message makes more sense. Your model had 3 classes but the data had 5 classes, hence the mismatch in the arrays: logits_size=[10,3] labels_size=[10,5]

Attached is a zipped notebook that works with all 5 classes - pretty simple changes (but terrible Python programming): chapt4_classification.zip

(and there was no test data so I just used some validation images)
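One small guard that avoids this whole class of bug is to derive the class count from the directory layout instead of hard-coding it. A sketch (the directory here is a mock standing in for the Kaggle download, and the num_classes variable name is mine):

```python
import os
import tempfile

# mock the Kaggle layout: five class folders under train/
root = tempfile.mkdtemp()
for cls in ["bed", "chair", "sofa", "swivelchair", "table"]:
    os.makedirs(os.path.join(root, "train", cls))

class_names = sorted(os.listdir(os.path.join(root, "train")))
num_classes = len(class_names)
print(num_classes)   # 5 -> feed into Dense(num_classes, activation='softmax')
```

With the final layer defined as Dense(num_classes, ...), adding or removing a class directory can never produce the [10,3] vs [10,5] mismatch, because the model head always tracks the data.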