aws-samples / sagemaker-101-workshop

Hands-on demonstrations for data scientists exploring Amazon SageMaker

MNIST model CPU training broken in TF v2.7 (conda_tensorflow2_p38 kernel on NBI ALv2 JLv3) #28

Open athewsey opened 2 years ago

athewsey commented 2 years ago

The current conda_tensorflow2_p38 kernel on the latest SageMaker Notebook Instance platform (notebook-al2-v2, as used in the CFn template) seems to break local CPU-only training for the MNIST migration challenge.

In this environment (TF v2.7.1, TF.Keras v2.7.0), tensorflow.keras.backend.image_data_format() returns channels_first, but training fails because MaxPoolingOp supports only channels_last (NHWC) on CPU, per the error message below:

```
InvalidArgumentError:  Default MaxPoolingOp only supports NHWC on device type CPU
     [[node sequential/max_pooling2d/MaxPool
 (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/layers/pooling.py:357)
]] [Op:__inference_train_function_862]

Errors may have originated from an input operation.
Input Source operations connected to node sequential/max_pooling2d/MaxPool:
In[0] sequential/conv2d_1/Relu (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/backend.py:4867)
```
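For reference, the layout check that drives the notebook's preprocessing boils down to a single backend call (a minimal sketch; the exact notebook code may differ):

```python
import tensorflow as tf

# The notebook branches on this value when reshaping the MNIST arrays;
# in the affected conda_tensorflow2_p38 kernel it reports "channels_first".
fmt = tf.keras.backend.image_data_format()
print(fmt)
```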

Overriding the image_data_format() check (in "Pre-Process the Data for our CNN") to prepare the data in a different shape does not work either, because the model is then incompatible with the data (a ValueError is raised in conv2d_2).
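A workaround that does seem consistent (a sketch only, not verified against the full notebook) is to force channels_last globally before the model is built, so that the data prep and the conv/pooling layers agree on NHWC:

```python
import numpy as np
import tensorflow as tf

# Force NHWC everywhere; this must run before the model is constructed.
tf.keras.backend.set_image_data_format("channels_last")

# MNIST arrays arrive as (N, 28, 28); add a trailing channel dimension.
x = np.zeros((4, 28, 28), dtype="float32")
x_nhwc = np.expand_dims(x, axis=-1)  # shape (4, 28, 28, 1)

# Toy stand-in for the workshop CNN (layer sizes are illustrative only).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),  # NHWC pooling is supported on CPU
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
preds = model.predict(x_nhwc)  # forward pass succeeds on CPU
print(preds.shape)
```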

Training still seems to work fine in the current SMStudio kernel (TensorFlow v2.3.2, TF.Keras v2.4.0).