novakov-alexey-zz opened this issue 3 years ago
@eaplatanios I can verify that this happens whenever a convolutional layer is used. I have reproduced it with 0.6.0-SNAPSHOT. There seems to be an error caused by the graph optimizer that TensorFlow uses in the backend. What do you think?
This does indeed look related to grappler (the TF graph optimizer). Does it reproduce with version 0.6.3?
Just tried with 0.6.4. It still fails.
Still failing with version 0.6.5 (Linux) with the CIFAR model:
```scala
import java.nio.file.Paths // needed for Paths.get below

import tensorflow.data.image.CIFARLoader

val dataSet = CIFARLoader.load(Paths.get("/home/windymelt/Downloads/cifar-100-python"), CIFARLoader.CIFAR_100)

import tensorflow.api.ops.data.Data
import tensorflow.api.::

val trainImages = () => Data.datasetFromTensorSlices(dataSet.trainImages, "TrainImages").map(_.toFloat)
val trainLabels = () => Data.datasetFromTensorSlices(dataSet.trainLabels(::, 1), "TrainLabels").map(_.toLong)
val trainData = () => trainImages().zip(trainLabels())
  .repeat()
  .shuffle(10000)
  .batch(32)
  .prefetch(10)

import tensorflow.api._
import tensorflow.api.learn.layers._

val input = Input(
  FLOAT32,
  Shape(-1, dataSet.trainImages.shape(1), dataSet.trainImages.shape(2), dataSet.trainImages.shape(3))
)
val trainInput = Input(INT64, Shape(-1))

import tensorflow.api.ops.NN.SameConvPadding

val layer = //Conv2D[Float]("Layer_0/Conv2D", Shape(2, 2, 3, 16), 1, 1, SameConvPadding) >>
  AddBias[Float]("Layer_0/Bias") >>
  ReLU[Float]("Layer_0/ReLU", 0.1f) >>
  MaxPool[Float]("Layer_0/MaxPool", Seq(1, 2, 2, 1), 1, 1, SameConvPadding) >>
  // Conv2D[Float]("Layer_1/Conv2D", Shape(2, 2, 16, 32), 1, 1, SameConvPadding) >>
  AddBias[Float]("Bias_1") >>
  ReLU[Float]("Layer_1/ReLU", 0.1f) >>
  MaxPool[Float]("Layer_1/MaxPool", Seq(1, 2, 2, 1), 1, 1, SameConvPadding) >>
  Flatten[Float]("Layer_2/Flatten") >>
  Linear[Float]("Layer_2/Linear", 256) >>
  ReLU[Float]("Layer_2/ReLU", 0.1f) >>
  Linear[Float]("OutputLayer/Linear", 100)

val loss = SparseSoftmaxCrossEntropy[Float, Long, Float]("Loss/CrossEntropy") >>
  Mean[Float]("Loss/Mean") >>
  ScalarSummary[Float]("Loss/Summary", "Loss")

val optimizer = tf.train.AdaGrad(0.1f)

val model = tf.learn.Model.simpleSupervised(
  input = input,
  trainInput = trainInput,
  layer = layer,
  loss = loss,
  optimizer = optimizer)

val summariesDir = Paths.get("temp/cnn-cifar")

val estimator = tensorflow.api.learn.estimators.InMemoryEstimator(
  model,
  tensorflow.api.learn.Configuration(Some(summariesDir)),
  tensorflow.api.learn.StopCriteria(maxSteps = Some(100000)),
  Set(
    tensorflow.api.learn.hooks.LossLogger(trigger = tf.learn.StepHookTrigger(100)),
    tensorflow.api.learn.hooks.StepRateLogger(log = false, summaryDir = summariesDir, trigger = tensorflow.api.learn.hooks.StepHookTrigger(100)),
    tensorflow.api.learn.hooks.CheckpointSaver(summariesDir, tensorflow.api.learn.hooks.StepHookTrigger(1000))),
  tensorBoardConfig = tensorflow.api.config.TensorBoardConfig(summariesDir, reloadInterval = 1))

estimator.train(trainData, tensorflow.api.learn.StopCriteria(maxSteps = Some(10000)))
```
When I remove the Conv2D layers, as in the code snippet above, it runs without the SIGFPE.
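Since the crash seems tied to grappler costing the Conv2D node, another thing that might be worth trying — purely an assumption on my part, not verified — is pinning the batch dimension to a concrete value, so grappler never sees a `-1` in the convolution's input shape. This mirrors the `Input` definition from the snippet above, only replacing `-1` with the batch size the pipeline already uses:

```scala
// Untested guess: replace the unknown batch dimension (-1) with a fixed
// size (32, matching the .batch(32) above) so that every shape reaching
// grappler's cost estimator is fully static.
val input = Input(
  FLOAT32,
  Shape(32, dataSet.trainImages.shape(1), dataSet.trainImages.shape(2), dataSet.trainImages.shape(3))
)
```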
Error message follows:
```
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGFPE (0x8) at pc=0x00007f056cd2240b, pid=10315, tid=10938
#
# JRE version: OpenJDK Runtime Environment Corretto-17.0.5.8.1 (17.0.5+8) (build 17.0.5+8-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-17.0.5.8.1 (17.0.5+8-LTS, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  [libtensorflow.so.2+0xa92240b]  tensorflow::grappler::OpLevelCostEstimator::ConvolutionDimensionsFromInputs(tensorflow::TensorShapeProto const&, tensorflow::TensorShapeProto const&, tensorflow::OpInfo const&, bool*)+0x2fb
#
# Core dump will be written. Default location: Core dumps may be processed with "/bin/false" (or dumping to /home/windymelt/src/github.com/windymelt/tensorflow-scala-exercice/core.10315)
#
# An error report file with more information is saved as:
# /home/windymelt/src/github.com/windymelt/tensorflow-scala-exercice/hs_err_pid10315.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-17/issues/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
[1]    10315 IOT instruction (core dumped)  sbt run
```
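The problematic frame is grappler's op-level cost estimator deriving convolution dimensions. A SIGFPE in native code is almost always an integer division or modulo by zero, so one plausible (unverified) explanation is that the estimator divides by a stride or dimension that reaches it as zero when shapes are not fully known. A minimal sketch of the kind of arithmetic involved — the formula, the name `convOutputDim`, and the guard are my assumptions, not grappler's actual code:

```scala
// Sketch of conv output-size arithmetic of the kind a cost estimator
// performs (assumed VALID-padding formula; not grappler's actual code).
// In C++ a stride of 0 here raises SIGFPE; on the JVM the same division
// would throw ArithmeticException instead of crashing the process.
def convOutputDim(input: Long, filter: Long, stride: Long): Long = {
  require(stride > 0, s"stride must be positive, got $stride")
  (input - filter) / stride + 1
}

// 32x32 CIFAR input, 2x2 filter, stride 1 => 31
println(convOutputDim(32, 2, 1))
```

The point of the sketch is only that this code path trusts the incoming shape protos; if the Scala bindings serialize an op with an empty or zero-filled shape, the native side has nothing to guard it.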
Using the `tf.learn.Conv2D` layer with the MNIST dataset leads to a fatal error somewhere in the C++ code of the TensorFlow library.

How to reproduce

The code below leads to the error. It is based on the existing MNIST and CIFAR examples:
Version:
Error:
Observations:
The error above contains a suspicious message about the problematic frame:

An extract from the error log: