eaplatanios / tensorflow_scala

TensorFlow API for the Scala Programming Language
http://platanios.org/tensorflow_scala/
Apache License 2.0

Error while training Estimators on CIFAR #133

Closed mandar2812 closed 5 years ago

mandar2812 commented 5 years ago

After updating the DynaML code base to be compatible with tf-0.3.0 (see branch tf-0.3.0), the cifar.sc example throws errors.

2018-10-17 15:50:31.729 [main] INFO  CIFAR Data Loader - Extracting data from file '/Users/mandar/tmp/cifar-10-binary.tar.gz'.
2018-10-17 15:50:33.580550: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
2018-10-17 15:50:35.981 [main] INFO  CIFAR Data Loader - Finished loading the CIFAR-10 dataset.
Building the model.

Training model.

org.platanios.tensorflow.jni.InvalidArgumentException: 4 errors while building NodeDef 'Estimator/Train/Model/Gradients/Concatenate_1Gradient/ConcatenateOffset' using Op<name=ConcatOffset; signature=concat_dim:int32, shape:N*int32 -> offset:N*int32; attr=N:int,min=2>:
Input 'shape' passed int64 expected int32
Input 'shape' passed int64 expected int32
Input 'shape' passed int64 expected int32
Input 'shape' passed int64 expected int32
  org.platanios.tensorflow.jni.Op$.finish(Native Method)
  org.platanios.tensorflow.api.ops.Op$Builder$$anonfun$build$1.apply(Op.scala:1406)
  org.platanios.tensorflow.api.ops.Op$Builder$$anonfun$build$1.apply(Op.scala:1370)
  org.platanios.tensorflow.api.utilities.package$.using(package.scala:31)
  org.platanios.tensorflow.api.ops.Op$Builder.build(Op.scala:1370)
  org.platanios.tensorflow.api.ops.Basic$class.concatenateOffset(Basic.scala:577)
  org.platanios.tensorflow.api.ops.Basic$.concatenateOffset(Basic.scala:1749)
  org.platanios.tensorflow.api.ops.Basic$Gradients$.org$platanios$tensorflow$api$ops$Basic$Gradients$$concatenateGradient(Basic.scala:2276)
  org.platanios.tensorflow.api.ops.Basic$Gradients$$anonfun$22.apply(Basic.scala:2175)
  org.platanios.tensorflow.api.ops.Basic$Gradients$$anonfun$22.apply(Basic.scala:2175)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$2$$anonfun$9.apply(Gradients.scala:134)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$2$$anonfun$9.apply(Gradients.scala:134)
  org.platanios.tensorflow.api.ops.Gradients$.org$platanios$tensorflow$api$ops$Gradients$$maybeCompile(Gradients.scala:271)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$2.apply$mcV$sp(Gradients.scala:134)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$2.apply(Gradients.scala:131)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1$$anonfun$apply$mcV$sp$1$$anonfun$apply$mcV$sp$2.apply(Gradients.scala:131)
  scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  org.platanios.tensorflow.api.ops.Op$.createWith(Op.scala:869)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Gradients.scala:131)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1$$anonfun$apply$mcV$sp$1.apply(Gradients.scala:99)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1$$anonfun$apply$mcV$sp$1.apply(Gradients.scala:99)
  org.platanios.tensorflow.api.ops.Gradients$.org$platanios$tensorflow$api$ops$Gradients$$maybeColocateWith(Gradients.scala:235)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1.apply$mcV$sp(Gradients.scala:99)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1.apply(Gradients.scala:58)
  org.platanios.tensorflow.api.ops.Gradients$$anonfun$gradients$1.apply(Gradients.scala:58)
  scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  org.platanios.tensorflow.api.ops.Op$.createWithNameScope(Op.scala:892)
  org.platanios.tensorflow.api.ops.Gradients$.gradients(Gradients.scala:58)
  org.platanios.tensorflow.api.ops.training.optimizers.Optimizer$class.computeGradients(Optimizer.scala:122)
  org.platanios.tensorflow.api.ops.training.optimizers.Adam.computeGradients(Adam.scala:72)
  org.platanios.tensorflow.api.learn.SimpleSupervisedTrainableModel.buildTrainOps(Model.scala:415)
  org.platanios.tensorflow.api.learn.SimpleSupervisedTrainableModel.buildTrainOps(Model.scala:391)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator$$anonfun$trainWithHooks$1$$anonfun$apply$mcV$sp$1$$anonfun$8.apply(FileBasedEstimator.scala:144)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator$$anonfun$trainWithHooks$1$$anonfun$apply$mcV$sp$1$$anonfun$8.apply(FileBasedEstimator.scala:144)
  scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  org.platanios.tensorflow.api.ops.Op$.createWithNameScope(Op.scala:897)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator$$anonfun$trainWithHooks$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(FileBasedEstimator.scala:144)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator$$anonfun$trainWithHooks$1$$anonfun$apply$mcV$sp$1.apply(FileBasedEstimator.scala:139)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator$$anonfun$trainWithHooks$1$$anonfun$apply$mcV$sp$1.apply(FileBasedEstimator.scala:139)
  scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  org.platanios.tensorflow.api.ops.Op$.createWith(Op.scala:869)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator$$anonfun$trainWithHooks$1.apply$mcV$sp(FileBasedEstimator.scala:139)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator$$anonfun$trainWithHooks$1.apply(FileBasedEstimator.scala:117)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator$$anonfun$trainWithHooks$1.apply(FileBasedEstimator.scala:117)
  scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  org.platanios.tensorflow.api.ops.Op$.createWithNameScope(Op.scala:897)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator.trainWithHooks(FileBasedEstimator.scala:117)
  org.platanios.tensorflow.api.learn.estimators.FileBasedEstimator.train(FileBasedEstimator.scala:87)
  io.github.mandar2812.dynaml.tensorflow.Learn$$anonfun$9.apply(Learn.scala:538)
  io.github.mandar2812.dynaml.tensorflow.Learn$$anonfun$9.apply(Learn.scala:500)
  scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
  org.platanios.tensorflow.api.ops.Op$.createWith(Op.scala:869)
  org.platanios.tensorflow.api.ops.Op$API$class.createWith(Op.scala:422)
  org.platanios.tensorflow.api.package$tf$.createWith(package.scala:205)
  io.github.mandar2812.dynaml.tensorflow.Learn$.build_tf_model(Learn.scala:500)
  ammonite.$file.scripts.cifar$.<init>(cifar.sc:63)
  ammonite.$file.scripts.cifar$.<clinit>(cifar.sc)

mandar2812 commented 5 years ago

One thing I notice is this:

DynaML>import org.platanios.tensorflow.api._ 
import org.platanios.tensorflow.api._

DynaML>Shape(1, 2, 3) 
res1: Shape = [1, 2, 3]

DynaML>res1.toTensor 
res2: Tensor[INT64] = INT64[3]

I see that calling the .toTensor method on a Shape object returns an INT64 tensor by default. Could this be causing the problem?
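To make the mismatch in the trace concrete, here is a minimal toy sketch in plain Scala of the kind of per-input type check that the error message suggests. It is not TensorFlow's or tensorflow_scala's implementation; the names `DTypeCheckSketch`, `DType`, `Input`, and `mismatches` are hypothetical and exist only for illustration.

```scala
object DTypeCheckSketch {
  // Toy element types (hypothetical, not the tensorflow_scala data types).
  sealed trait DType { def name: String }
  case object Int32 extends DType { val name = "int32" }
  case object Int64 extends DType { val name = "int64" }

  // One input passed to an op, tagged with its element type.
  final case class Input(arg: String, dtype: DType)

  // Returns one message per input whose type differs from the expected one.
  def mismatches(expected: DType, inputs: Seq[Input]): Seq[String] =
    inputs.collect {
      case Input(arg, dtype) if dtype != expected =>
        s"Input '$arg' passed ${dtype.name} expected ${expected.name}"
    }
}
```

Under this model, passing four INT64 shape inputs to an op whose signature expects int32 yields four messages, one per input, which is consistent with the "4 errors while building NodeDef" line in the trace above.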

eaplatanios commented 5 years ago

@mandar2812 Yes, the default shape data type is INT64, and that is because of an awkward kind of hack within the TensorFlow main codebase. I'll post a couple of links to relevant GitHub issues I filed a while ago, once I get to the office.

Regarding the error you're getting: would it be too difficult to update to 0.4.0-SNAPSHOT? I made some major updates and now auto-differentiation is also type-safe, which should hopefully resolve the error you're getting.

eaplatanios commented 5 years ago

@mandar2812 Please disregard my previous comment. I think I'll be defaulting to INT32 for shape-to-tensor conversions, but I will update you in a couple of days.

eaplatanios commented 5 years ago

The relevant TF issue is here.

eaplatanios commented 5 years ago

@mandar2812 I settled on defaulting to INT32 for shapes as that is what the TF Python API does and also because switching to INT32 results in a 7-fold performance increase for my MT library, when working on GPUs. :)

So, I'll close this for now and feel free to reopen if the issue persists for you for some reason. I'll be pushing a full 0.4.0 release later today hopefully. I'm only currently having some issues with Scala 2.11 support that I hope to resolve soon.

mandar2812 commented 5 years ago

@eaplatanios I'm eagerly waiting for the 0.4.0 release. DynaML branch tf-0.4.0 is tracking the last tf-scala snapshot. I am able to compile my code (the part of it which depends on tf-scala). The only problem I am noticing is that resolution of implicits is a bigger challenge in the typed Output and Tensor API, especially when it comes to declaring estimators! I saw that some of your recent commits were bug fixes around implicits. I hope they help in this regard! Good luck with the release!

[EDIT]: The main problem I am observing is the resolution of implicits around NestedStructure.Aux[(X, Y), (XT, YT), ...]. These problems are occurring in DynaML's test suite. If you want to take a look, I would be happy to point them out, especially if it helps towards a better tf-scala release. Let me know.
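For readers unfamiliar with the pattern behind a type such as NestedStructure.Aux, the following is a minimal sketch of the general "Aux" technique in plain Scala. It is a toy model, not tensorflow_scala's NestedStructure: `Structure`, `Sym`, `Value`, and `describe` are hypothetical names. The point it illustrates is that an instance for a tuple can only be found if the compiler can first find an instance for every component, which is one place where implicit resolution can fail.

```scala
object AuxPatternSketch {
  // Toy stand-ins for a symbolic value and its concrete counterpart
  // (hypothetical, NOT tensorflow_scala types).
  final case class Sym(name: String)
  final case class Value(data: Vector[Int])

  // Type class that maps a structure T to a dependent output type Out.
  trait Structure[T] {
    type Out
    def describe(t: T): String
  }

  object Structure {
    // The Aux alias exposes the type member Out as a type parameter.
    type Aux[T, O] = Structure[T] { type Out = O }

    // Base case: a single symbolic value maps to a single concrete value.
    implicit val symStructure: Aux[Sym, Value] = new Structure[Sym] {
      type Out = Value
      def describe(t: Sym): String = t.name
    }

    // Inductive case: a pair has an instance only if both components do.
    implicit def pairStructure[A, AO, B, BO](implicit
        a: Aux[A, AO],
        b: Aux[B, BO]
    ): Aux[(A, B), (AO, BO)] = new Structure[(A, B)] {
      type Out = (AO, BO)
      def describe(t: (A, B)): String =
        s"(${a.describe(t._1)}, ${b.describe(t._2)})"
    }
  }

  // The compiler infers O from T by searching for a matching instance.
  def describe[T, O](t: T)(implicit ev: Structure.Aux[T, O]): String =
    ev.describe(t)
}
```

In this sketch, `AuxPatternSketch.describe((Sym("x"), Sym("y")))` compiles because both components have instances, whereas a pair containing a type without a `Structure` instance would be rejected at compile time with a missing-implicit error.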

eaplatanios commented 5 years ago

@mandar2812 Thanks a lot! It's great to hear you've been updating to support 0.4.0. :) And yes, there was an issue with resolution of implicits. It's been resolved for Scala 2.12, but I'm having issues with 2.11. Shapeless' Lazy/Strict seems to not be working correctly on 2.11 and so I'll try to resolve this by getting some help from the shapeless community, as I feel it shouldn't be a problem on our end.