jameshfisher commented 3 years ago

Steps to reproduce

1. Install tensorflow_macos (I did so with the bash script).
2. Save the following program as corruption.py.
3. Run the program with python corruption.py.

import tensorflow as tf

gen  = tf.keras.layers.Conv2D(1, 1)
disc = tf.keras.layers.Conv2D(1, 1, activation='relu')

gen_opt  = tf.keras.optimizers.Adam()
disc_opt = tf.keras.optimizers.Adam()

@tf.function
def train_step(input_image):
  with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
    gen_output = gen(input_image)
    disc_output = disc(gen_output)
    gen_loss = tf.keras.losses.binary_crossentropy(tf.ones_like(disc_output), disc_output)
    disc_loss = tf.keras.losses.binary_crossentropy(tf.zeros_like(disc_output), disc_output)
  gen_gradients = gen_tape.gradient(gen_loss, gen.trainable_variables)
  disc_gradients = disc_tape.gradient(disc_loss, disc.trainable_variables)
  gen_opt.apply_gradients(zip(gen_gradients, gen.trainable_variables))
  disc_opt.apply_gradients(zip(disc_gradients, disc.trainable_variables))

while True:
  print("step")
  train_step(tf.random.normal([1, 8, 8, 1], dtype=tf.float32))

(This program is a minimized test case derived from following the Pix2Pix tutorial. Every line in this program seems to be necessary to cause the memory corruption.)

Expected behavior

The program does not abort or segfault.

Actual behavior

The program non-deterministically fails in one of the following ways:

$ python segfault.py
step
2021-03-03 20:23:13.328 Python[72904:5994415] *** Terminating app due to uncaught exception 'NSRangeException', reason: '*** -[__NSArrayM objectAtIndexedSubscript:]: index 0 beyond bounds for empty array'
*** First throw call stack:
(
    0   CoreFoundation                      0x00007fff2060f6af __exceptionPreprocess + 242
    1   libobjc.A.dylib                     0x00007fff203473c9 objc_exception_throw + 48
    2   CoreFoundation                      0x00007fff206c3a9a -[__NSCFString characterAtIndex:].cold.1 + 0
    3   CoreFoundation                      0x00007fff20582e26 -[__NSArrayM objectAtIndexedSubscript:] + 142
    4   MLCompute                           0x00007fff2a1dcf40 -[MLCTrainingGraph resultGradientTensorToUseByExecuteGradientForLayer:sourceIndex:incrementIntermediateIndex:] + 190
    5   MLCompute                           0x00007fff2a1de5e2 -[MLCTrainingGraph allocateGradientTensorsForLayersInGraph:] + 803
    6   MLCompute                           0x00007fff2a1df116 -[MLCTrainingGraph compileAndAllocateGradientTensorsForGraph:] + 126
    7   MLCompute                           0x00007fff2a1e8704 -[MLCTrainingGraph executeGradientWithBatchSize:options:outputsData:completionHandler:] + 939
    8   _pywrap_tensorflow_internal.so      0x0000000111bf60f0 _ZN10tensorflow9mlcompute7kernels13MLCSubgraphOp23ExecuteMLCTrainingGraphEPNS_15OpKernelContextEPNS1_10MLCContextEj + 736
    9   _pywrap_tensorflow_internal.so      0x0000000111bf4e69 _ZN10tensorflow9mlcompute7kernels13MLCSubgraphOp20ProcessMLCSubgraphOpEPNS_15OpKernelContextEPPNS1_10MLCContextEPPNS1_15TFContextStatusE + 537
    10  _pywrap_tensorflow_internal.so      0x0000000111bf8404 _ZN10tensorflow9mlcompute7kernels13MLCSubgraphOp7ComputeEPNS_15OpKernelContextE + 868
    11  libtensorflow_framework.2.dylib     0x0000000129bcec7c _ZN10tensorflow12_GLOBAL__N_113ExecutorStateINS_21SimplePropagatorStateEE7ProcessENS2_10TaggedNodeEx + 4124
    12  libtensorflow_framework.2.dylib     0x0000000129c68b03 _ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi + 1667
    13  libtensorflow_framework.2.dylib     0x0000000129c68382 _ZZN10tensorflow6thread16EigenEnvironment12CreateThreadENSt3__18functionIFvvEEEENKUlvE_clEv + 66
    14  libtensorflow_framework.2.dylib     0x0000000129c56438 _ZN10tensorflow12_GLOBAL__N_17PThread8ThreadFnEPv + 104
    15  libsystem_pthread.dylib             0x00007fff2049d950 _pthread_start + 224
    16  libsystem_pthread.dylib             0x00007fff2049947b thread_start + 15
)
libc++abi.dylib: terminating with uncaught exception of type NSException
Abort trap: 6

$ python segfault.py
step
step
...
step
step
Python(72940,0x70000b6d7000) malloc: *** error for object 0x600007fcb8c1e11b: pointer being freed was not allocated
Python(72940,0x70000b6d7000) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

$ python segfault.py
step
step
...
step
step
Python(73164,0x70000826b000) malloc: Double free of object 0x7ffd5a2fb100
Python(73164,0x7000081e8000) malloc: Incorrect checksum for freed object 0x7ffd5a2fd4c0: probably modified after being freed.
Corrupt value: 0x3f82bf133f8333e3
Python(73164,0x70000826b000) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6

$ python segfault.py
step
step
...
step
step
Python(73302,0x70000165d000) malloc: *** error for object 0x7fdf28aa3400: pointer being freed was not allocated
Python(73302,0x700001763000) malloc: tiny_free_list_remove_ptr: Internal invariant broken (prev ptr of next): ptr=0x7fdf28aa8980, next_prev=0x70
Python(73302,0x70000165d000) malloc: *** set a breakpoint in malloc_error_break to debug
Abort trap: 6
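
Since the failure mode changes from run to run, it can help to run the script many times and count which failure shows up how often. Below is a minimal sketch of such a harness, not something from the original report. The names `MARKERS`, `classify`, `run_once` and `tally` are hypothetical, the marker strings are copied from the crash output above, and the 60-second timeout is an arbitrary assumption (the repro loops forever, so a run that survives the timeout is counted as "timeout").

```python
import collections
import os
import signal
import subprocess
import sys

# Substrings taken from the crash output quoted in this issue.
MARKERS = [
    "NSRangeException",
    "Double free",
    "Incorrect checksum for freed object",
    "pointer being freed was not allocated",
]


def classify(returncode, stderr_text):
    """Label one run by its stderr contents, falling back to the exit status."""
    if returncode is None:
        return "timeout"  # no crash observed within the time limit
    for marker in MARKERS:
        if marker in stderr_text:
            return marker
    if returncode < 0:
        # On POSIX, a negative return code means "killed by that signal".
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit {returncode}"


def run_once(cmd, timeout_s, extra_env=None):
    """Run cmd once; return (returncode or None on timeout, stderr text)."""
    # extra_env is optional, e.g. {"MallocScribble": "1"} (a macOS malloc
    # debugging variable described in `man malloc`); whether it helps with
    # this particular crash is an assumption.
    env = dict(os.environ, **(extra_env or {}))
    try:
        proc = subprocess.run(cmd, env=env, timeout=timeout_s, text=True,
                              stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    except subprocess.TimeoutExpired:
        return None, ""
    return proc.returncode, proc.stderr


def tally(cmd, runs=10, timeout_s=60.0, extra_env=None):
    """Run cmd `runs` times and count how often each outcome occurs."""
    counts = collections.Counter()
    for _ in range(runs):
        counts[classify(*run_once(cmd, timeout_s, extra_env))] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally.py corruption.py
    for label, n in tally([sys.executable, sys.argv[1]]).most_common():
        print(f"{n:3d}  {label}")
```

All four failures above end in "Abort trap: 6", so the exit status alone would not tell them apart. That is why the sketch checks stderr first and only falls back to the signal name.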

System details

$ python --version
Python 3.8.2
$ python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
v1.12.1-44680-gc3fea33a21 2.4.0-rc0
$ sw_vers
ProductName:    macOS
ProductVersion: 11.2.1
BuildVersion:   20D74
$ /Volumes/Macintosh\ HD/usr/sbin/system_profiler SPHardwareDataType
Hardware:

    Hardware Overview:

      Model Name: MacBook Air
      Model Identifier: MacBookAir8,2
      Processor Name: Dual-Core Intel Core i5
      Processor Speed: 1.6 GHz
      Number of Processors: 1
      Total Number of Cores: 2
      L2 Cache (per Core): 256 KB
      L3 Cache: 4 MB
      Hyper-Threading Technology: Enabled
      Memory: 16 GB
...
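
For anyone adding their own data point, the version details above can also be gathered in one go from Python. This is a rough sketch rather than part of the original report: `collect_system_details` is a hypothetical helper, the TensorFlow import is guarded in case it is not installed, and the hardware details from system_profiler are not covered.

```python
import platform


def collect_system_details():
    """Gather the Python, OS and TensorFlow versions as a dict."""
    details = {
        "python": platform.python_version(),
        "machine": platform.machine(),
        # mac_ver() returns an empty release string on non-macOS systems.
        "macos": platform.mac_ver()[0] or "n/a (not macOS)",
    }
    try:
        import tensorflow as tf
        # Same two fields as the one-liner used in this issue.
        details["tensorflow"] = f"{tf.version.GIT_VERSION} {tf.version.VERSION}"
    except ImportError:
        details["tensorflow"] = "not importable"
    return details


if __name__ == "__main__":
    for key, value in collect_system_details().items():
        print(f"{key}: {value}")
```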

jameshfisher commented 3 years ago

(Note: I have omitted the following output from all examples. I get this output every time, which I gather is normal and not an error.)

2021-03-03 20:31:15.060161: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-03 20:31:15.060909: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-03 20:31:15.603496: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)

AlexFWulff commented 3 years ago

+1. I'm having similar issues when following this tutorial, albeit with somewhat different memory or Metal-related errors. @jameshfisher did you ever resolve this?

jameshfisher commented 3 years ago

Nope, still open afaik

apple / tensorflow_macos

Memory corruption crashes in program with multiple gradient tapes #186
