apple / tensorflow_macos

TensorFlow for macOS 11.0+ accelerated using Apple's ML Compute framework.

MetalPerformanceShaders errors when running Albert model #97

Open mflis opened 3 years ago

mflis commented 3 years ago

Hi, I noticed a fatal error when trying to fine-tune the albert-base-v2 model from the transformers package with tensorflow_macos. Here's a repo with a reproducible case: https://github.com/mflis/gpu_experiments/tree/albert_issue

When running on the CPU, I see this error:

Python(13514,0x70000979d000) malloc: Incorrect checksum for freed object 0x7fd8a3e82800: probably modified after being freed.
Corrupt value: 0x3b7eac9f3bec9097
Python(13514,0x70000979d000) malloc: *** set a breakpoint in malloc_error_break to debug

On the GPU, the error looks like this:

/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MetalPerformanceShaders-124.0.30/MPSCore/Types/MPSMatrix.mm:241: failed assertion `[MPSMatrix initWithBuffer:descriptor:] buffer may not be nil.'
fish: 'python tf_gpu_experiment.py' terminated by signal SIGABRT (Abort)

With "standard" tensorflow 2.4.0, training runs without problems. Running bert-base-uncased also works fine on tensorflow_macos. I'm running on a 2019 MacBook Pro (more details in the repo).
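For reference, here is a minimal sketch of how the CPU and GPU runs can be switched. It assumes the optional `mlcompute.set_mlc_device` API described in the tensorflow_macos README. The helper name `select_mlc_device` is hypothetical, and the script in the linked repo may select the device differently.

```python
def select_mlc_device(device_name: str) -> bool:
    """Ask ML Compute to run on 'cpu', 'gpu', or 'any'.

    Returns True if the device was set, or False if the mlcompute module
    (shipped only with the tensorflow_macos fork) cannot be imported.
    """
    if device_name not in ("cpu", "gpu", "any"):
        raise ValueError(f"unsupported device: {device_name!r}")
    try:
        # This module exists in the tensorflow_macos fork, not in stock TensorFlow.
        from tensorflow.python.compiler.mlcompute import mlcompute
    except ImportError:
        return False
    mlcompute.set_mlc_device(device_name=device_name)
    return True


if __name__ == "__main__":
    # Call this before building the model; switch to "gpu" for the second run.
    if not select_mlc_device("cpu"):
        print("mlcompute not available; running with default device placement")
```

Calling the helper with "cpu" and then, in a separate run, with "gpu" should correspond to the two failure modes shown above, assuming the fork is installed.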

If you need more details, I'll be happy to help. Thanks.

anna-tikhonova commented 3 years ago

Thank you very much for reporting this. We will investigate and report back.