apple / tensorflow_macos

TensorFlow for macOS 11.0+ accelerated using Apple's ML Compute framework.

Errors training on M1 GPU #202

Open peterjungx opened 3 years ago

peterjungx commented 3 years ago

Training runs on the CPU without problems, but when I try to train on the GPU (by calling mlcompute.set_mlc_device(device_name='gpu')), I repeatedly get the following error:

Error: command buffer exited with error status.
    The Metal Performance Shaders operations encoded on it may not have completed.
    Error: 
    (null)
    Ignored (for causing prior/excessive GPU errors) (IOAF code 4)
    <AGXM1FamilyCommandBuffer: 0x3e0728f80>
    label = <none> 
    device = <AGXM1Device: 0x12abb8400>
        name = Apple M1 
    commandQueue = <AGXM1FamilyCommandQueue: 0x14f87b400>
        label = <none> 
        device = <AGXM1Device: 0x12abb8400>
            name = Apple M1 
    retainedReferences = 1

The Python process does not crash, but, as expected, the trained model gives bogus results. Activity Monitor shows some minimal GPU activity for the python3.8 process.

I should mention that I am loading a model of about 550 MB (.mat) and using the old tf.compat.v1.Graph() and tf.compat.v1.Session() interfaces.

What is the likely cause of this behaviour, and how should I go about fixing it?

Edit: At the beginning of the output I noticed the following warning message:

WARNING:tensorflow:Eager mode uses the CPU. Switching to the CPU.

But obviously my code attempts to use the GPU. I am confused.
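
For reference, here is a minimal sketch of the device setup described above, with eager execution turned off before the device is selected. The warning suggests that eager mode was still active when set_mlc_device was called, so this may be worth trying; whether it also resolves the command buffer error is not confirmed. The helper name configure_mlc_device and its status strings are hypothetical and only serve the illustration. The import path for mlcompute is the one used by the tensorflow_macos fork; on a stock TensorFlow install, or with no TensorFlow at all, the helper reports "unavailable".

```python
def configure_mlc_device(device_name="gpu"):
    """Try to select an ML Compute device in graph mode.

    Hypothetical helper for illustration. Returns "unavailable" when the
    tensorflow_macos fork (and its mlcompute module) cannot be imported,
    otherwise "graph" or "eager" depending on the resulting execution mode.
    """
    try:
        import tensorflow as tf
        # This module only exists in the tensorflow_macos fork.
        from tensorflow.python.compiler.mlcompute import mlcompute
    except ImportError:
        return "unavailable"

    # Assumption: eager mode is what triggers the "Switching to the CPU"
    # warning, so disable it before choosing the device.
    tf.compat.v1.disable_eager_execution()
    mlcompute.set_mlc_device(device_name=device_name)

    return "eager" if tf.executing_eagerly() else "graph"


status = configure_mlc_device("gpu")
print("ML Compute setup status:", status)
```

If the status is "graph" and the warning no longer appears, the existing tf.compat.v1.Graph() and tf.compat.v1.Session() code can be run unchanged after this call.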