JuliaGPU / Metal.jl

Metal programming in Julia

M1/M2: Large matrix multiplications can contain NaNs #381

Open chengchingwen opened 3 days ago

chengchingwen commented 3 days ago

MWE:

julia> using Metal, LinearAlgebra

julia> a = Metal.randn(10000, 10000);

julia> b = Metal.randn(10000, 10000);

julia> c = a * b';

julia> for i in 1:10
           C = Metal.zeros(Float32, size(a))
           mul!(C, a, b')
           @assert C ≈ c "$i"
       end
ERROR: AssertionError: 1
Stacktrace:
 [1] top-level scope
   @ ./REPL[58]:4

julia> for i in 1:10
           C = Metal.zeros(Float32, size(a))
           mul!(C, a, b')
           @assert C ≈ c "$i"
       end
ERROR: AssertionError: 8
Stacktrace:
 [1] top-level scope
   @ ./REPL[58]:4

julia> for i in 1:10
           @assert a * b' ≈ c "$i"
       end
ERROR: AssertionError: 3
Stacktrace:
 [1] top-level scope
   @ ./REPL[59]:2

julia> for i in 1:10
           @assert a * b' ≈ c "$i"
       end
ERROR: AssertionError: 8
Stacktrace:
 [1] top-level scope
   @ ./REPL[59]:2
chengchingwen commented 3 days ago

Adding wait_completed on matmul!'s command buffer does not help.

christiangnrd commented 3 days ago

Adding Metal.@sync to the mul! also does not help. ~However, I cannot reproduce when calling MPS.matmul! directly.~
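Roughly, a minimal sketch of what I mean:

for i in 1:10
    C = Metal.zeros(Float32, size(a))
    # wait for the GPU work submitted by mul! to finish before comparing
    Metal.@sync mul!(C, a, b')
    @assert C ≈ c "$i"
end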

maleadt commented 2 days ago

I cannot reproduce at all on Metal.jl#master using an M3 Pro, but it does seem reproducible on an M1 Pro.

I wonder if this is a problem with mapreduce, since you're calling isapprox on GPU arrays. Can you test if calling @assert Array(C) ≈ Array(c) makes things pass? It does here, at least.
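That is, something like this sketch, which moves the comparison to the CPU and bypasses the GPU mapreduce kernel:

for i in 1:10
    C = Metal.zeros(Float32, size(a))
    mul!(C, a, b')
    # copy both results to host memory so isapprox runs on the CPU
    @assert Array(C) ≈ Array(c) "$i"
end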

tgymnich commented 2 days ago

I can reproduce the issue on M1 master. It also looks like all the tasks run on the same queue.

chengchingwen commented 2 days ago

The issue was found on an M2 Max. The MWE only fails if the arrays are large enough. It seems the subsequent kernel is launched before the matmul has finished. Is it possible that the mapreduce is not checking the availability of the input arrays?
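
One way to test that would be to force an explicit synchronization between the matmul and the check, e.g. this sketch (assuming synchronize() blocks until all previously submitted work has completed):

for i in 1:10
    C = Metal.zeros(Float32, size(a))
    mul!(C, a, b')
    synchronize()  # block until the matmul has actually finished
    @assert C ≈ c "$i"
end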

p.s. I'm about to board the plane to JuliaCon so I won't be able to test it soon.

maleadt commented 2 days ago

> I wonder if this is a problem with mapreduce, since you're calling isapprox on GPU arrays. Can you test if calling @assert Array(C) ≈ Array(c) makes things pass? It does here, at least.

It also reproduces when comparing on the CPU, just much less frequently, so this isn't a mapreduce issue.

maleadt commented 2 days ago

Looks like a bunch of NaNs in the second matrix.

christiangnrd commented 2 days ago

My current MWE is:

using Metal, LinearAlgebra; begin
    n = 10000
    a = mtl(randn(Float32,n,n))
    b = mtl(randn(Float32,n,n))
    C = Metal.zeros(Float32, size(a))
    for i in 1:10
        C = Metal.zeros(Float32, size(a))
        mul!(C,a,b)
        @assert !any(isnan.(C)) "$i"
    end
end

I define C outside the loop so I can access it afterwards. When I had C .= ... in the loop instead of C = ..., the assertion only ever failed at iteration 1. I suspect it has to do with the location in memory of the array.
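
For illustration, a buffer-reusing variant would look something like this sketch (not necessarily my exact code; the point is that the output array keeps the same memory location across iterations):

C = Metal.zeros(Float32, n, n)
for i in 1:10
    fill!(C, 0f0)  # reuse the same allocation instead of creating a fresh array
    mul!(C, a, b)
    @assert !any(isnan.(C)) "$i"
end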

maleadt commented 2 days ago

> I cannot reproduce when calling MPS.matmul! directly

I can:

using Metal, LinearAlgebra

function main(T=Float32, N=10000)
    a = Metal.rand(T, N, N)
    b = Metal.rand(T, N, N)
    c = a * b'
    synchronize()

    for i in 1:100
        println("Iteration $i")
        d = Metal.zeros(T, size(a))
        MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
                    #=transpose_a=#false, #=transpose_b=#true)
        @assert !any(isnan.(Array(d))) "NaN in iteration $i"

        # XXX: this redundant check is needed, or the failure never occurs
        @assert !any(isnan.(d))
    end
end

isinteractive() || main()

The need for a secondary kernel is very weird.

tgymnich commented 2 days ago

It is not MPS related:

using GPUArrays
using LinearAlgebra: MulAddMul  # non-exported helper passed to generic_matmatmul!

for i in 1:10
    C = Metal.zeros(Float32, size(a))
    GPUArrays.generic_matmatmul!(C, a, b, MulAddMul())
    @assert C ≈ c "$i"
end
maleadt commented 2 days ago

> GPUArrays.generic_matmatmul!(C, a, b, MulAddMul())

I don't see how that's related; it's an entirely different kernel. Does it contain NaNs in similar places? The generic matmatmul kernel, while being extraordinarily slow, doesn't introduce NaNs here.

tgymnich commented 2 days ago

Just wanted to confirm that it's MPS rather than the synchronisation between kernel launches.

tgymnich commented 2 days ago

I've been seeing the NaN issues with large arrays for a long time in #145

MLX seems fine:

import mlx.core as mx

a = mx.random.normal((10000, 10000))
b = mx.random.normal((10000, 10000))
c = a @ b.T

for i in range(0,10):
    C = a @ b.T
    assert(mx.allclose(C,c))