Open chengchingwen opened 4 months ago
Adding wait_completed on matmul!'s command buffer does not help.
Adding Metal.@sync to the mul! also does not help. ~However, I cannot reproduce when calling MPS.matmul! directly.~
I cannot reproduce at all on Metal.jl#master using an M3 Pro, but it does seem reproducible on an M1 Pro.
I wonder if this is a problem with mapreduce, since you're calling isapprox on GPU arrays. Can you test if calling @assert Array(C) ≈ Array(c) makes things pass? It does here, at least.
I can reproduce the issue on M1 master. It also looks like all the tasks run on the same queue.
The issue was found on an M2 Max. The MWE only fails if the array is large enough. It seems to be launching the subsequent kernel before the matmul has finished. Is it possible that mapreduce is not checking the availability of the input arrays?
p.s. I'm about to board the plane to JuliaCon so I won't be able to test it soon.
> I wonder if this is a problem with mapreduce, since you're calling isapprox on GPU arrays. Can you test if calling @assert Array(C) ≈ Array(c) makes things pass? It does here, at least.
It also reproduces when comparing on the CPU, just much less likely, so this isn't a mapreduce issue.
Looks like a bunch of NaNs in the second matrix.
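For context on why NaNs in the result make the ≈ assertion fail no matter where the comparison runs: as far as I understand it, an array isapprox check compares the norm of the difference against a relative tolerance times the larger of the two norms, and a single NaN poisons that norm. The sketch below is a plain-Python simplification of that rule, not Julia's or GPUArrays' actual implementation; approx_equal and norm2 are hypothetical names, and the default tolerance (square root of the Float32 machine epsilon) is an assumption.

```python
import math

# Assumed default: relative tolerance = sqrt(eps) for 32-bit floats.
FLOAT32_EPS = 2.0 ** -23


def norm2(xs):
    """Euclidean norm of a flat sequence of floats."""
    return math.sqrt(sum(x * x for x in xs))


def approx_equal(xs, ys, rtol=math.sqrt(FLOAT32_EPS)):
    """Norm-based approximate equality (simplified sketch).

    If either input contains a NaN, the difference norm is NaN and the
    comparison below evaluates to False.
    """
    diff = norm2([x - y for x, y in zip(xs, ys)])
    return diff <= rtol * max(norm2(xs), norm2(ys))


nan = float("nan")
print(approx_equal([1.0, 2.0], [1.0, 2.0 + 1e-6]))  # small rounding difference
print(approx_equal([1.0, nan], [1.0, 2.0]))         # one corrupted entry
```

Under this rule the check cannot distinguish one corrupted element from a completely wrong result, which is why the later reproducers test for NaNs directly.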
My current MWE is:
using Metal, LinearAlgebra; begin
    n = 10000
    a = mtl(randn(Float32,n,n))
    b = mtl(randn(Float32,n,n))
    C = Metal.zeros(Float32, size(a))
    for i in 1:10
        C = Metal.zeros(Float32, size(a))
        mul!(C,a,b)
        @assert !any(isnan.(C)) "$i"
    end
end
I define C outside the loop so I can access it afterwards. When I had C .= ... in the loop instead of C = ..., it only ever happened at iteration 1. I suspect it has to do with the location of the array in memory.
> I cannot reproduce when calling MPS.matmul! directly
I can:
using Metal, LinearAlgebra
function main(T=Float32, N=10000)
    a = Metal.rand(T, N, N)
    b = Metal.rand(T, N, N)
    c = a * b'
    synchronize()
    for i in 1:100
        println("Iteration $i")
        d = Metal.zeros(T, size(a))
        MPS.matmul!(d, a, b, #=alpha=#true, #=beta=#false,
                    #=transpose_a=#false, #=transpose_b=#true)
        @assert !any(isnan.(Array(d))) "NaN in iteration $i"
        # XXX: this redundant check is needed, or the failure never occurs
        @assert !any(isnan.(d))
    end
end
isinteractive() || main()
The need for a secondary kernel is very weird.
It is not MPS related:
for i in 1:10
    C = Metal.zeros(Float32, size(a))
    GPUArrays.generic_matmatmul!(C, a, b, MulAddMul())
    @assert C ≈ c "$i"
end
> GPUArrays.generic_matmatmul!(C, a, b, MulAddMul())
I don't see how that's related; it's an entirely different kernel. Does it contain NaNs in similar places? The generic matmatmul kernel, while being extraordinarily slow, doesn't introduce NaNs here.
Just wanted to confirm that it's MPS rather than the synchronisation between kernel launches.
I've been seeing the NaN issues with large arrays for a long time in #145
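To help answer the question above of whether the NaNs land in similar places from run to run, it can be more informative to summarize their positions than to only assert their absence. A minimal sketch in plain Python, assuming the result has already been copied to the host as a flat, row-major sequence (Julia arrays are column-major, so the row and column roles would swap there); nan_summary is a hypothetical helper, not part of Metal.jl, MPS, or MLX.

```python
import math


def nan_summary(flat, n_cols):
    """Summarize NaN positions in a row-major matrix stored as a flat sequence.

    Returns the NaN count, the (row, col) of the first and last NaN,
    and the sorted list of rows that contain at least one NaN.
    """
    positions = [i for i, v in enumerate(flat) if math.isnan(v)]
    if not positions:
        return {"count": 0, "first": None, "last": None, "rows": []}
    return {
        "count": len(positions),
        "first": divmod(positions[0], n_cols),
        "last": divmod(positions[-1], n_cols),
        "rows": sorted({i // n_cols for i in positions}),
    }


# Toy 3x4 result with two corrupted entries in the last row.
nan = float("nan")
result = [1.0] * 12
result[9] = nan
result[11] = nan
print(nan_summary(result, n_cols=4))
```

Comparing these summaries across failing iterations would show whether the corruption is confined to a contiguous block, such as the tail of the buffer, or scattered.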
MLX seems fine:
import mlx.core as mx
a = mx.random.normal((10000, 10000))
b = mx.random.normal((10000, 10000))
c = a @ b.T
for i in range(0,10):
    C = a @ b.T
    assert(mx.allclose(C,c))
I would love for someone to review my code because I'm not a Swift expert by any means, but I was able to reproduce this in the Swift REPL.
@christiangnrd Your Swift code looks good to me. It turns out MLX doesn’t even use MPS.
Haven't been able to look into this, but here's the ObjC version:
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#import <MetalPerformanceShaders/MetalPerformanceShaders.h>
void performMatrixMultiplication(NSInteger N) {
    if (N == 0) {
        N = 10000;
    }
    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    id<MTLCommandQueue> commandQueue = [device newCommandQueue];
    if (!device || !commandQueue) {
        NSLog(@"Metal device or command queue could not be created");
        return;
    }
    NSLog(@"Initializing a & b");
    // Generate random NxN matrices
    float *a = calloc(N * N, sizeof(float));
    float *b = calloc(N * N, sizeof(float));
    for (NSInteger i = 0; i < N * N; i++) {
        a[i] = 1.0f;
        b[i] = 1.0f;
    }
    NSLog(@"a and b created\n");
    // Metal buffers for matrices
    id<MTLBuffer> aBuffer = [device newBufferWithBytes:a length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
    id<MTLBuffer> bBuffer = [device newBufferWithBytes:b length:sizeof(float) * N * N options:MTLResourceStorageModeShared];
    NSLog(@"Starting matmul\n");
    for (NSInteger i = 1; i <= 10; i++) {
        NSLog(@"%ld\n", (long)i);
        // Create MPSMatrices
        MPSMatrixDescriptor *aMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                       columns:N
                                                                                      rowBytes:sizeof(float) * N
                                                                                      dataType:MPSDataTypeFloat32];
        MPSMatrixDescriptor *bMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                       columns:N
                                                                                      rowBytes:sizeof(float) * N
                                                                                      dataType:MPSDataTypeFloat32];
        MPSMatrix *aMatrix = [[MPSMatrix alloc] initWithBuffer:aBuffer descriptor:aMatrixDescriptor];
        MPSMatrix *bMatrix = [[MPSMatrix alloc] initWithBuffer:bBuffer descriptor:bMatrixDescriptor];
        // Matrix multiplication using MPSMatrixMultiplication
        MPSMatrixMultiplication *matrixMultiplication = [[MPSMatrixMultiplication alloc] initWithDevice:device
                                                                                          transposeLeft:NO
                                                                                         transposeRight:NO
                                                                                             resultRows:N
                                                                                          resultColumns:N
                                                                                        interiorColumns:N
                                                                                                  alpha:1.0
                                                                                                   beta:0.0];
        id<MTLBuffer> cBuffer = [device newBufferWithLength:sizeof(float) * N * N options:MTLResourceStorageModeShared];
        MPSMatrixDescriptor *cMatrixDescriptor = [MPSMatrixDescriptor matrixDescriptorWithRows:N
                                                                                       columns:N
                                                                                      rowBytes:sizeof(float) * N
                                                                                      dataType:MPSDataTypeFloat32];
        MPSMatrix *cMatrix = [[MPSMatrix alloc] initWithBuffer:cBuffer descriptor:cMatrixDescriptor];
        id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];
        [matrixMultiplication encodeToCommandBuffer:commandBuffer
                                         leftMatrix:aMatrix
                                        rightMatrix:bMatrix
                                       resultMatrix:cMatrix];
        [commandBuffer commit];
        [commandBuffer waitUntilCompleted];
        // Check for NaNs in the result matrix
        float *cPointer = cBuffer.contents;
        for (NSInteger j = 0; j < N * N; j++) {
            if (isnan(cPointer[j])) {
                NSLog(@"NaN in iteration %ld", (long)i);
                free(a);
                free(b);
                return;
            }
        }
    }
    free(a);
    free(b);
}
int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSInteger N = 10000;
        if (argc > 1) {
            N = atoi(argv[1]);
        }
        performMatrixMultiplication(N);
    }
    return 0;
}
❯ clang mps.m -o mps -framework Foundation -framework Metal -framework MetalPerformanceShaders -fobjc-arc -mmacosx-version-min=10.13
❯ ./mps
2024-07-13 12:23:11.771 mps[54256:2493528] Initializing a & b
2024-07-13 12:23:11.931 mps[54256:2493528] a and b created
2024-07-13 12:23:12.001 mps[54256:2493528] Starting matmul
2024-07-13 12:23:12.001 mps[54256:2493528] 1
2024-07-13 12:23:13.933 mps[54256:2493528] 2
2024-07-13 12:23:15.477 mps[54256:2493528] 3
2024-07-13 12:23:16.997 mps[54256:2493528] 4
2024-07-13 12:23:18.440 mps[54256:2493528] NaN in iteration 4
Should we just file a radar / feedback?
I'll have a better look first and forward it to our Apple contact.
Apparently this looks like an ARC bug. Curiously, the ObjC reproducer is "fixed" by adding an @autoreleasepool around the for loop body, but the same doesn't hold in Julia (in fact, the original issue was calling into mul! which is already marked @autoreleasepool).
Of course, the Julia MWE is more complex, as the @assert !any(isnan.(d)) involves two additional kernels...
Couldn't reproduce the Objective-C case today, with or without autoreleasepool.
Swift and Julia were still reproducible.
I can reproduce the error in both Swift and Objective-C, and it goes away when surrounded by an autoreleasepool block in both languages.
Oops. I just overlooked the second autoreleasepool. The first one is actually not necessary (at least to hide our bug).
By "the first one" do you mean the autoreleasepool in main?
I'm able to reproduce this without the second redundant check.
Our NSAutoreleasePool seems to contain roughly the same objects before the NaN check as the objc version from above. The most obvious difference is that the correct objc version has a CaptureMTLDevice and an AGXG13XFamilyComputeContext, and we have an AGXG13XFamilyCommandBuffer (could be debug / Xcode related).
iteration 1
objc[6905]: ##############
objc[6905]: AUTORELEASE POOLS for thread 0x203b9b240
objc[6905]: 77 releases pending.
objc[6905]: [0x14300d000] ................ PAGE (hot) (cold)
objc[6905]: [0x14300d038] ################ POOL 0x14300d038
objc[6905]: [0x14300d040] 0x6000004c4860 __NSSingleEntryDictionaryI
objc[6905]: [0x14300d048] 0x6000027ccd20 NSBundle autorelease count 2
objc[6905]: [0x14300d050] 0x6000004cfd40 __NSDictionaryM autorelease count 2
objc[6905]: [0x14300d058] 0x600002fc8690 MTLCommandQueueDescriptorInternal
objc[6905]: [0x14300d060] 0x600000ac0090 NSUserDefaults autorelease count 4
objc[6905]: [0x14300d068] 0x6000004c4b20 __NSSingleEntryDictionaryI
objc[6905]: [0x14300d070] 0x6000004d4660 __NSSingleEntryDictionaryI
objc[6905]: [0x14300d078] 0x6000004d4220 __NSSingleEntryDictionaryI
objc[6905]: [0x14300d080] 0x6000004d46a0 __NSSingleEntryDictionaryI
objc[6905]: [0x14300d088] ################ POOL 0x14300d088
objc[6905]: [0x14300d090] 0x6000011ce180 MPSMatrixDescriptor
objc[6905]: [0x14300d098] 0x6000011cde00 MPSMatrixDescriptor
objc[6905]: [0x14300d0a0] 0x145809000 AGXG13XDevice autorelease count 15
objc[6905]: [0x14300d0a8] 0x144105550 CaptureMTLDevice autorelease count 4
objc[6905]: [0x14300d0b0] 0x6000011cc540 __NSCFString
objc[6905]: [0x14300d0b8] 0x600002ac8d80 __NSCFString
objc[6905]: [0x14300d0c0] 0x600003dcc540 NSPathStore2
objc[6905]: [0x14300d0c8] 0x6000011cc600 __NSBundleTables autorelease count 3
objc[6905]: [0x14300d0d0] 0x6000027dc140 NSBundle autorelease count 2
objc[6905]: [0x14300d0d8] 0x6000027cd0e0 NSBundle
objc[6905]: [0x14300d0e0] 0x6000020cc480 NSURL
objc[6905]: [0x14300d0e8] 0x6000035cc500 __NSCFString
objc[6905]: [0x14300d0f0] 0x6000004dd4e0 NSFileManager
objc[6905]: [0x14300d0f8] 0x6000020cc5a0 NSURL
objc[6905]: [0x14300d100] 0x6000035cc280 __NSCFString
objc[6905]: [0x14300d108] 0x6000035cc6e0 __NSCFString autorelease count 2
objc[6905]: [0x14300d110] 0x6000004df3e0 NSConcreteData
objc[6905]: [0x14300d118] 0x6000027cd810 Swift.__StringStorage
objc[6905]: [0x14300d120] 0x6000027cd860 Swift.__StringStorage
objc[6905]: [0x14300d128] 0x6000027cd8b0 Swift.__StringStorage
objc[6905]: [0x14300d130] 0x6000027cd900 Swift.__StringStorage
objc[6905]: [0x14300d138] 0x6000027cd950 Swift.__StringStorage
objc[6905]: [0x14300d140] 0x6000027cd9a0 Swift.__StringStorage
objc[6905]: [0x14300d148] 0x6000035cc6e0 __NSCFString autorelease count 6
objc[6905]: [0x14300d150] 0x6000011d4980 MPSMatrixDescriptor
objc[6905]: [0x14300d158] 0x144105550 CaptureMTLDevice autorelease count 2
objc[6905]: [0x14300d160] 0x6000036ceeb0 AGXG13XFamilyComputeContext
objc[6905]: [0x14300d168] 0x6000011d4b80 __NSCFString
objc[6905]: [0x14300d170] 0x600002acaf80 __NSCFString
objc[6905]: [0x14300d178] 0x600003dcc2a0 NSPathStore2
objc[6905]: [0x14300d180] 0x6000011cc600 __NSBundleTables autorelease count 3
objc[6905]: [0x14300d188] 0x6000027cd0e0 NSBundle
objc[6905]: [0x14300d190] 0x6000027dc140 NSBundle
objc[6905]: [0x14300d198] 0x6000027cda90 NSBundle autorelease count 2
objc[6905]: [0x14300d1a0] 0x6000020cc600 NSURL
objc[6905]: [0x14300d1a8] 0x6000035cc8c0 __NSCFString
objc[6905]: [0x14300d1b0] 0x6000004df980 NSFileManager
objc[6905]: [0x14300d1b8] 0x6000020cc6c0 NSURL
objc[6905]: [0x14300d1c0] 0x6000035ccb40 __NSCFString
objc[6905]: [0x14300d1c8] 0x6000035cc960 __NSCFString autorelease count 2
objc[6905]: [0x14300d1d0] 0x6000004d1e40 NSConcreteData
objc[6905]: [0x14300d1d8] 0x6000027cdbd0 Swift.__StringStorage
objc[6905]: [0x14300d1e0] 0x6000027cdc20 Swift.__StringStorage
objc[6905]: [0x14300d1e8] 0x6000027cdc70 Swift.__StringStorage
objc[6905]: [0x14300d1f0] 0x6000027cdcc0 Swift.__StringStorage
objc[6905]: [0x14300d1f8] 0x6000027cdd10 Swift.__StringStorage
objc[6905]: [0x14300d200] 0x6000027cdd60 Swift.__StringStorage
objc[6905]: [0x14300d208] 0x6000035cc960 __NSCFString autorelease count 6
objc[6905]: [0x14300d210] 0x144105550 CaptureMTLDevice autorelease count 2
objc[6905]: [0x14300d218] 0x600000a80330 __NSArrayM
objc[6905]: [0x14300d220] 0x600000a80360 __NSArrayM
objc[6905]: [0x14300d228] 0x6000004d2f40 __NSCFString
objc[6905]: [0x14300d230] 0x6000004d2ec0 __NSCFString
objc[6905]: [0x14300d238] 0x6000004d2ee0 __NSCFString
objc[6905]: [0x14300d240] 0x6000004d2f00 __NSCFString
objc[6905]: [0x14300d248] 0x6000008e7240 __NSCFString
objc[6905]: [0x14300d250] 0x6000008e7090 __NSCFString
objc[6905]: [0x14300d258] 0x6000008e70c0 __NSCFString
objc[6905]: [0x14300d260] 0x6000008e70f0 __NSCFString
objc[6905]: [0x14300d268] 0x6000004d2f20 __NSCFString
objc[6905]: [0x14300d270] 0x6000004d2ea0 __NSCFString
objc[6905]: [0x14300d278] 0x6000004d2fc0 __NSCFString
objc[6905]: [0x14300d280] 0x6000008e6e20 __NSArrayM
objc[6905]: [0x14300d288] 0x6000004d3140 __NSCFNumber
objc[6905]: [0x14300d290] 0x14304b800 __NSCFString
objc[6905]: [0x14300d298] 0x6000020cc780 MTLComputePipelineReflectionInternal
objc[6905]: ##############
iteration 2
objc[36563]: ##############
objc[36563]: AUTORELEASE POOLS for thread 0x203b9b240
objc[36563]: 16 releases pending.
objc[36563]: [0x14080a000] ................ PAGE (hot) (cold)
objc[36563]: [0x14080a038] ################ POOL 0x14080a038
objc[36563]: [0x14080a040] 0x600001f3c5a0 __NSSingleEntryDictionaryI
objc[36563]: [0x14080a048] 0x600003c202d0 NSBundle autorelease count 2
objc[36563]: [0x14080a050] 0x600001f2e7a0 __NSDictionaryM autorelease count 2
objc[36563]: [0x14080a058] 0x60000342c0e0 MTLCommandQueueDescriptorInternal
objc[36563]: [0x14080a060] 0x60000112c2a0 NSUserDefaults autorelease count 4
objc[36563]: [0x14080a068] 0x600001f3cb00 __NSSingleEntryDictionaryI
objc[36563]: [0x14080a070] 0x600001f3c4e0 __NSSingleEntryDictionaryI
objc[36563]: [0x14080a078] 0x600001f3cac0 __NSSingleEntryDictionaryI
objc[36563]: [0x14080a080] 0x600001f3cae0 __NSSingleEntryDictionaryI
objc[36563]: [0x14080a088] ################ POOL 0x14080a088
objc[36563]: [0x14080a090] 0x600000aa2740 MPSMatrixDescriptor
objc[36563]: [0x14080a098] 0x600000aa2780 MPSMatrixDescriptor
objc[36563]: [0x14080a0a0] 0x600000a21040 MPSMatrixDescriptor
objc[36563]: [0x14080a0a8] 0x141005410 CaptureMTLDevice autorelease count 6
objc[36563]: [0x14080a0b0] 0x600002d24510 AGXG13XFamilyComputeContext
objc[36563]: ##############
Iteration 1
objc[6186]: ##############
objc[6186]: AUTORELEASE POOLS for thread 0x203b9b240
objc[6186]: 20 releases pending.
objc[6186]: [0x12e00b000] ................ PAGE (hot) (cold)
objc[6186]: [0x12e00b038] 0x12d20c0f0 _NSSwiftProcessInfo
objc[6186]: [0x12e00b040] 0x12d304d20 Swift.__SwiftDeferredNSArray
objc[6186]: [0x12e00b048] 0x12d304f30 __NSCFCharacterSet
objc[6186]: [0x12e00b050] 0x12d3061e0 __NSCFString
objc[6186]: [0x12e00b058] 0x12c64cdf0 __NSCFString
objc[6186]: [0x12e00b060] 0x12c79d6b0 __NSCFString
objc[6186]: [0x12e00b068] ################ POOL 0x12e00b068
objc[6186]: [0x12e00b070] 0x11c635370 __NSCFString
objc[6186]: [0x12e00b078] 0x141619730 MPSMatrixDescriptor
objc[6186]: [0x12e00b080] 0x1491eabe0 MPSMatrixDescriptor
objc[6186]: [0x12e00b088] 0x14911b550 MPSMatrixDescriptor
objc[6186]: [0x12e00b090] 0x1496f2b40 __NSCFString
objc[6186]: [0x12e00b098] 0x1491a0d80 __NSCFString
objc[6186]: [0x12e00b0a0] 0x13b718b50 __NSBundleTables
objc[6186]: [0x12e00b0a8] 0x12d33d8e0 NSBundle autorelease count 3
objc[6186]: [0x12e00b0b0] 0x149152250 NSURL
objc[6186]: [0x12e00b0b8] 0x149111be0 __NSCFString
objc[6186]: [0x12e00b0c0] 0x14913f620 AGXG13XFamilyCommandBuffer
objc[6186]: [0x12e00b0c8] 0x14977a970 __NSArrayM
objc[6186]: [0x12e00b0d0] 0x14978b090 __NSArrayM
objc[6186]: ##############
Iteration 2
objc[6186]: ##############
objc[6186]: AUTORELEASE POOLS for thread 0x203b9b240
objc[6186]: 12 releases pending.
objc[6186]: [0x12e00b000] ................ PAGE (hot) (cold)
objc[6186]: [0x12e00b038] 0x12d20c0f0 _NSSwiftProcessInfo
objc[6186]: [0x12e00b040] 0x12d304d20 Swift.__SwiftDeferredNSArray
objc[6186]: [0x12e00b048] 0x12d304f30 __NSCFCharacterSet
objc[6186]: [0x12e00b050] 0x12d3061e0 __NSCFString
objc[6186]: [0x12e00b058] 0x12c64cdf0 __NSCFString
objc[6186]: [0x12e00b060] 0x12c79d6b0 __NSCFString
objc[6186]: [0x12e00b068] ################ POOL 0x12e00b068
objc[6186]: [0x12e00b070] 0x12c7ff7d0 __NSCFString
objc[6186]: [0x12e00b078] 0x13b7c3da0 MPSMatrixDescriptor
objc[6186]: [0x12e00b080] 0x13b7fc3b0 MPSMatrixDescriptor
objc[6186]: [0x12e00b088] 0x13b714930 MPSMatrixDescriptor
objc[6186]: [0x12e00b090] 0x148cb99a0 AGXG13XFamilyCommandBuffer
objc[6186]: ##############
[NSAutoreleasePool showPools]
> Apparently this looks like an ARC bug.
Are we using ARC in Julia?
We don’t use ARC, but the libraries we are using might have been compiled with ARC enabled.
When I turned off ARC in Xcode for the objc version, even with the @autoreleasepool blocks the NaNs show up.
> When I turned off ARC in Xcode for the objc version, even with the @autoreleasepool blocks the NaNs show up.
AFAIU -fobjc-arc makes the compiler automatically insert release/retain/autorelease calls, and doesn't affect how precompiled libraries like MPS may behave.
>> When I turned off ARC in Xcode for the objc version, even with the @autoreleasepool blocks the NaNs show up.
>
> AFAIU -fobjc-arc makes the compiler automatically insert release/retain/autorelease calls, and doesn't affect how precompiled libraries like MPS may behave.
That's my understanding too. However, from what I understand about our implementation of the @autoreleasepool macro, we're using an NSAutoreleasePool object and a [pool release]; statement at the end, which, according to the documentation, isn't possible with ARC on. By turning ARC off for the objc version, I was trying to reproduce the conditions of the failing Julia code.
The only thing is that I don't know if this information is actually helpful.
MWE: