city96 / ComfyUI-GGUF

GGUF Quantization support for native ComfyUI models
Apache License 2.0

macOS 15.0 (24A335) M1 buffer is not large enough and resource_tracker: There appear to be %d #107

Open guoreex opened 1 month ago

guoreex commented 1 month ago

I'm not sure if this question is appropriate to ask here. I'm not a professional programmer, so if anyone is willing to offer help and guidance, I would be very grateful.

Two weeks ago I started using the GGUF model, and it worked normally. Today I upgraded my MacBook Pro M1 to the latest macOS 15.0 (24A335), and now an error occurs when running a GGUF workflow in ComfyUI:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/AppleInternal/Library/BuildRoots/5a8a3fcc-55cb-11ef-848e-8a553ba56670/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:891: failed assertion `[MPSNDArray, initWithBufferImpl:offset:descriptor:isForNDArrayAlias:isUserBuffer:] Error: buffer is not large enough. Must be 63700992 bytes
'/Users/***/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

My system information:

- Python version: 3.11.5 (main, Sep 11 2023, 08:31:25) [Clang 14.0.6]
- pytorch version: 2.6.0.dev20240916
- ComfyUI Revision: 2701 [7183fd16] | Released on '2024-09-17'

I don't know if this is related to updating the system. Thanks.

city96 commented 1 month ago

Could you test with the FP16/FP8 model and the default nodes, without the custom node pack? If it still happens with those, this might be more appropriate for the ComfyUI repo, since the error makes it sound like it's not a problem with this node pack. I could be wrong though.

It also sounds like you can set the environment variable `TOKENIZERS_PARALLELISM=false` to possibly fix it? Might be worth testing.
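
Something like this before launching, for example (the `python main.py` line is just a typical ComfyUI setup; adjust it to yours):

```sh
# Set the variable in the shell that launches ComfyUI, then start it as usual
export TOKENIZERS_PARALLELISM=false
python main.py  # or however you normally start ComfyUI
```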

guoreex commented 1 month ago

Thank you for your reply.

My computer only has 16 GB of RAM, which is not enough to run the FP8 model.

I set `export TOKENIZERS_PARALLELISM=false`, but there are still errors:

...
Requested to load FluxClipModel_
Loading 1 new model
loaded completely 0.0 323.94775390625 True
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
loaded completely 0.0 6456.9610595703125 True
  0%|                                                              | 0/4 [00:00<?, ?it/s]/AppleInternal/Library/BuildRoots/5a8a3fcc-55cb-11ef-848e-8a553ba56670/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:891: failed assertion `[MPSNDArray, initWithBufferImpl:offset:descriptor:isForNDArrayAlias:isUserBuffer:] Error: buffer is not large enough. Must be 63700992 bytes
'
/Users/***/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The error occurs once generation starts.

city96 commented 1 month ago

Well, at least there's a progress bar now, lol. The buffer error is still there though...

I don't have any Apple device to test on, but it looks like there's a similar issue on the PyTorch tracker with a linked PR; not sure if the cause is the same though. Might be worth keeping an eye on and testing on the latest nightly once it gets merged? https://github.com/pytorch/pytorch/issues/136132
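
For reference, the macOS nightly install from pytorch.org is currently along these lines (double-check the site for the exact current command):

```sh
# PyTorch nightly for macOS (the nightly/cpu index serves the Apple Silicon / MPS wheels)
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
```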

tombearx commented 1 month ago

I still have the issue using today's nightly build. Anyone else?

thenabytes commented 1 month ago

M2 MacBook Air, 16 GB RAM, Sequoia 15.0

- Python version: 3.12.6 (main, Sep 6 2024, 19:03:47) [Clang 15.0.0 (clang-1500.3.9.4)]
- pytorch version: 2.6.0.dev20240923
- ComfyUI Revision: 2724 [3a0eeee3] | Released on '2024-09-23'

Requested to load Flux
Loading 1 new model
loaded completely 0.0 7867.7110595703125 True
  0%|                                                     | 0/20 [00:00<?, ?it/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/AppleInternal/Library/BuildRoots/5a8a3fcc-55cb-11ef-848e-8a553ba56670/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:891: failed assertion `[MPSNDArray, initWithBufferImpl:offset:descriptor:isForNDArrayAlias:isUserBuffer:] Error: buffer is not large enough. Must be 77856768 bytes

jonny7737 commented 1 month ago

M2 Max Mac Studio, 64 GB RAM, Sequoia 15.0, Python 3.11.9

Only when running GGUF models (fp16 fp8 work fine)

/AppleInternal/Library/BuildRoots/5a8a3fcc-55cb-11ef-848e-8a553ba56670/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:891: failed assertion `[MPSNDArray, initWithBufferImpl:offset:descriptor:isForNDArrayAlias:isUserBuffer:] Error: buffer is not large enough. Must be 77856768 bytes

<<Slight correction: flux1-dev-Q8_0.GGUF WORKS!!>> Correcting the correction: Q8 does not work (working test was before Sequoia)

tombearx commented 1 month ago

> M2 Max Mac Studio, 64 GB RAM, Sequoia 15.0, Python 3.11.9
>
> Only when running GGUF models (fp16 fp8 work fine)
>
> /AppleInternal/Library/BuildRoots/5a8a3fcc-55cb-11ef-848e-8a553ba56670/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:891: failed assertion `[MPSNDArray, initWithBufferImpl:offset:descriptor:isForNDArrayAlias:isUserBuffer:] Error: buffer is not large enough. Must be 77856768 bytes
>
> Slight correction: flux1-dev-Q8_0.GGUF WORKS!!

Does Q8 work? What PyTorch version are you using?

jonny7737 commented 1 month ago

> M2 Max Mac Studio, 64 GB RAM, Sequoia 15.0, Python 3.11.9
> Only when running GGUF models (fp16 fp8 work fine)
> /AppleInternal/Library/BuildRoots/5a8a3fcc-55cb-11ef-848e-8a553ba56670/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:891: failed assertion `[MPSNDArray, initWithBufferImpl:offset:descriptor:isForNDArrayAlias:isUserBuffer:] Error: buffer is not large enough. Must be 77856768 bytes
> Slight correction: flux1-dev-Q8_0.GGUF WORKS!!

> Does Q8 work? What PyTorch version are you using?

I just retested Q8 and it does not work :( Working test was before Sequoia. Sorry for the false hope.

jonny7737 commented 1 month ago

This is the only GGUF that I have found to work since the Sequoia update:

https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-F16.gguf

tombearx commented 1 month ago

Guys, I've tested torch==2.4.1 and it works for GGUF Q8.
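
If anyone else wants to try the same workaround in their ComfyUI environment, the downgrade is roughly this (the torchvision/torchaudio pins are the companion versions I believe match 2.4.1; adjust if your setup differs):

```sh
# Roll back to the last PyTorch release reported to work with GGUF on Sequoia
pip install "torch==2.4.1" "torchvision==0.19.1" "torchaudio==2.4.1"
```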

jonny7737 commented 1 month ago

What is the Mac config for your test? Which M chip, and how much RAM?

Can't install pytorch==2.4.1 because it requires Python < 3.9.

tombearx commented 1 month ago

Strange, I use Python 3.11.

M1 Max, 32 GB.

jonny7737 commented 1 month ago

I use 3.11 as well, but the install of torch 2.4.1 failed due to the Python version. Very strange. I'll try again. Thanks.

bauerwer commented 1 month ago

Same issue here: Flux GGUFs bail out with a memory allocation error in MPS (Error: buffer is not large enough. Must be 77856768 bytes). This worked on macOS 14.x but no longer works on macOS 15.x; same issue with torch 2.4.1 and 2.6.0.dev20240924 (last week's nightly). For reference, since I have the headroom to run heavier Flux (M3 Max, 128 GB RAM), the full-size Flux models work fine. I'd still love to run GGUFs though, for the lower RAM usage and the speed.

jonny7737 commented 1 month ago

FINALLY!!! After 6 tries to get pytorch 2.4.1 to install, the install completed successfully. A simple test with a Q5 GGUF model did not abort ComfyUI, but the image generated at an absolutely appalling 45 seconds per iteration.

It works but is not usable.

cchance27 commented 3 weeks ago

There's something going on with the nightly builds: for some reason the 2.6 nightlies all break the GGUF code. With 32 GB, a setup that works fine with Q8 on 2.4.1 fails every time with this semaphore error when moved to a nightly.

I can't say if it's specifically Sequoia + the 2.6 nightlies, but I can confirm Sequoia + 2.4.1 + GGUF works fine, while Sequoia + 2.6 + GGUF bails every time.

This is super annoying because the 2.6 nightly finally added support for autocast on MPS.

craii commented 3 weeks ago

> There's something going on with the nightly builds: for some reason the 2.6 nightlies all break the GGUF code. With 32 GB, a setup that works fine with Q8 on 2.4.1 fails every time with this semaphore error when moved to a nightly.
>
> I can't say if it's specifically Sequoia + the 2.6 nightlies, but I can confirm Sequoia + 2.4.1 + GGUF works fine, while Sequoia + 2.6 + GGUF bails every time.
>
> This is super annoying because the 2.6 nightly finally added support for autocast on MPS.

Thank you bro! By using pytorch 2.4.1, it works again!

craii commented 3 weeks ago

> There's something going on with the nightly builds: for some reason the 2.6 nightlies all break the GGUF code. With 32 GB, a setup that works fine with Q8 on 2.4.1 fails every time with this semaphore error when moved to a nightly.
>
> I can't say if it's specifically Sequoia + the 2.6 nightlies, but I can confirm Sequoia + 2.4.1 + GGUF works fine, while Sequoia + 2.6 + GGUF bails every time.
>
> This is super annoying because the 2.6 nightly finally added support for autocast on MPS.

@city96 Hello. I think this could be added to the README as a temporary fix guide.

city96 commented 3 weeks ago

@craii Added it under the installation section w/ a link to this issue thread.

cchance27 commented 3 weeks ago

> appalling 45 seconds per iteration.

Just so you know, I haven't tested them all, but with Q8_0 on M3 and torch 2.4.1 I get ~16-17 s/it... with Q5 and Q8_4 (I've been playing with custom quants) it's 40-50 s/it, which is insane. Not sure why it's so bad, but yeah, Q8_0 loads and runs fastest so far.

Vargol commented 3 weeks ago

Q8 is faster because it can run fully on the GPU, while the others use a shift function that has to fall back to running on the CPU.

For example, if Comfy is not hiding it, you should see something like this in the terminal:

The operator 'aten::__rshift__.Tensor' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)

when using the other models; this was taken from a Q6_K run in InvokeAI.
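
As a rough illustration, assuming a Mac with MPS and ComfyUI's usual `PYTORCH_ENABLE_MPS_FALLBACK=1` setting, a tensor bit-shift alone should reproduce that fallback warning (without the env var the op raises NotImplementedError instead):

```sh
# Hypothetical repro sketch of the CPU fallback for shift ops on the MPS backend
PYTORCH_ENABLE_MPS_FALLBACK=1 python -c "
import torch
x = torch.randint(0, 255, (8,), dtype=torch.uint8, device='mps')
s = torch.full((8,), 4, dtype=torch.uint8, device='mps')
print(x >> s)  # sub-8-bit GGUF dequantization relies on shifts like this
"
```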

jack813 commented 2 weeks ago

As of the PyTorch nightly 2.6.0.dev20241020 build, the problem has been fixed. I can run the GGUF-quantized Flux.1 Dev Q4_0 model on my MacBook M1 Pro with 16 GB of memory.

(Screenshots attached: 2024-10-21 12:00:15 and 2024-10-21 12:17:21)
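
For anyone who wants to pin that exact build rather than whatever today's nightly is, something like this should work, assuming the dated wheel is still available on the nightly index:

```sh
# Pin the specific nightly build reported to work (availability on the index may vary)
pip3 install --pre "torch==2.6.0.dev20241020" --index-url https://download.pytorch.org/whl/nightly/cpu
```
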
jonny7737 commented 2 weeks ago

> As of the PyTorch nightly 2.6.0.dev20241020 build, the problem has been fixed. I can run the GGUF-quantized Flux.1 Dev Q4_0 model on my MacBook M1 Pro with 16 GB of memory.

M2 Max, 64 GB: after installing the 20241020 nightly, GGUF seems to work again. Thanks for the heads up.

jeanjerome commented 2 weeks ago

I also managed to get a GGUF working with pytorch 2.6.0.dev20241020 py3.10_0 pytorch-nightly on Sequoia 15.0.1.

ReZeroS commented 1 week ago

`conda install pytorch-nightly::pytorch torchvision torchaudio -c pytorch-nightly`

jeanjerome commented 1 week ago

Or simply `conda install pytorch torchvision torchaudio -c pytorch-nightly` (https://developer.apple.com/metal/pytorch/)

craii commented 1 week ago

M3, 24 GB: works properly on the Q4 schnell model after installing the pytorch dev20241020 nightly. But it seems to consume much more memory with the same generation parameters (it now takes 25~29 GB while only 17~20 GB was used before).

cchance27 commented 1 week ago

Just use 2.4.1, not the nightly, and report the regression to the PyTorch team; they already fixed some of the other regressions.