Closed kalle07 closed 1 month ago
Hmm, no this doesn't make much sense.
It should not be this slow. BUT, this line in your log worries me: `Some parameters are on the meta device because they were offloaded to the cpu.`
Can you run the `nvcc --version` command in your active venv and paste the results here?
My gut feeling is that CUDA is still not installed correctly for you. Same as the issue in the other thread.
20 images takes 2-3 minutes on a 3090 (24 GB).
Also check out the FAQ at the bottom of the page: https://github.com/MNeMoNiCuZ/joy-caption-batch?tab=readme-ov-file#not-using-the-right-gpu
If it detects multiple GPUs and isn't using the correct one, you may need to add the line of code described there.
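The FAQ fix can be sketched like this (a hedged sketch, not necessarily the exact line in `batch.py`): pin the process to one GPU before torch initializes CUDA.

```python
# Sketch (assumption: GPU index 0 is the card you want): restrict which GPUs
# CUDA exposes to this process. This must run before torch touches CUDA.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # index as reported by nvidia-smi

# From then on, torch only sees the selected card, e.g.:
# import torch
# device = torch.device("cuda")  # resolves to the GPU chosen above
```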
Additionally, you may want to enable a quantized version by setting the following option to true:
LOW_VRAM_MODE=true
And lastly, what size are the images you are converting? If they are too big, this may cause OOM issues, which could lead the memory to overflow to the CPU or RAM or something. I don't know exactly how that works, but it could be an issue. Make sure the images are reasonably sized; no side should need to be larger than 1024 px. Some of my other batch scripts have resizing functionality, but I didn't add it to this one as it doesn't need it.
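If you want to pre-shrink images yourself, a minimal sketch (not part of this repo; the Pillow call is an assumption and shown commented) could look like:

```python
# Cap the longest image side at 1024 px before captioning.
def capped_size(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Scale (width, height) down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave untouched
    scale = max_side / longest
    return round(width * scale), round(height * scale)

# With Pillow installed, the actual resize would be (assumption, untested here):
# from PIL import Image
# img = Image.open(path)
# img = img.resize(capped_size(*img.size), Image.LANCZOS)
```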
```
(venv) g:\caption\joy-caption-batch>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

(venv) g:\caption\joy-caption-batch>
```
Maybe only CUDA 12.1 works?
It is using the right device; I can see it in GPU-Z.
With low VRAM set to true, it loads a new model, "llama 3". Why not "llama 3.1"? https://huggingface.co/unsloth/Meta-Llama-3.1-8B-bnb-4bit ;)
Yes, now it works faster. (The images are around 1024 px, so that should not be the issue.)
> why not "llama 3.1"

I've heard that 3.0 performs better. But you can change it to whatever you want. I haven't run any tests myself, so I kept it at 3.0 since I know this one works great.
> maybe only 12.1 work ?

Maybe, but it sounds like it works for you now? So I guess it was a VRAM issue.
Is that normal on a 16 GB RTX 4060?
```
(venv) g:\caption\joy-caption-batch>g:\caption\joy-caption-batch\venv\Scripts\python.exe batch.py
Captioning Batch Images Initializing...
image_adapter.pt already exists.
Captioning Initializing
Found 21 files to process in g:\caption\joy-caption-batch\input
Loading CLIP
Loading tokenizer
Loading LLM
Loading checkpoint shards: 100%|█████████████████████████████| 4/4 [00:03<00:00, 1.07it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
Loading image adapter
g:\caption\joy-caption-batch\batch.py:135: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  image_adapter.load_state_dict(torch.load(CHECKPOINT_PATH / "image_adapter.pt", map_location="cpu"))
Processing images:   0%| | 0/21 [00:00<?, ?it/s]
g:\caption\joy-caption-batch\venv\lib\site-packages\transformers\models\siglip\modeling_siglip.py:573: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Processing images:  10%|█████▊ | 2/21 [06:47<1:04:02, 202.25s/it]
```

I have other captioning software with 16 GB+ models where one image needs 3 s/it, not 200 s/it like this.
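As an aside on the `FutureWarning` in that log: it exists because `torch.load` with `weights_only=False` goes through pickle, and unpickling untrusted data can run arbitrary code. A minimal stdlib-only demonstration of why (no torch needed), with the likely one-line fix for `batch.py` shown as a comment (an assumption, not tested here):

```python
# Why weights_only matters: unpickling can execute arbitrary callables.
import pickle

class Sneaky:
    def __reduce__(self):
        # Whatever callable is returned here runs during pickle.loads().
        return (print, ("this ran during unpickling!",))

payload = pickle.dumps(Sneaky())
result = pickle.loads(payload)  # prints as a side effect; returns print's result

# The probable fix for batch.py line 135 (assumption, requires torch):
# image_adapter.load_state_dict(
#     torch.load(CHECKPOINT_PATH / "image_adapter.pt",
#                map_location="cpu", weights_only=True)
# )
```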