MNeMoNiCuZ / joy-caption-batch

A batch captioning tool for joy_caption
MIT License

question: speed #15

Closed · kalle07 closed this 1 month ago

kalle07 commented 1 month ago

Is this normal on a 16GB RTX 4060?

```
(venv) g:\caption\joy-caption-batch>g:\caption\joy-caption-batch\venv\Scripts\python.exe batch.py
Captioning Batch Images Initializing...
image_adapter.pt already exists.
Captioning Initializing
Found 21 files to process in g:\caption\joy-caption-batch\input
Loading CLIP
Loading tokenizer
Loading LLM
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.07it/s]
Some parameters are on the meta device because they were offloaded to the cpu.
Loading image adapter
g:\caption\joy-caption-batch\batch.py:135: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  image_adapter.load_state_dict(torch.load(CHECKPOINT_PATH / "image_adapter.pt", map_location="cpu"))
Processing images:   0%|          | 0/21 [00:00<?, ?it/s]g:\caption\joy-caption-batch\venv\lib\site-packages\transformers\models\siglip\modeling_siglip.py:573: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Processing images:  10%|█████▊    | 2/21 [06:47<1:04:02, 202.25s/it]
```

I have other captioning software that also uses 16GB+ models, and there one image takes ~3 s/it, not 200 s/it like this.

MNeMoNiCuZ commented 1 month ago

Hmm, no, this doesn't make much sense. It should not be this slow. BUT, this line in your log worries me: `Some parameters are on the meta device because they were offloaded to the cpu.` That means part of the model didn't fit in VRAM and got offloaded, which would explain the slowdown.

Can you run the `nvcc --version` command in your active venv and paste the results here? My gut feeling is that CUDA is still not installed correctly for you, same as the issue in the other thread.
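You can also check, independent of nvcc, whether the PyTorch build inside the venv actually sees the GPU. This is a generic diagnostic, not something from this repo:

```python
# Quick sanity check of the venv's PyTorch CUDA support.
import torch

print(torch.__version__)          # a "+cpu" suffix means a CPU-only wheel is installed
print(torch.cuda.is_available())  # must be True for the model to run on the GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```

If `is_available()` prints False, the fix is usually reinstalling torch with the matching CUDA wheel, not anything in batch.py.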

For reference: 20 images take 2-3 minutes with a 3090 24GB.

Also check out the FAQ at the bottom of the page: https://github.com/MNeMoNiCuZ/joy-caption-batch?tab=readme-ov-file#not-using-the-right-gpu

You may need to add a line of code if it's not utilizing the correct GPU when multiple are detected; see the sketch below.
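Not quoting the FAQ verbatim, but the usual way to pin a PyTorch process to one GPU looks like this (device index 0 is just an example):

```python
import os

# Must be set before torch initializes CUDA, i.e. before `import torch`.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
torch.cuda.set_device(0)  # redundant with the env var; shown for completeness
```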

And additionally, you may want to enable the quantized version by setting the following to true:

LOW_VRAM_MODE=true
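For context, low VRAM mode loads a 4-bit quantized checkpoint instead of the full-precision one. Roughly, loading such a model with transformers + bitsandbytes looks like the sketch below; the exact code and model id used in batch.py may differ:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization keeps an 8B LLM comfortably under 16GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/llama-3-8b-bnb-4bit",  # placeholder id; see the model discussion below
    quantization_config=bnb_config,
    device_map="auto",
)
```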

And lastly, what size are the images you are converting? If they are too big, this may cause OOM issues, which could make the model spill over to CPU/RAM. I don't know exactly how that offloading works, but it could be an issue. Make sure the images are reasonably sized; no side should need to be larger than 1024px. Some of my other batch scripts have resizing functionality, but I didn't add it to this one since it doesn't need it. If you want to downscale yourself, see the sketch below.
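A minimal PIL sketch for capping the longest side at 1024px, in place (the input folder name is an assumption):

```python
from pathlib import Path
from PIL import Image

INPUT_DIR = Path("input")  # assumed folder; point this at your image directory

for path in INPUT_DIR.iterdir():
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
        continue
    with Image.open(path) as img:
        img.load()  # read pixel data so we can safely overwrite the file
        if max(img.size) > 1024:
            img.thumbnail((1024, 1024))  # resizes in place, keeps aspect ratio
            img.save(path)
```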

kalle07 commented 1 month ago

```
(venv) g:\caption\joy-caption-batch>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:42:34_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

(venv) g:\caption\joy-caption-batch>
```

Maybe only CUDA 12.1 works?

It is using the right device; I can see it in GPU-Z.

With low VRAM set to true, it loads a new model, "llama 3". Why not "llama 3.1"? https://huggingface.co/unsloth/Meta-Llama-3.1-8B-bnb-4bit ;)

Yes, it works faster now. (The images are around 1024px, so size should not be the issue.)

MNeMoNiCuZ commented 1 month ago

> Why not "llama 3.1"?

I've heard that 3.0 performs better, but you can change it to whatever you want. I haven't run any tests myself, so I kept it at 3.0 since I know this one works great. Swapping models is just a matter of changing the model id; see the sketch below.
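For example, assuming batch.py stores the checkpoint id in a constant (the name here is hypothetical), switching to 3.1 would look like:

```python
# Hypothetical constant name; find the actual model id string in batch.py.
MODEL_PATH = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"  # 4-bit Llama 3.1 instead of 3.0
```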

> Maybe only CUDA 12.1 works?

Maybe, but it sounds like it works for you now? So I guess it was a VRAM issue.