Inference Problem - GitHub Issues

LKAMING97 commented 3 weeks ago

Why does it take so long to infer just two pictures?

LKAMING97 commented 3 weeks ago

python interleaved_generation.py -i 'Please introduce the city of Gyumri with pictures.' -s "./test/"

LKAMING97 commented 3 weeks ago

It was running for ages so I stopped

Instruction: draw a dog
Batch size: 2
VQModel loaded from data/tokenizer/vqgan.ckpt
^CTraceback (most recent call last):
  File "/root/autodl-tmp/anole/text2image.py", line 71, in <module>
    main(args)
  File "/root/autodl-tmp/anole/text2image.py", line 46, in main
    image_tokens: torch.LongTensor = model.generate(
  File "/root/autodl-tmp/anole/chameleon/inference/chameleon.py", line 665, in generate
    tokens = [t.id for t in self.stream(*args, **kwargs)]
  File "/root/autodl-tmp/anole/chameleon/inference/chameleon.py", line 665, in <listcomp>
    tokens = [t.id for t in self.stream(*args, **kwargs)]
  File "/root/autodl-tmp/anole/chameleon/inference/chameleon.py", line 649, in stream
    while key_token := self.dctx.res_q.get():
  File "/root/miniconda3/lib/python3.10/multiprocessing/queues.py", line 103, in get
    res = self._recv_bytes()
  File "/root/miniconda3/lib/python3.10/multiprocessing/connection.py", line 221, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/root/miniconda3/lib/python3.10/multiprocessing/connection.py", line 419, in _recv_bytes
    buf = self._recv(4)
  File "/root/miniconda3/lib/python3.10/multiprocessing/connection.py", line 384, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
^CException ignored in atexit callback: <function _exit_function at 0x7f7291b91b40>
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.10/multiprocessing/util.py", line 334, in _exit_function
    _run_finalizers(0)
  File "/root/miniconda3/lib/python3.10/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/root/miniconda3/lib/python3.10/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/root/miniconda3/lib/python3.10/multiprocessing/managers.py", line 674, in _finalize_manager
    process.join(timeout=1.0)
  File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/root/miniconda3/lib/python3.10/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
  File "/root/miniconda3/lib/python3.10/multiprocessing/connection.py", line 936, in wait
    ready = selector.select(timeout)
  File "/root/miniconda3/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt: 
^C

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.78                 Driver Version: 550.78         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:98:00.0 Off |                  N/A |
|  0%   25C    P8             22W /  370W |     564MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:B1:00.0 Off |                  N/A |
|  0%   26C    P8             15W /  370W |       4MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
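The nvidia-smi table above shows GPU 0 with 564 MiB already in use and GPU 1 with only 4 MiB. A minimal sketch for checking this programmatically before launching a run is given below. The helper names (`parse_gpu_memory`, `freest_gpu`, `query_gpus`) are hypothetical and not part of the anole repository; the sample values are copied from the table above.

```python
import subprocess
from typing import List, Tuple

# Real nvidia-smi flags: CSV rows of "index, used MiB, total MiB".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> List[Tuple[int, int, int]]:
    """Parse 'index, used, total' rows (MiB) into integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def freest_gpu(rows: List[Tuple[int, int, int]]) -> int:
    """Return the index of the GPU with the most free memory."""
    return max(rows, key=lambda row: row[2] - row[1])[0]


def query_gpus() -> List[Tuple[int, int, int]]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)


# Sample values taken from the nvidia-smi table above.
sample = "0, 564, 24576\n1, 4, 24576\n"
rows = parse_gpu_memory(sample)
print(rows)
print("GPU with most free memory:", freest_gpu(rows))
```

On a machine like the one above, the result could be passed through the standard CUDA_VISIBLE_DEVICES environment variable, for example `CUDA_VISIBLE_DEVICES=1 python interleaved_generation.py ...`. The difference between the two GPUs here is only about 0.5 GiB, so this may not be enough on its own.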

LKAMING97 commented 3 weeks ago

I can't generate anything in your example. What did I do wrong?

EthanC111 commented 2 weeks ago

Thank you for your interest! Inference on Anole-7b requires at least 20GB of GPU memory, so this might be a memory issue. Would you mind trying another GPU with more memory? Thanks!

LKAMING97 commented 2 weeks ago

Can't I run model inference on a single 24GB RTX 3090?

EthanC111 commented 2 weeks ago

Hi, quantization might be helpful: https://github.com/GAIR-NLP/anole/pull/21/files If not, please let us know! Thanks.
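As a rough back-of-the-envelope check on the 20GB figure and on why quantization could help, the memory taken by the weights alone can be estimated from the parameter count. This is only a sketch under stated assumptions: "7b" is read as 7 billion parameters, and the precisions listed are illustrative, since the thread does not state the checkpoint's dtype or the scheme used in the linked PR. Activations, the KV cache, and the image tokenizer come on top of these numbers.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights only, in GiB."""
    return n_params * bytes_per_param / GIB


# Assumption: "Anole-7b" means roughly 7 billion parameters.
N_PARAMS = 7e9

# Illustrative precisions; the actual checkpoint dtype is not given in the thread.
for label, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    print(f"{label:>9}: {weight_memory_gib(N_PARAMS, nbytes):5.1f} GiB")
```

Under these assumptions, 16-bit weights take about 13 GiB and 8-bit weights about 6.5 GiB, which is why quantization is a plausible thing to try on a 24GB card.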

LKAMING97 commented 1 week ago

I reinstalled the environment and ran it according to the steps, but this still happens.

Traceback (most recent call last):
  File "interleaved_generation.py", line 5, in <module>
    from chameleon.inference.chameleon import ChameleonInferenceModel, Options
  File "/root/autodl-tmp/anole/chameleon/inference/chameleon.py", line 32, in <module>
    from chameleon.inference import loader
  File "/root/autodl-tmp/anole/chameleon/inference/loader.py", line 13, in <module>
    from chameleon.inference.transformer import ModelArgs, Transformer
  File "/root/autodl-tmp/anole/chameleon/inference/transformer.py", line 19, in <module>
    class ModelArgs:
  File "/root/autodl-tmp/anole/chameleon/inference/transformer.py", line 24, in ModelArgs
    n_kv_heads: int | None = None
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'

LKAMING97 commented 1 week ago

How can I fix this?

LKAMING97 commented 1 week ago

I have been installing according to your steps, but I keep having problems, which makes me very frustrated.

Chaoran-F commented 1 week ago

Use Python 3.10, or change these annotations to the form "rank: Union[int, None] = None". I recommend using Python 3.10, because I found a lot of places that would need to change.
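The TypeError above is what Python versions before 3.10 raise when an annotation such as "int | None" is evaluated, since the "X | Y" union syntax for types was added in Python 3.10 (PEP 604). A minimal sketch of the rewrite suggested above follows. ModelArgsSketch is a hypothetical stand-in, not the repository's ModelArgs, and the dataclass decorator is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ModelArgsSketch:  # hypothetical stand-in, not the repository's class
    # On Python < 3.10 the following line fails when the class is defined:
    #     n_kv_heads: int | None = None
    n_kv_heads: Optional[int] = None  # Optional[int] is Union[int, None]
    rank: Union[int, None] = None     # the spelling suggested in the comment above


args = ModelArgsSketch()
print(args)
```

Both spellings work on older and newer Python versions alike. Because every such annotation in the code base would need the same change, switching the environment to Python 3.10 is likely the smaller fix.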

Lulahei commented 1 week ago

> Hi, quantization might be helpful: https://github.com/GAIR-NLP/anole/pull/21/files If not, please let us know! Thanks.

After I used the quantization function, the program still fails with an OutOfMemoryError:

Instruction: draw a dog
Batch size: 10
VQModel loaded from /data/mjl/model_zoo/Anole-7b-v0.1/tokenizer/vqgan.ckpt
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/data/mjl/anole-main/chameleon/inference/chameleon.py", line 495, in _worker_impl
    model = loader.load_model(model, rank=rank)
  File "/data/mjl/anole-main/chameleon/inference/loader.py", line 61, in load_model
    return _convert(
  File "/data/mjl/anole-main/chameleon/inference/loader.py", line 23, in _convert
    torch.load(str(consolidated_path), map_location='cuda'),
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1014, in load
    return _load(opened_zipfile,
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1422, in _load
    result = unpickler.load()
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1392, in persistent_load
    typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1366, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 1296, in restore_location
    return default_restore_location(storage, map_location)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
    result = fn(storage, location)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/serialization.py", line 279, in _cuda_deserialize
    return obj.cuda(device)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/site-packages/torch/_utils.py", line 114, in _cuda
    untyped_storage = torch.UntypedStorage(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 18.75 MiB is free. Process 46513 has 558.00 MiB memory in use. Including non-PyTorch memory, this process has 23.12 GiB memory in use. Of the allocated memory 22.83 GiB is allocated by PyTorch, and 1.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
^CTraceback (most recent call last):
  File "/data/mjl/anole-main/text2image.py", line 83, in <module>
    main(args)
  File "/data/mjl/anole-main/text2image.py", line 29, in main
    unquantized_model = ChameleonInferenceModel(
  File "/data/mjl/anole-main/chameleon/inference/chameleon.py", line 569, in __init__
    self.dctx.ready_barrier.wait()
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/threading.py", line 668, in wait
    self._wait(timeout)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/threading.py", line 703, in _wait
    if not self._cond.wait_for(lambda : self._state != 0, timeout):
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/multiprocessing/synchronize.py", line 313, in wait_for
    self.wait(waittime)
  File "/data/mjl/anaconda3/envs/anole/lib/python3.10/multiprocessing/synchronize.py", line 261, in wait
    return self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt

My device is also an RTX 3090, and I don't know how to solve this problem. If you can help me solve it, I would be extremely grateful.

Lulahei commented 1 week ago

Also, the free memory is 18.75 MiB before quantization and still 18.75 MiB after quantization. Is the function not working?

GAIR-NLP / anole

Inference Problem #42