GAIR-NLP / anole

Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation
https://huggingface.co/spaces/ethanchern/Anole

Is it possible to load the weights across multiple GPUs? #7

Open captainst opened 1 month ago

captainst commented 1 month ago

Hello Everyone,

As I don't have a single GPU with enough VRAM, I thought about modifying the loader.py to add accelerate support:

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
...

def _convert(model_args: ModelArgs, consolidated_path: Path) -> Transformer:
    ...
    # Build the model with empty ("meta") weights, then load the checkpoint
    # and let accelerate shard it across the available GPUs.
    with init_empty_weights():
        model = Transformer(model_args)

    model = load_checkpoint_and_dispatch(
        model,
        checkpoint=str(consolidated_path),
        device_map="auto",
    )
```

Unfortunately, it does not work. I figured out that the reason might be related to the fact that the Transformer class from chameleon uses a custom load hook that manipulates the tensors:

```python
def load_hook(
    self,
    state_dict,
    prefix,
    local_metadata,
    strict,
    missing_keys,
    unexpected_keys,
    error_msgs,
):
    if prefix + "wq.weight" in state_dict:
        wq = state_dict.pop(prefix + "wq.weight")
        wk = state_dict.pop(prefix + "wk.weight")
        wv = state_dict.pop(prefix + "wv.weight")
        state_dict[prefix + "wqkv.weight"] = torch.cat([wq, wk, wv])
```


I am wondering if anybody has a work-around for it.
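One direction I have not tried yet would be to merge the split attention weights in the checkpoint up front, so that the hook never has to fire and accelerate only ever sees the fused `wqkv.weight` keys. A rough, untested sketch (the key naming is an assumption inferred from the hook above):

```python
import torch

def merge_qkv_keys(state_dict: dict) -> dict:
    """Fuse separate wq/wk/wv tensors into the wqkv tensor the Transformer
    expects, mimicking chameleon's load_hook ahead of time."""
    merged = dict(state_dict)
    # Find every prefix that still carries a split "wq.weight" entry,
    # e.g. "layers.0.attention." (assumed naming, based on the hook above).
    prefixes = [k[: -len("wq.weight")] for k in state_dict if k.endswith("wq.weight")]
    for prefix in prefixes:
        wq = merged.pop(prefix + "wq.weight")
        wk = merged.pop(prefix + "wk.weight")
        wv = merged.pop(prefix + "wv.weight")
        merged[prefix + "wqkv.weight"] = torch.cat([wq, wk, wv])
    return merged

# Usage sketch: rewrite the consolidated checkpoint once, then point
# load_checkpoint_and_dispatch at the merged file instead of the original.
# ckpt = torch.load(consolidated_path, map_location="cpu")
# torch.save(merge_qkv_keys(ckpt), "consolidated_merged.pth")
```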

Many thanks!
EthanC111 commented 1 month ago

Thanks for bringing up this issue! We will take time to look into it (but it may take a while).

captainst commented 1 month ago

@EthanC111 Thank you! I went through some pitfalls over the weekend and managed to distribute the weights across different GPUs using the Hugging Face accelerate library.

The current blocker seems to be that chameleon uses torch.distributed and torch.multiprocessing during inference, while the HF accelerate library also uses these for multi-GPU inference.

I am thinking of modifying the chameleon implementation to get rid of torch.distributed and see whether that resolves the conflict.
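Concretely, guarding the process-group setup might already be enough, something like this (untested sketch; the backend, address and port are placeholders, not values from the repo):

```python
import os
import torch.distributed as dist

# Only set up a single-rank process group if accelerate (or anything else)
# has not already initialized torch.distributed.
if not dist.is_initialized():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder
    os.environ.setdefault("MASTER_PORT", "29501")      # placeholder
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
```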

captainst commented 1 month ago

It turned out that the CUDA graph conflict was caused by `self._forward = cudagraph_wrap(self._model.forward_with_attn_bias)` inside model_adapter.py.

After removing the `cudagraph_wrap` call (so that `self._forward = self._model.forward_with_attn_bias`), I was able to run inference on multiple 2080 Ti cards using `load_checkpoint_and_dispatch` from the accelerate library, specifying `dtype=torch.float32` (since the 2080 Ti does not support bf16). Some other modifications were needed to use a custom `device_map`. It took quite a few minutes to run the sample with batch size 1: `python text2image.py -i 'draw a dog' -b 1`
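For reference, my dispatch call ended up looking roughly like this (the `no_split_module_classes` entry and the per-GPU memory budget are assumptions from my local edits, not something the repo ships):

```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

with init_empty_weights():
    model = Transformer(model_args)  # chameleon's Transformer, as in loader.py

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=str(consolidated_path),
    device_map="auto",
    # Keep each decoder layer on a single GPU; the class name is an assumption.
    no_split_module_classes=["TransformerBlock"],
    dtype=torch.float32,  # the 2080 Ti has no bf16 support
    # Leave some headroom on each 11 GB card.
    max_memory={i: "10GiB" for i in range(torch.cuda.device_count())},
)
```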

But it produces bad images.

I am wondering if anyone has an environment with multiple GPUs of compute capability >= 8.0, which could be used to test with the original bf16.

weweus commented 4 weeks ago

> It turned out that the CUDA graph conflict was caused by `self._forward = cudagraph_wrap(self._model.forward_with_attn_bias)` inside model_adapter.py.
>
> After removing the `cudagraph_wrap` call (so that `self._forward = self._model.forward_with_attn_bias`), I was able to run inference on multiple 2080 Ti cards using `load_checkpoint_and_dispatch` from the accelerate library, specifying `dtype=torch.float32` (since the 2080 Ti does not support bf16). Some other modifications were needed to use a custom `device_map`. It took quite a few minutes to run the sample with batch size 1: `python text2image.py -i 'draw a dog' -b 1`
>
> But it produces bad images.
>
> I am wondering if anyone has an environment with multiple GPUs of compute capability >= 8.0, which could be used to test with the original bf16.

I encountered the same problem, and I am wondering whether performing the dtype conversion with `torch.Tensor.to()` (bf16 to fp32) will result in a loss of precision.
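A quick way to check the round-trip locally might be something like this (just a sanity-check sketch):

```python
import torch

# Upcast bf16 weights to fp32, cast back, and see whether any values change;
# any precision loss in the conversion itself would show up here.
w_bf16 = torch.randn(1024, 1024, dtype=torch.bfloat16)
w_fp32 = w_bf16.to(torch.float32)
print(torch.equal(w_bf16, w_fp32.to(torch.bfloat16)))
```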