FMInference / FlexGen

Running large language models on a single GPU for throughput-oriented scenarios.
Apache License 2.0
9.14k stars 540 forks

Add Erebus and GALACTICA support #40

Open Sumanai opened 1 year ago

Sumanai commented 1 year ago

Hello! I propose adding support for the Erebus family of models, which are fine-tunes of the original OPT. I looked at the code: support is not too difficult to add, and I was able to run a couple of the models without major code modification. I can provide a PR if needed. Here is a link to one of the models; the rest can be found alongside it: https://huggingface.co/KoboldAI/OPT-2.7B-Erebus
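For illustration, here is a minimal sketch of the kind of change involved, assuming FlexGen dispatches on the model-name string in get_opt_config; the alias table and helper below are hypothetical, not FlexGen's actual code:

```python
# Erebus checkpoints are OPT fine-tunes with unchanged dimensions, so the
# name can simply be mapped onto the base OPT config. Hypothetical helper.
EREBUS_ALIASES = {
    "opt-2.7b-erebus": "opt-2.7b",
    "opt-6.7b-erebus": "opt-6.7b",
    "opt-13b-erebus": "opt-13b",
}

def resolve_base_model(name: str) -> str:
    """Map a fine-tuned model name onto the base OPT config it shares."""
    short = name.split("/")[-1].lower()  # drop a "KoboldAI/"-style org prefix
    return EREBUS_ALIASES.get(short, short)

assert resolve_base_model("KoboldAI/OPT-2.7B-Erebus") == "opt-2.7b"
```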

oobabooga commented 1 year ago

GALACTICA support would be nice as well. Can FlexGen be generalized to all OPTForCausalLM models?
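One way to sketch that generalization, assuming the hardcoded per-model table could be replaced by reading the checkpoint's Hugging Face config (the field mapping below is an assumption to verify against opt_config.py):

```python
from transformers import AutoConfig

def opt_dims_from_hf(model_id: str) -> dict:
    """Derive OPT-family dimensions from any OPTForCausalLM checkpoint."""
    cfg = AutoConfig.from_pretrained(model_id)
    return dict(
        max_seq_len=cfg.max_position_embeddings,
        num_hidden_layers=cfg.num_hidden_layers,
        n_head=cfg.num_attention_heads,
        hidden_size=cfg.hidden_size,
        input_dim=cfg.word_embed_proj_dim,  # can differ from hidden_size
        ffn_embed_dim=cfg.ffn_dim,
        vocab_size=cfg.vocab_size,
    )

# e.g. opt_dims_from_hf("KoboldAI/OPT-2.7B-Erebus")
```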

Sumanai commented 1 year ago

Unfortunately, my attempt to add GALACTICA the same way failed. The problem seems to be missing handling for parameters like attention_dropout, but that is purely a guess. After loading, an error appears in the logs at the first generation (I removed the repeating parts):

```
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Indexing.cu:1141: block: [223,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\Indexing.cu:1141: block: [223,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
  File "c:\users\username\flexgen\flexgen\flex_opt.py", line 873, in generate
    self.generation_loop_overlap_single_batch()
  File "c:\users\username\flexgen\flexgen\flex_opt.py", line 1013, in generation_loop_overlap_single_batch
    self.sync()
  File "c:\users\username\flexgen\flexgen\flex_opt.py", line 782, in sync
    torch.cuda.synchronize()
  File "C:\Users\username\AppData\Roaming\Python\Python310\site-packages\torch\cuda\__init__.py", line 566, in synchronize
    return torch._C._cuda_synchronize()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

If we can solve this problem, we can remove some of the hardcoded values and let you load any model based on OPTForCausalLM.
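For what it's worth, this particular assert (srcIndex < srcSelectDimSize) usually means an index_select-style lookup received out-of-range indices, for example token ids past the end of an embedding table; GALACTICA's vocabulary (50000) is smaller than OPT's (50272), so a size mismatch there is one plausible culprit. A hypothetical CPU-side check, not FlexGen code:

```python
import torch

def check_ids_in_range(input_ids: torch.Tensor, num_embeddings: int) -> None:
    """Raise a readable error instead of a CUDA device-side assert."""
    bad = input_ids[(input_ids < 0) | (input_ids >= num_embeddings)]
    if bad.numel() > 0:
        raise ValueError(
            f"{bad.numel()} token id(s) fall outside the embedding table of "
            f"size {num_embeddings}, e.g. {bad[:5].tolist()}"
        )

# e.g. check_ids_in_range(input_ids, config.vocab_size) before generation
```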

Ying1123 commented 1 year ago

GALACTICA support would be cool! I think FlexGen can be generalized to OPTForCausalLM very easily. The error reported by @Sumanai looks weird to me; it needs more investigation.

Ph0rk0z commented 1 year ago

Is this just partial support? https://github.com/FMInference/FlexGen/pull/83

oobabooga commented 1 year ago

I have tried loading galactica-30b and I got this error:

    opt_config.py", line 118, in get_opt_config
        raise ValueError(f"Invalid model name: {name}")

ValueError: Invalid model name: galactica-30b

Not sure if that commit has already made it to flexgen==0.1.7 or if it is enough to load GALACTICA.
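For reference, that ValueError comes from the hardcoded name table in get_opt_config, so galactica-30b would need its own entry. Here is a sketch of the missing dimensions, taken from the published facebook/galactica-30b config; the key names mirror FlexGen's OPT entries and are assumptions to verify against opt_config.py:

```python
# Dimensions from the facebook/galactica-30b Hugging Face config; key names
# follow FlexGen's OPT entries and should be checked against opt_config.py.
GALACTICA_30B_DIMS = dict(
    max_seq_len=2048,      # max_position_embeddings
    num_hidden_layers=48,
    n_head=56,
    hidden_size=7168,
    input_dim=7168,        # word_embed_proj_dim
    ffn_embed_dim=28672,   # ffn_dim = 4 * hidden_size
    vocab_size=50000,      # smaller than OPT's 50272
)
```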

apenugon commented 1 year ago

I got an error similar to @Sumanai's when using Erebus-13b on a 3080, once the text length gets long:

```
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [6,0,0], thread: [91,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [6,0,0], thread: [92,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [6,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [6,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:922: indexSelectSmallIndex: block: [6,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
```

I tried changing the policy parameters, but nothing seems to work.
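Since the assert only fires once the text gets long, one plausible cause is position ids running past OPT's 2048-entry position-embedding table rather than a vocabulary mismatch. A hypothetical guard, not FlexGen code:

```python
def check_sequence_budget(prompt_len: int, gen_len: int,
                          max_seq_len: int = 2048) -> None:
    """Fail early if generation would index past the position embeddings."""
    total = prompt_len + gen_len
    if total > max_seq_len:
        raise ValueError(
            f"prompt_len + gen_len = {total} exceeds max_seq_len="
            f"{max_seq_len}; truncate the prompt or reduce gen_len"
        )
```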

fgdfgfthgr-fox commented 1 year ago

I managed to make FlexGen work with the Galactica-1.3b model by changing opt_config.py, flex_opt.py, and tokenizer_config.json. @oobabooga's web UI can successfully load the model and generate text with it, and VRAM use decreased as expected. However, all the generated text is gibberish (it is not due to the parameter preset). Maybe someone would be interested in taking a closer look? I can upload the files I modified; I am not really a programming or ML expert. [screenshots of the gibberish output attached]
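One way to localize gibberish like this is to run the same checkpoint through plain Hugging Face transformers: if the reference output is sensible, the bug is in the port's weight mapping or tokenizer handling rather than the checkpoint. A sketch, with the model id and prompt as examples only (GALACTICA uses the OPT architecture, so OPTForCausalLM should load it directly):

```python
import torch
from transformers import AutoTokenizer, OPTForCausalLM

tok = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

ids = tok("The Transformer architecture", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=20)
print(tok.decode(out[0]))  # sensible text here points the bug at the port
```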

oobabooga commented 1 year ago

@fgdfgfthgr-fox can you create a fork of https://github.com/FMInference/FlexGen with your changes?

fgdfgfthgr-fox commented 1 year ago

> @fgdfgfthgr-fox can you create a fork of https://github.com/FMInference/FlexGen with your changes?

@oobabooga Is this what you want? https://github.com/fgdfgfthgr-fox/FlexGen---galactica-support

Mar2ck commented 1 year ago

@Sumanai How did you get Erebus working?

Sumanai commented 1 year ago

> @Sumanai How did you get Erebus working?

You can see my quick-and-dirty edits in my repository: https://github.com/Sumanai/FlexGen/tree/erebus. I hope this code helps anyone exploring Galactica support.