NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

Support for hugging face GPTBigCode model #603

Open jiaozaer opened 1 year ago

jiaozaer commented 1 year ago

Hello! I want to convert StarCoder to the FasterTransformer format for inference. Here is the link: https://huggingface.co/bigcode/starcoder. This model uses the GPTBigCode architecture: https://huggingface.co/docs/transformers/v4.28.1/en/model_doc/gpt_bigcode. Is converting GPTBigCode checkpoints to the FasterTransformer format currently unsupported?

huyphan168 commented 1 year ago

As far as I know, StarCoder uses multi-query attention, which is not currently supported in FasterTransformer. Anyone who wants to convert GPTBigCode to FT needs to handle this.

starlitsky2010 commented 1 year ago

Is there any plan to support StarCoder? I'm also trying to convert it to an FT model.

Xingxiangrui commented 1 year ago

I plan to do it. Did you get started? How is it going now?

EarthXP commented 1 year ago

> I plan to do it. Did you get started? How is it going now?

Any progress?

Xingxiangrui commented 1 year ago

> > I plan to do it. Did you get started? How is it going now?
>
> Any progress?

The support work for MQA has been completed, and FT acceleration for Wizard has also been completed. For information security reasons I cannot provide the source code, but I can provide the reproduction steps for supporting MQA.

Both MQA and GQA can reuse the MHA structure: just copy the K/V weights in W_QKV n_head times. This sentence is the key point, so please understand it carefully. In particular, compare the torch MQA code with the torch MHA code in detail to understand why MQA can reuse the MHA structure.

By reusing the MHA structure, you do not need to modify any FT C++ source code or write CUDA from scratch; converting the weights correctly is enough.
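A minimal PyTorch sketch (not from the thread, just an illustration of the claim above): attention with a single shared K/V head (MQA) produces exactly the same output as standard MHA in which that K/V head has been copied n_head times. Shapes and sizes are arbitrary example values.

```python
import torch

n_head, head_dim, seq_len = 8, 64, 16
q = torch.randn(n_head, seq_len, head_dim)   # per-head queries
k = torch.randn(1, seq_len, head_dim)        # single shared key head (MQA)
v = torch.randn(1, seq_len, head_dim)        # single shared value head (MQA)

def attn(q, k, v):
    scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# MQA: every query head attends to the same K/V (broadcast over the head dim).
out_mqa = attn(q, k, v)

# "MHA view" of MQA: replicate the shared K/V head n_head times, then run
# ordinary multi-head attention. The outputs match exactly.
out_mha = attn(q, k.expand(n_head, -1, -1), v.expand(n_head, -1, -1))

print(torch.allclose(out_mqa, out_mha))  # True
```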

Follow the steps below to implement FT support for MQA. It can be completed in about 3 weeks, or in 1 to 2 weeks if you are proficient in FT.

1. Use SantaCoder for debugging, because it is relatively small (only 1.1B parameters). With a 16B model, loading the weights once takes a very long time, so debugging is much slower than with the 1.1B model.
2. SantaCoder-MHA is aligned with the GPT-2 structure and can quickly be aligned with the existing FT implementation. Implement this first.
3. Then handle SantaCoder-MQA. As mentioned above, understand the structure and copy the K/V weights n_head times (a weight-expansion sketch follows below). Make sure SantaCoder-MQA's FT output is aligned with torch.
4. At this point you have mastered the implementation steps for MQA, and you can convert and align a larger model.

Note that the entire process does not require rewriting the FT C++ code.
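A hedged sketch of the weight expansion mentioned in step 3. It assumes the checkpoint stores a fused QKV projection as an nn.Linear weight of shape [(n_head + 2) * head_dim, hidden], with the Q rows first and the single shared K and V heads last; the exact layout varies between checkpoints, so verify it against the model's attention code before using this. The helper name is illustrative, not from FT.

```python
import torch

def mqa_qkv_to_mha_qkv(w_qkv: torch.Tensor, n_head: int, head_dim: int) -> torch.Tensor:
    """Expand an MQA fused QKV weight [(n_head+2)*head_dim, hidden] into an
    MHA-layout weight [3*n_head*head_dim, hidden] by tiling the shared K/V head."""
    q_dim = n_head * head_dim
    w_q = w_qkv[:q_dim]                      # [n_head*head_dim, hidden]
    w_k = w_qkv[q_dim:q_dim + head_dim]      # [head_dim, hidden], shared by all heads
    w_v = w_qkv[q_dim + head_dim:]           # [head_dim, hidden], shared by all heads
    # Copy the shared K/V projections once per query head so the result
    # looks like an ordinary MHA QKV weight that FT already understands.
    return torch.cat([w_q, w_k.repeat(n_head, 1), w_v.repeat(n_head, 1)], dim=0)
```

The same expansion has to be applied to the QKV bias (if present), and the expanded weights can then go through the existing GPT-style FT weight converter unchanged.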

If you have any questions, please do not hesitate to leave a comment.

Xingxiangrui commented 1 year ago

The newest LMDeploy (from the InternLM team) can also accelerate GQA, and it supports GPTQ 4-bit quantization: https://github.com/InternLM/lmdeploy#introduction. LMDeploy seems more efficient than FT.