BlinkDL / RWKV-LM

RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
Apache License 2.0

HumanEval benchmarks? #43

philipturner opened this issue 1 year ago

philipturner commented 1 year ago

Hi, I was wondering whether this model can achieve GPT-4 level performance on the HumanEval benchmark, a proxy for effectiveness at code generation. I'm fine if I have to train or transfer-learn, but I only have a single GPU. Massive transformers require way too much compute. I'd also like your opinion on how well the model might work in tandem with InCoder and LLaMa/Alpaca in a model ensemble. Sort of like Stable Diffusion, which has multiple different models that each specialize in a different task. Thanks!

BlinkDL commented 1 year ago

Yeah a code model finetuned from existing ones will be great :)

philipturner commented 1 year ago

Who generated the table of benchmarks posted on the README? I'd like to know if there's an easy way to evaluate new benchmarks, or if whoever generated those benchmarks is willing to run one extra. I just want to see performance of the non-fine-tuned RNN. If this is too much to ask, that's fine.
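For reference, I assume the evaluation itself would look something like the sketch below, using OpenAI's human-eval harness (https://github.com/openai/human-eval). `generate_one_completion` is just a placeholder hook for whichever model is being tested; it isn't part of the harness.

```python
# Minimal sketch of scoring a model on HumanEval with OpenAI's harness
# (https://github.com/openai/human-eval). `generate_one_completion` is a
# hypothetical hook you would wire up to RWKV or any other model.
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    # Placeholder: call the model here and return only the generated
    # function body (the text after the prompt), truncated at a stop sequence.
    raise NotImplementedError

problems = read_problems()
num_samples_per_task = 1  # use more samples (e.g. 100) to estimate pass@100

samples = [
    dict(task_id=task_id,
         completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)

# Then score the samples with the harness's CLI:
#   evaluate_functional_correctness samples.jsonl
```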

alreadydone commented 1 year ago

IIUC the Alpaca repo doesn't contain the model (because LLaMA isn't openly available except for the leak), but it does contain the code to fine-tune LLaMA so its output resembles ChatGPT's. Eliezer Yudkowsky seems to be enthusiastic about this. Is it planned to use Alpaca data to fine-tune RWKV-LM? However, the GPT-4 paper also revealed that instruction fine-tuning doesn't improve the model's performance on exams:

Note that the model’s capabilities seem to come primarily from the pre-training process—RLHF does not improve exam performance (without active effort, it actually degrades it). But steering of the model comes from the post-training process—the base model requires prompt engineering to even know that it should answer the questions.

philipturner commented 1 year ago

Thanks for that insight. I really just want GPT-4 to leak so I can run it on my personal computer. Someone reproduced Stanford's methods and open-sourced the weights with Alpaca-LoRA, and people have theorized ways to fine-tune LLaMa-30B after 4-bit quantization (I have a 32 GB GPU). The quantized model is posted here. It seems useful as a primary "reasoning model" in a model ensemble, much better than 7B. That could then delegate other processes to specialized models, like an AGI controlling ANI.

I'm not completely sure what RWKV's strong suits are, or where it might stand in the ensemble. I'm just researching all the options to make the most informed choice. For reference, I'm trying to make an AI-powered transpiler that converts massive CUDA code bases to Metal.

Is it planned to use Alpaca data to fine-tune RWKV-LM?

No idea yet. I need to see RWKV's out-of-the-box performance before knowing whether it's worthwhile.

alreadydone commented 1 year ago

Thanks a lot for the info! I didn't know about Alpaca-LoRA. More is going on around LLaMA than I realized! AFAIK, being an RNN, RWKV is less resource-intensive than transformers; and although llama.cpp and nolano.org make LLaMA run on consumer hardware, 4-bit/8-bit inference is also now available for RWKV at https://bellard.org/ts_server/, so I guess RWKV keeps this advantage (a toy sketch of why low-bit weights shrink memory is at the end of this comment).

Is it planned to use Alpaca data to fine-tune RWKV-LM?

I was actually asking the repo owner, since he's been training an instruction-finetuned model at https://github.com/BlinkDL/ChatRWKV, so it's natural to wonder if he plans to use this new dataset :)
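For intuition on the 4-bit/8-bit point above, here is a toy sketch of symmetric round-to-nearest weight quantization. It is illustrative only; it is not the blocked/grouped scheme that llama.cpp or ts_server actually use, and the tensor shape is arbitrary.

```python
# Illustrative only: symmetric round-to-nearest weight quantization, to show
# why 4-bit/8-bit weights shrink inference memory. NOT the actual scheme used
# by llama.cpp or bellard.org/ts_server (those use per-block scales).
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int):
    qmax = 2 ** (bits - 1) - 1        # e.g. 127 for int8, 7 for int4
    scale = np.abs(w).max() / qmax    # one scale per tensor, for simplicity
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)  # 4 MiB in fp32
q8, s8 = quantize_symmetric(w, bits=8)               # ~1 MiB stored as int8
q4, s4 = quantize_symmetric(w, bits=4)               # ~0.5 MiB if packed two values per byte
print("max abs error, int8:", np.abs(w - dequantize(q8, s8)).max())
print("max abs error, int4:", np.abs(w - dequantize(q4, s4)).max())
```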

philipturner commented 1 year ago

Regarding orders of magnitude, here are the resources required to train each of the high-performing codegen models (a back-of-envelope sketch follows the tables). @BlinkDL can your RNN absorb >100B tokens of knowledge on consumer hardware in a single GPU-day?

| Model | Parameters | HumanEval@1 | HumanEval@100 | MBPP@1 | Training Tokens |
|---|---|---|---|---|---|
| GPT-4 family | ~350B-2.5T | 67.0% | ~98% | - | ~22-400 trillion |
| GPT-3.5 | 175B | 48.1% | ~92% | - | ~8-20 trillion |
| LLaMa-30B | 30B | 21.7% | 70.7% | 30.2% | 1.4 trillion |
| InCoder-6.7B | 6.7B | 15.2% | 47.0% | 19.4% | ~60 billion |
| RWKV-LM | 14B | - | - | - | - |

| Model | GPUs | GPU-Hours | GPU Type | Per-GPU BW | sqrt(arithmetic intensity) |
|---|---|---|---|---|---|
| GPT-4 family | 10000 | 5-90 million | A100 | 2039 GB/s | 12.37 |
| GPT-3.5 | 10000 | 5-90 million | A100 | 2039 GB/s | 12.37 |
| LLaMa-30B | 2048 | 530,432 | A100 | 2039 GB/s | 12.37 |
| InCoder-6.7B | 248 | 142,848 | V100 | 900 GB/s | 11.79 |
| Me | 1 | 24 | M1 Max | 400 GB/s | 4.59 |
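
As a sanity check on that question, here's a rough back-of-envelope calculation using the common ~6 × parameters × tokens estimate of dense training FLOPs. The peak-throughput and utilization figures below are assumptions, not measurements, and the 6ND rule is a transformer heuristic applied here to a 14B model of any architecture.

```python
# Back-of-envelope: how many tokens can one GPU train a dense 14B model on per day,
# using the common ~6 * params * tokens estimate for training FLOPs?
# The peak-FLOPS and utilization numbers below are assumptions, not measurements.

def tokens_per_gpu_day(params: float, peak_flops: float, utilization: float) -> float:
    flops_per_day = peak_flops * utilization * 86_400  # seconds per day
    return flops_per_day / (6.0 * params)

# Assumed: A100 at ~312 TFLOPS (BF16 tensor cores), M1 Max at ~10 TFLOPS (FP16),
# both at 40% utilization, training a 14B-parameter model.
for name, peak in [("A100", 312e12), ("M1 Max", 10e12)]:
    t = tokens_per_gpu_day(params=14e9, peak_flops=peak, utilization=0.4)
    print(f"{name}: ~{t / 1e9:.3f}B tokens per GPU-day for a 14B model")
```
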
BlinkDL commented 1 year ago

RWKV needs fewer FLOPs than a GPT in both training and inference, so it's simply faster and saves VRAM.
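
For intuition on the VRAM point (this is an illustrative comparison, not the actual RWKV WKV kernel): a transformer's KV cache grows linearly with context length, while an RNN-style model carries a fixed-size state per layer. The layer/head counts below are assumed 14B-ish shapes, and the "5 state vectors per layer" figure is an approximation of what ChatRWKV keeps per block.

```python
# Illustrative comparison of inference-time state, NOT RWKV's actual WKV kernel:
# a transformer's KV cache grows with sequence length, while an RNN-style model
# keeps a fixed-size state, so long contexts don't blow up VRAM.

def kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # keys + values, per layer, per position
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

def rnn_state_bytes(n_layers: int, d_model: int,
                    state_vectors_per_layer: int = 5, bytes_per_elem: int = 2) -> int:
    # A fixed number of d_model-sized vectors per layer, independent of seq_len.
    # (ChatRWKV keeps roughly a handful of such vectors per block.)
    return n_layers * state_vectors_per_layer * d_model * bytes_per_elem

# Assumed 14B-ish shapes: 40 layers, 40 heads of size 128 (d_model = 5120), fp16.
for T in (2_048, 32_768):
    kv = kv_cache_bytes(40, 40, 128, T) / 2**20
    rnn = rnn_state_bytes(40, 5120) / 2**20
    print(f"ctx {T}: transformer KV cache ~{kv:,.0f} MiB vs RNN state ~{rnn:.1f} MiB")
```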

alreadydone commented 1 year ago

https://twitter.com/piesposi_to/status/1636780485597708290 seems useful for finetuning ChatRWKV for Chinese instruction following.

alreadydone commented 1 year ago

FYI there's this recent work that achieves 88% at HumanEval@1.

philipturner commented 1 year ago

That’s nice to hear, but Reflexion is only worth the effort if you’re using it to complement GPT-4. It can’t close the gap between LLaMA and GPT-4, though it might close the gap with GPT-3.5. The core issue is that GPT-4 has advanced reasoning capabilities limited only by its narrow-context, linear-thought architecture; Reflexion partly compensates for that architectural limitation. Other LLMs don’t have nearly enough reasoning capability to begin with.

Meanwhile, I’ve realized I can accomplish my original goal with a different approach. Bing GPT-4 can plan out how to tackle a hard programming task, removing the psychological barrier to doing the task myself. That’s helped me revive tedious projects like FP64 emulation, and it could very well let me port medium-sized CUDA code bases (e.g. oxDNA). The AI does not need to scan all the code files, which was my original idea with a locally hosted LLM.

We live in a very different world than one month ago…