Vahe1994 / AQLM

Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852)
Apache License 2.0

PV-tuning based on GPTQ #99

Closed ryusaeba closed 2 months ago

ryusaeba commented 4 months ago

PV-tuning achieves very impressive results even on GPTQ. Is there any plan to release the GPTQ version of PV-tuning? If not, should we modify it ourselves based on finetune_fsdp.py?

The other question: your released models cover instruct-version models. May we know what dataset was used for the non-pretrained models?

justheuristic commented 4 months ago

Hi!

Disclaimer 1: I am not an author, but I regularly talk with the first two authors.

Disclaimer 2: You edited the message several times while I was writing this answer (not complaining, thanks for the clarification). Below, I quote some of the wording from earlier versions of your message, before you edited it.

GPTQ-version of PV tuning

To the best of my knowledge, the authors do plan to release it, but it is not the top priority right now. The GPTQ experiments used a very hacky version of the code that manually partitioned Hessians between devices and only works for Llama-2 7B with that specific 2-bit configuration (on 8x A100, and very inefficiently). To make this worth releasing, we'll need to upgrade the code so it actually runs outside of our specific case.

In the previous version of the message (IIRC), you asked about adapting the current VQ code for GPTQ. There is one caveat to this: during the V step, the current code runs a beam search that considers all codes. This would be very wasteful for GPTQ: instead, you can obtain the new codes by rounding the updated weights to the nearest integer, after taking the scale and zero-point into account. Aside from that, the rest of the code should work.
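
To make the rounding-based V step above concrete, here is a minimal PyTorch sketch. This is not the repository's actual code; the function name, the per-row `scale`/`zero_point` layout, and the 2-bit default are assumptions made purely for illustration.

```python
import torch

def gptq_v_step(updated_weight: torch.Tensor,
                scale: torch.Tensor,
                zero_point: torch.Tensor,
                n_bits: int = 2) -> torch.Tensor:
    """Round updated weights back onto a uniform GPTQ-style grid.

    Unlike the vector-quantized case, no beam search is needed: each weight
    independently snaps to the nearest integer level once the scale and
    zero-point are taken into account.
    """
    # Map real-valued weights onto integer grid coordinates.
    q = torch.round(updated_weight / scale + zero_point)
    # Clamp to the representable range for the chosen bit-width.
    q = q.clamp(0, 2 ** n_bits - 1)
    # Dequantize back to the real-valued weights the model will use.
    return (q - zero_point) * scale


# Toy usage: one 4x8 weight tile with a per-row scale and zero-point.
w = torch.randn(4, 8) * 0.1
scale = torch.full((4, 1), 0.05)
zero_point = torch.full((4, 1), 1.5)
print(gptq_v_step(w, scale, zero_point))
```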

maximum tokens is 10T?

In theory, the training script would indeed stop after processing roughly 10T (EDIT: actually 10B, see below) tokens. However, all Llama models either converged before completing 2 full epochs or showed very small improvements after 2 epochs (within 0.03 PPL on wiki2), and Mistral / Phi models require 2-4 epochs to converge.

If you want to get the best quality (as opposed to reproducing our exact setup), I'd recommend exploring training on more data rather than repeating the same sample multiple times. In the paper, the authors had to use that sample to make a fair comparison with other works that used it.
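
For illustration only, here is a tiny self-contained sketch of the stopping rule described above: iterate over the same calibration sample for several epochs, but halt once a fixed token budget is reached. All names are hypothetical and the actual fine-tuning step is elided.

```python
from typing import Sequence

def train_until_budget(batch_token_counts: Sequence[int],
                       max_epochs: int = 4,
                       token_budget: int = 10_000_000_000) -> int:
    """Repeat the calibration sample for up to `max_epochs`, stopping early
    once `token_budget` tokens (the ~10B mentioned above) have been seen.
    Returns the number of tokens actually processed."""
    seen = 0
    for _ in range(max_epochs):
        for n_tokens in batch_token_counts:
            # ... run one fine-tuning step on this batch here ...
            seen += n_tokens
            if seen >= token_budget:
                return seen
    return seen

# Toy usage: 1,000 batches of 8 sequences x 4,096 tokens, up to 4 epochs.
print(train_until_budget([8 * 4096] * 1000, max_epochs=4))
```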

Instruct-version model

To the best of my knowledge, the only instruct version released in the PV paper is Phi 3 instruct. This model was fine-tuned on a sample of RedPajama data for fair comparison.

However, @Godofnothing also recently released a Llama-3 70B instruct model that is not included in the PV paper. I don't know the exact dataset he used, but IIRC he was experimenting with Cosmopedia and SlimOrca, so it is likely one of them.

@Godofnothing if you're available, please educate us :)

ryusaeba commented 4 months ago

You edited the message several times while I was writing this answer (not complaining)

Sorry about that. I should have sent the message only after confirming that all my questions were valid. I'm happy with and appreciate your detailed response.

To make this worth releasing, we'll need to upgrade the code so it actually runs outside of our specific case.

Sounds great. I will look forward to your release.

About 10T,

After checking, you are actually using 10B, instead of 10T. I previously didn't notice that the dataset you are using has the suffix Sample.

RedPajama on Phi 3 instruct

When applying a pretraining dataset to an instruct model, did you see obvious quality degradation in chat?

Llama-3 70B instruct model that is not included in the PV paper. I don't know the exact dataset he used, but IIRC he was experimenting with Cosmopedia and SlimOrca, so it is likely one of them.

It would be nice to see @Godofnothing's response.

Thank you again :)

justheuristic commented 4 months ago

When applying a pretraining dataset to an instruct model, did you see obvious quality degradation in chat?

To see degradation, one needs a comparison against something. To the best of my knowledge, the authors currently do not have such an alternative for Phi 3 models (e.g. they did not train on Cosmopedia yet). For Llama 3, @Godofnothing knows best.

10B instead of 10T

Yes, good catch :) The entire RedPajama dataset is just 1T tokens; the PV-Tuning paper uses a 1B sample for calibration.

ryusaeba commented 4 months ago

However, all Llama models either converged before completing 2 full epochs or showed very small improvements after 2 epochs (within 0.03 PPL on wiki2), and Mistral / Phi models require 2-4 epochs to converge.

Achieving these results within 2 full epochs is pretty promising.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.