Closed ryusaeba closed 2 months ago
Hi!
@disclaimer1 I am not an author, but regularly talk with the first two authors.
@disclaimer2 You edited the message several times while I was writing this answer (not complaining, thanks for the clarification). Below, I am quoting you on some of the words you said in the earlier versions of the message, before you edited it.
GPTQ-version of PV tuning
To the best of my knowledge, the authors do plan to release it, but it is not the top priority right now. The GPTQ experiments used a very hacky version of the code that manually partitioned Hessians between devices and only works for llama-2 7B with that specific 2-bit configuration (on 8x A100, and very inefficiently). To make this worth releasing, we'll need to upgrade the code so it actually runs outside of our specific case.
In the previous version of the message (IIRC), you asked about adapting the current VQ code for GPTQ. There is one caveat: during the V step, the current code runs a beam search that considers all codes. This is very wasteful for GPTQ: instead, you obtain the new codes by rounding the update to the nearest integer, after taking the scale and zero-point into account. Aside from that, the rest of the code should work.
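To make the caveat concrete, here is a minimal sketch of what the V step could look like for a GPTQ-style uniform quantizer. The function name and signature are hypothetical (not from the repo); it assumes a simple per-tensor scale and zero-point, whereas the real code would likely handle per-group parameters:

```python
import torch

def gptq_v_step(weight: torch.Tensor, scale: float, zero_point: float, n_bits: int = 2):
    """Hypothetical V step for a GPTQ-style (uniform) quantizer.

    Instead of the beam search used for vector quantization, the new
    codes are obtained by rounding the scaled, shifted weights to the
    nearest integer and clamping to the representable range.
    """
    levels = 2 ** n_bits - 1  # e.g. 2-bit -> codes in {0, 1, 2, 3}
    codes = torch.clamp(torch.round(weight / scale + zero_point), 0, levels)
    dequantized = (codes - zero_point) * scale  # weights the model will actually use
    return codes.to(torch.int64), dequantized
```

This replaces the expensive search over all codes with an O(1) rounding per weight, which is exactly why reusing the VQ beam search for GPTQ would be wasteful.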
maximum tokens is 10T?
In theory, the training script would indeed stop after processing roughly 10T (EDIT: actually 10B, see below) tokens. However, all llama models either converged before completing 2 full epochs or showed very small improvements after 2 epochs (within 0.03 PPL on wiki2), and mistral / phi models require 2-4 epochs to converge.
If you want to get the best quality (as opposed to reproducing our exact setup), I'd recommend that you explore training on more data rather than repeating the same sample multiple times. In the paper, the authors had to use that sample to make a fair comparison with other works that used it.
Instruct-version model
To the best of my knowledge, the only instruct version released in the PV paper is Phi 3 instruct. This model was fine-tuned on a sample of RedPajama data for fair comparison.
However, @Godofnothing also recently released a llama-3 70B instruct that is not included in the PV paper. I don't know the exact dataset that he used, but IIRC he was experimenting with Cosmopedia and SlimOrca. I don't know the exact answer, but it is likely one of them.
@Godofnothing if you're available, please educate us :)
You edited the message several times while I was writing this answer (not complaining)
Sorry about that. I should have sent the message only after confirming all my questions were valid. I'm happy with and appreciate your detailed response.
To make this worth releasing, we'll need to upgrade the code so it actually runs outside of our specific case.
Sounds great. Looking forward to your valuable release.
About 10T:
After checking, you are actually using 10B, instead of 10T. I didn't previously notice that the dataset you are using has the suffix Sample.
Redpajamas on Phi 3 instruct
When applying a pretraining dataset to an instruct model, did you see obvious quality degradation in chat?
llama-3 70B instruct that is not included in the PV paper. I don't know the exact dataset that he used, but IIRC he was experimenting with Cosmopedia and SlimOrca. I don't know the exact answer, but it is likely one of them.
It would be nice to see @Godofnothing's response.
Thank you again :)
When applying a pretraining dataset to an instruct model, did you see obvious quality degradation in chat?
To see degradation, one needs a comparison against something. To the best of my knowledge, the authors currently do not have such an alternative for Phi 3 models (e.g., they have not trained on Cosmopedia yet). For Llama 3, @Godofnothing knows best.
10B instead of 10T
Yes, good catch :) The entire RedPajama dataset is just 1T tokens; the PV-Tuning paper uses a 1B-token sample for calibration.
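As a quick sanity check on the numbers quoted above: a 10B-token budget over a 1B-token calibration sample allows at most 10 passes over the data, comfortably more than the 2-4 epochs the models reportedly need to converge.

```python
token_budget = 10_000_000_000  # training stops after roughly 10B tokens (not 10T)
sample_size = 1_000_000_000    # ~1B-token RedPajama calibration sample
max_epochs = token_budget // sample_size
print(max_epochs)  # 10 passes over the sample, more than the 2-4 needed
```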
However, all llama models either converged before completing 2 full epochs or showed very small improvements after 2 epochs (within 0.03 PPL on wiki2), and mistral / phi models require 2-4 epochs to converge.
Achieving these results within 2 full epochs is pretty promising.
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
PV tuning achieves very impressive results even with GPTQ. Is there any plan to release the GPTQ version of PV tuning? If not, should we modify it ourselves based on finetune_fsdp.py?
The other question is whether your released models cover instruct versions. May we know what dataset was used for the non-pretrained (instruct) models?