Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

`pretrain` vs `finetune_full` #1603

Closed · fdalvi closed this issue 1 month ago

fdalvi commented 1 month ago

Hello,

I was wondering what the motivation is behind `pretrain` vs `finetune_full`; conceptually they are quite similar, but at the moment there are some key (seemingly artificial) differences:

There seem to be other small differences as well as I walk through the code, so I just wanted to understand the motivation and see whether there is a "correct" time to use one or the other.

Thanks!

rasbt commented 1 month ago

Hi there, these are good questions. Off the top of my head, the major usage difference is the dataset. The `finetune_*` scripts are mainly designed for instruction finetuning. (I wanted to name them accordingly, but I remember that was not a popular opinion, and it was also a bit late in development since we already had these names.)

So, in other words, the data format is a bit different, and with that the scale as well. In the finetuning scripts the data is small enough to fit into memory, whereas `pretrain` is designed to handle much larger datasets (here, raw data that doesn't come in the instruction-response format).
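
To make that concrete, here is a minimal, hypothetical sketch of the two data shapes (the field names and helpers are illustrative, not litgpt's actual schema or loaders):

```python
# Illustrative only -- not litgpt's actual data loaders or schema.

# finetune_full: a small set of instruction-response records, loaded fully
# into memory and iterated over for a fixed number of epochs.
finetune_examples = [
    {
        "instruction": "Summarize the text below in one sentence.",
        "input": "LitGPT provides recipes to pretrain, finetune, and deploy LLMs.",
        "output": "LitGPT bundles training and deployment recipes for LLMs.",
    },
    # ... typically thousands of records, small enough to fit in RAM
]

# pretrain: raw text streamed from disk, tokenized and packed into
# fixed-length blocks; the corpus is usually too large to hold in memory.
def stream_raw_text(paths):
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line  # downstream: tokenize and pack into training blocks
```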

I think differences like `max_tokens` vs. epochs originally come from the fact that we have discrete training examples (instruction-response pairs) in the `finetune_*` scripts. In regular pretraining, where we have raw text, it's easier to work with a max-tokens budget (which is common in the literature).
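
As a rough illustration of why the stopping criteria differ, here is a hypothetical loop sketch (the helpers are toy stand-ins, not litgpt functions):

```python
# Hypothetical sketch of the two stopping criteria described above;
# the helpers below are toy stand-ins, not litgpt functions.

def train_step(batch):
    pass  # stand-in for forward/backward/optimizer step

# Finetuning: the dataset is enumerable, so counting epochs is natural.
finetune_batches = [["tokenized instruction-response pair"]] * 8
num_epochs = 3
for epoch in range(num_epochs):
    for batch in finetune_batches:
        train_step(batch)

# Pretraining: the raw-text stream is effectively unbounded, so a token
# budget (max_tokens) is the natural progress measure instead of epochs.
def pretrain_stream():
    while True:
        yield list(range(512))  # a packed block of 512 token ids

max_tokens = 10_000
tokens_seen = 0
for batch in pretrain_stream():
    train_step(batch)
    tokens_seen += len(batch)
    if tokens_seen >= max_tokens:
        break
```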

I hope this helps as a start! :)

fdalvi commented 1 month ago

Hi @rasbt,

Thanks for the quick reply, that makes a lot of sense! Given that the primary difference is the data, would it be better to have a shared codebase for all the model-related parts (e.g. model loading, the training loop, sharding strategies, etc.)?

Best,

rasbt commented 1 month ago

That's a fair point, but this repo follows the philosophy that some code duplication isn't bad if it helps with readability. Too much refactoring and code sharing can introduce a lot of complexity when you want to read and modify the code. I.e., the code should remain simple enough that you can tweak certain things for custom research projects. Of course, there is never a clear line to draw ...

Anyways, thanks for sharing your feedback here!

rasbt commented 1 month ago

Closing to clean up the issues a bit. But please feel free to respond or reopen in case you have additional questions.

fdalvi commented 1 month ago

I appreciate your perspective. If it's okay, I'll open a PR sometime soon that brings the two codebases closer together where applicable (FSDP settings, model loading, perhaps a few more things); we can of course discuss which of these are worth merging!