Improve quickstart guide

ServiceNow / Fast-LLM

Accelerating your LLM training to full speed

✨ Description

Some changes to make the tutorial easier to run (WIP). The goal of the quick start should be to allow running something as fast as possible, and I'm making sure it's the case.

Simplify the docker guide by starting a docker container right away and running the rest locally.

Rewrite the "Local environment" guide and merge with docker.

Rework the experiment path: Avoid messing up the user's home, use a local fast_llm_tutorial instead, and mount to /app/fast_llm_tutorial so paths are the same in every environment (not totally sure about kubernetes).

[WIP] Make separate tabs for a trial run with a tiny dataset and simpler config (gets users running faster, which is what most users want), and a full-scale run with the big dataset and full config. Could instead just show the trial run and make a separate section for the full-scale run?

Drop wandb by default?

[TODO] Fix inconsistent config file name (train-config.yaml, fast-llm-config.yaml)

Also removing markdownlint, at least for now, because it's too annoying. It complains about existing files and doesn't auto-fix errors like the other pre-commit hooks.

🔍 Type of change

Select all that apply:

[ ] 🐛 Bug fix (non-breaking change that addresses a specific issue)

[ ] 🚀 New feature (non-breaking change that adds functionality)

[ ] ⚠️ Breaking change (a change that could affect existing functionality)

[ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)

[ ] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)

[ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)

[x] 📝 Documentation change (updates documentation, including new content or typo fixes)

[ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

Hi @jlamypoirier,

Thanks for going through the tutorial. I can see you are putting a lot of thought into improving it, but I have to push back on some of the proposed changes.

First, merging the Docker and local environment tabs into one doesn't feel right. These are distinct use cases, and combining them introduces unnecessary complexity and confusion for users. The Docker and local environment guides are already extremely simple. By merging these tabs, we will do both user groups a disservice. We should instead directly support an interactive Toolkit workflow by adding a fifth tab for it.

Second, changing the folder structure to combine inputs and outputs isn't ideal. Keeping inputs and outputs separate is a good practice, and it's how our workflows are designed. Adopting this structure early in the tutorial teaches users the right habits. Consolidating them offers no meaningful benefit and creates unnecessary churn in the documentation.

I also noticed you're suggesting running all commands within Docker. That's problematic. If a user creates folders or files in the Docker container without mounting volumes, those changes are lost when the container shuts down. Moreover, working entirely within Docker restricts users to text editors inside the container, whereas the current guide allows them to use any tools they're comfortable with outside Docker. This change doesn't add value compared to the current setup.
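
To make the persistence concern concrete, here is a minimal sketch of a Docker session with a bind mount, so that files written in the container land on the host and stay editable with host tools. The image name, the `--gpus all` flag, and the `RUN` switch are assumptions for illustration, not taken from the guide; the host directory and the `/app/fast_llm_tutorial` mount point follow the PR description.

```shell
#!/bin/sh
# Sketch: bind-mount a host directory so container writes persist on the host.
set -eu

# Assumption: image name is illustrative; override with IMAGE=... if it differs.
IMAGE="${IMAGE:-ghcr.io/servicenow/fast-llm:latest}"
# Host directory from the PR description; created next to where the script runs.
HOST_DIR="${HOST_DIR:-$PWD/fast_llm_tutorial}"
# Mount point from the PR description.
CONTAINER_DIR="/app/fast_llm_tutorial"

# The host directory must exist before it is mounted.
mkdir -p "$HOST_DIR"

if [ "${RUN:-0}" = "1" ]; then
  # Anything written under $CONTAINER_DIR appears in $HOST_DIR on the host.
  docker run --rm -it --gpus all -v "$HOST_DIR:$CONTAINER_DIR" "$IMAGE" bash
else
  # Dry run (default): print the command instead of starting a container.
  echo "docker run --rm -it --gpus all -v \"$HOST_DIR:$CONTAINER_DIR\" $IMAGE bash"
fi
```

By default the script only prints the command; setting `RUN=1` starts the container. Files created outside the mounted directory are still lost when the container exits.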

You might not have realized that the tabs are interlocked. When a user selects a tab (e.g., Docker), all other sections automatically switch to that tab as well, making the guide cohesive for each use case. If we have different tabs for each section of the guide then this behaviour is broken, and that makes the guide unnecessarily clunky. The current tab design works and is consistent. Let's keep Docker, local environment, Slurm, Kubernetes, and (new) Toolkit as separate, clearly-defined tabs throughout.

That said, I do like some of your changes. The refinements to the local environment installation instructions are helpful, and surfacing the option to use a truncated dataset earlier in the guide is a good idea. But that doesn't require a separate tab. A simple note before the config YAML preparation section to indicate that a different dataset path can be used for quicker results is enough.

To sum up:

- Keep Docker and local environment tabs separate.
- Add a separate tab for Toolkit.
- Revert the folder structure changes.
- Don't force everything into Docker in the Docker workflow.

Let's discuss this in person.

To summarize, here are some of the issues that break the tutorial and/or make it more complicated than needed. I tried my best to fix them. I'm not fully committed to my proposed solutions, but these issues need to be addressed one way or another.

Concerning the environment tabs:

- The first tab should be as simple as possible. Right now it looks more complicated than necessary because of all the docker commands.
- The distinction between the environment tabs isn't clear-cut.
  - Both "docker" and "local environment" are actually the same setting (local machine, ssh to a workstation, or interactive job); the difference should be whether the user has access to docker.
  - Slurm and kubernetes are also docker.
  - I expect the interactive job case (toolkit or not) to be the most common one, so I'd want the first tab to work as is for it.
- Note: we don't strictly need to consolidate the tabs, but the changes I propose make some tabs completely identical, so the distinction is unnecessary.
- (Not fixed) The "initial setup" tab makes things look like a choice. It will almost always be dictated by the user's environment.

Concerning the directory structures: we need to simplify things.

- Using ~/inputs and ~/results is dangerous because it messes up the user's home. (Also, mkdir ~/inputs ~/results didn't work in the toolkit job.)
- Having two mount points makes things more complicated than necessary. Let's not force a structure users don't care about at this point and will have to change anyway in a real use case.
- The paths are inconsistent between environments. They sometimes go through ~/... or ~/mnt/..., which complicates things.
- The config files don't work for the "Local environment" tab because of the above inconsistency.
- Note: In my proposed changes I kept the input/output separation (with some renaming, and separating the pretrained model from the dataset, which should be a good thing). But I put them within the same directory to avoid messing up the user's environment.
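
A minimal sketch of the single-directory layout described in the note above, assuming hypothetical subdirectory names (`dataset`, `pretrained_model`, `experiment`); the PR's actual names may differ. Everything lives under a local `fast_llm_tutorial` directory instead of the user's home.

```shell
#!/bin/sh
# Sketch: keep inputs and outputs separate, but under one local directory.
set -eu

# Root directory from the PR description; relative to the current directory.
ROOT="${ROOT:-$PWD/fast_llm_tutorial}"

# Hypothetical subdirectory names: two inputs (dataset, pretrained model)
# and one output location (experiment).
mkdir -p "$ROOT/dataset" "$ROOT/pretrained_model" "$ROOT/experiment"

# List the resulting layout; nothing under $HOME is touched unless
# the script is run from there.
for d in dataset pretrained_model experiment; do
  echo "$ROOT/$d"
done
```

If the container's working directory is `/app` (an assumption) and this directory is mounted at `/app/fast_llm_tutorial`, the relative path `fast_llm_tutorial/...` resolves the same way on the host and in the container, so one config file could serve both.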

Concerning the trial run vs full run:

- Nobody wants to wait 2 hours for a run just to check out a training framework. The tiny dataset needs to be the default.
- Similarly, nobody will actually train for 600k iterations on the first try. The current config makes data sampling really long at the beginning of training and prevents quick results (e.g., the first export is at 20k iterations). We need the default config to run fast.
- The default wandb config won't work for anyone (skipping the optional step is not an option), so it's better to drop it from the config and show how to add it in the optional step.
- (Not fixed) We need instructions (tip?) for Volta.

Improve quickstart guide #49