ServiceNow / Fast-LLM

Accelerating your LLM training to full speed
https://servicenow.github.io/Fast-LLM/

Roadmap #27

Open jlamypoirier opened 3 weeks ago

jlamypoirier commented 3 weeks ago

This is a tentative roadmap for major improvements to Fast-LLM. It includes big features and potential breaking changes, but excludes minor features and additions.

It goes in several parts, with the following milestones:

- v0.1 (2024-10-11): First open-source version.
- v0.2 (~Q4 2024): Follow-up to address technical debt on checkpoints and configs, with several breaking changes held back until this point.
- v0.3 (~Q1 2025): Further generalization to enable other models, e.g. multimodal, with limited breaking changes.

Config and checkpoints (v0.2)

Structured configuration (done https://github.com/ServiceNow/Fast-LLM-Internal/pull/304, https://github.com/ServiceNow/Fast-LLM-Internal/pull/308, https://github.com/ServiceNow/Fast-LLM-Internal/pull/315, https://github.com/ServiceNow/Fast-LLM-Internal/pull/316)

Replace the flat argparse format with a nested one using a . separator, and allow configuring from a YAML file.
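
As a rough illustration of what the nested format enables (the keys below are made up, not the actual Fast-LLM schema), a YAML file can be loaded into a nested dict and dotted overrides applied on top:

```python
# Minimal sketch of nested configuration from YAML; keys are illustrative only.
import yaml  # PyYAML

yaml_text = """
model:
  hidden_size: 1024
  num_layers: 24
pretrained:
  checkpoint_path: /path/to/checkpoint
"""

config = yaml.safe_load(yaml_text)

# Nested access replaces flat argparse flags.
print(config["model"]["hidden_size"])          # 1024
print(config["pretrained"]["checkpoint_path"])

def apply_override(cfg: dict, dotted_key: str, value):
    """Apply a dotted command-line override such as model.hidden_size=2048."""
    keys = dotted_key.split(".")
    node = cfg
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value

apply_override(config, "model.hidden_size", 2048)
```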

Rename config parameters (partially done #1, #6, etc.)

With structured config, many parameter names become redundant, ex: pretrained.pretrained_checkpoint_path can be unambiguously simplified to pretrained.checkpoint_path. This is also a good opportunity to clean things up, make names more consistent, etc.

Checkpoint improvements (partially done https://github.com/ServiceNow/Fast-LLM-Internal/pull/308 #6 #18 #22)

We need to make some breaking changes to the checkpoint format:

Config metadata (partially done https://github.com/ServiceNow/Fast-LLM-Internal/pull/291)

Open-sourcing follow-up

Documentation, publications, benchmarks, etc.

Model generalization (v0.3)

Enable custom models and trainers (done https://github.com/ServiceNow/Fast-LLM-Internal/pull/319)

Generalize data #25

The data class is currently hard-coded to a GPT dataset. We need to make it more easily adaptable to other data formats and data loading schemes.
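
One possible shape for such an abstraction, sketched here with hypothetical names (AbstractData and its method signatures are illustrative, not the actual Fast-LLM interface):

```python
# Hypothetical sketch of a data abstraction that is not tied to a GPT dataset.
import abc
import typing


class AbstractData(abc.ABC):
    """Owns dataset construction and iteration for one training run."""

    @abc.abstractmethod
    def setup(self, distributed_config: dict, samples_per_phase: dict) -> None:
        """Build or load the underlying datasets for each phase (train/valid/test)."""

    @abc.abstractmethod
    def get_iterator(self, phase: str, batch_size: int) -> typing.Iterator:
        """Return an iterator over batches for the given phase."""


# A GPT-style implementation would subclass this and keep tokenized, memory-mapped
# datasets as an internal detail, while e.g. a multimodal implementation could
# return image-text batches from the same interface.
```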

Implement a non-trivial second model example

We want to demonstrate generalizability with another model, e.g. by wrapping a PyTorch and/or Hugging Face model as a (poorly optimized) Fast-LLM model.
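
A minimal sketch of what such a wrapper could look like, assuming a generic module interface that simply returns a loss (the wrapper class and its signature are hypothetical; only the transformers calls are real):

```python
# Illustrative sketch: wrapping a Hugging Face causal LM so it can be driven by a
# generic trainer. Not the actual Fast-LLM model API.
import torch
from transformers import AutoModelForCausalLM


class WrappedCausalLM(torch.nn.Module):
    def __init__(self, name_or_path: str = "gpt2"):
        super().__init__()
        # The whole model runs as a single opaque block, so this forfeits
        # Fast-LLM's parallelism and fused kernels (hence "poorly optimized").
        self.model = AutoModelForCausalLM.from_pretrained(name_or_path)

    def forward(self, input_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        output = self.model(input_ids=input_ids, labels=labels)
        return output.loss
```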

Generalize/rethink batch config and schedule #115

Generalize trainer and metric logging #115

There is already a generic trainer class, but it still has some non-generic components, especially when it comes to logging.
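
For illustration, logging could sit behind a small protocol that the generic trainer depends on; the names below are hypothetical, not the actual Fast-LLM API:

```python
# Hypothetical sketch of a model-agnostic metric logging interface. The generic
# trainer would only depend on the protocol; each model/trainer variant decides
# which metrics to emit.
import typing


class MetricLogger(typing.Protocol):
    def log(self, step: int, metrics: dict) -> None: ...


class ConsoleLogger:
    def log(self, step: int, metrics: dict) -> None:
        formatted = ", ".join(f"{name}={value:.4g}" for name, value in metrics.items())
        print(f"step {step}: {formatted}")
```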

Developer documentation for adding a new model/feature (partially done https://github.com/ServiceNow/Fast-LLM-Internal/pull/319)

Long-term features (v0.4+)

These would be great additions, but are not yet on a clear roadmap.

Document Fast-LLM best practices for performance

Implement staged training

Generalize optimizer

We are mostly hard-coded to Adam.
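
One possible way to decouple the optimizer choice from the trainer is a small registry over standard PyTorch optimizers; the names and structure below are illustrative only:

```python
# Hypothetical sketch of a config-driven optimizer factory.
import torch

_OPTIMIZERS = {
    "adam": torch.optim.AdamW,
    "sgd": torch.optim.SGD,
}

def build_optimizer(params, name: str = "adam", **kwargs) -> torch.optim.Optimizer:
    """Build an optimizer from a config name, e.g. build_optimizer(p, "sgd", lr=0.1)."""
    return _OPTIMIZERS[name](params, **kwargs)
```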

Generalize Schedule

Triton optimizer

A Triton implementation of multi-tensor Adam would give a small performance boost, avoid some explicit CPU-GPU synchronizations and remove our dependence on third-party kernels. The multi-tensor part could be challenging.
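
For reference, a single-tensor Adam step is straightforward to express in Triton; the sketch below is illustrative only and deliberately skips the hard part, which is fusing the update for many tensors of different shapes into a single launch:

```python
# Illustrative single-tensor Adam update in Triton (bias corrections precomputed on
# the host so the kernel stays purely element-wise). Assumes contiguous fp32
# tensors on the GPU. Multi-tensor fusion is not shown.
import torch
import triton
import triton.language as tl


@triton.jit
def adam_step_kernel(p_ptr, g_ptr, m_ptr, v_ptr,
                     lr, beta1, beta2, eps, bias_corr1, bias_corr2,
                     n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    p = tl.load(p_ptr + offsets, mask=mask)
    g = tl.load(g_ptr + offsets, mask=mask)
    m = tl.load(m_ptr + offsets, mask=mask)
    v = tl.load(v_ptr + offsets, mask=mask)
    # Standard Adam moment updates and bias-corrected step.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    update = (m / bias_corr1) / (tl.sqrt(v / bias_corr2) + eps)
    tl.store(m_ptr + offsets, m, mask=mask)
    tl.store(v_ptr + offsets, v, mask=mask)
    tl.store(p_ptr + offsets, p - lr * update, mask=mask)


def adam_step(p, g, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    n = p.numel()
    grid = (triton.cdiv(n, 1024),)
    adam_step_kernel[grid](p, g, m, v, lr, beta1, beta2, eps,
                           1.0 - beta1 ** step, 1.0 - beta2 ** step,
                           n, BLOCK_SIZE=1024)
```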

Optimize for inference

Support non-nvidia GPUs

Blockers: distributed (NCCL), the Apex optimizer, and possibly FlashAttention. Triton and PyTorch kernels should be OK, but this needs to be verified.

Technical debt (v0.x)

These will eventually cause trouble but aren't urgent yet; they are listed here to indicate what is likely to change in the future.

Rework logging

Rework Distributed

Rework Run (partially done #1)

We probably want to get rid of most of it.

Factor out core

This legacy module doesn't really make sense anymore.

Refactor functional

The distinction between Triton and other implementations is no longer relevant.

Rethink the model input_, kwargs

The unstructured format of the model input (ex. Layer.forward(self, input_: torch.Tensor, kwargs: dict, ...)) is already confusing and error-prone, and things will keep getting worse. We'll want to add more structure.
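
One option, sketched here with hypothetical field names, is to replace the (input_, kwargs) pair with a typed batch object:

```python
# Hypothetical sketch of a structured forward input; field names are illustrative.
import dataclasses
import typing
import torch


@dataclasses.dataclass
class ForwardInput:
    hidden_states: torch.Tensor
    attention_mask: typing.Optional[torch.Tensor] = None
    position_ids: typing.Optional[torch.Tensor] = None
    # Model-specific entries keep a single, well-known escape hatch.
    extras: dict = dataclasses.field(default_factory=dict)


class Layer(torch.nn.Module):
    def forward(self, batch: ForwardInput) -> ForwardInput:
        # Typed attribute access replaces string-keyed kwargs lookups.
        batch.hidden_states = torch.nn.functional.gelu(batch.hidden_states)
        return batch
```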

tscholak commented 3 weeks ago

Hi @jlamypoirier, that's a great roadmap. I'm creating milestones so that we can easily track progress.