foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0 · 112 stars · 17 forks
Issues
| # | Title | Author | State | Age | Comments |
|-----|-------|--------|-------|-----|----------|
| #96 | fix: Correct the typo | Akash-Nayak | opened | 5 days ago | 0 |
| #95 | Re-add support for max_ckpt | daviswer | closed | 1 week ago | 1 |
| #94 | FMS-FSDP running on A100 8GPU machine failed with NCCL error messages | htang2012 | opened | 3 weeks ago | 0 |
| #93 | The default model variant is 7b but it is not supported. | htang2012 | opened | 3 weeks ago | 2 |
| #92 | Repeatability of Small Model Training Script with fixed seed(s) and same dataset | pad9153 | opened | 3 weeks ago | 1 |
| #91 | Support nested folders for datasets | thinkahead | opened | 3 weeks ago | 1 |
| #90 | Allow nested folders for datasets with arrow files | thinkahead | opened | 3 weeks ago | 0 |
| #89 | Question on 7B H100 MFU | jasonkrone | closed | 1 month ago | 2 |
| #88 | The model conversion to hf is broken with the latest Fused GatedLinearUnit Support in ibm-fms 0.0.6 | thinkahead | closed | 4 weeks ago | 2 |
| #87 | update conversion script | lchu-ibm | closed | 1 month ago | 0 |
| #86 | Remove initial -100 label from CLM targets | daviswer | closed | 1 month ago | 0 |
| #85 | Remove checkpoint dataset diag print | daviswer | closed | 1 month ago | 0 |
| #84 | More comprehensive dummy token handling | daviswer | closed | 1 month ago | 0 |
| #83 | Loader streamlining and cleaning | daviswer | closed | 1 month ago | 0 |
| #82 | Pull Llama3 token fix into loader-cleaner | daviswer | closed | 1 month ago | 0 |
| #81 | Add proper BOS support to dataloader | daviswer | closed | 1 month ago | 0 |
| #80 | Remove Llama2 drop_last_token arg | lchu-ibm | closed | 1 month ago | 0 |
| #79 | Update HF deps to keep up with FMS main | ani300 | closed | 1 month ago | 0 |
| #78 | Not Able to Reproduce Multi-Node Throughput for 7B Model on 8 Node H100 Cluster | jasonkrone | closed | 2 months ago | 3 |
| #77 | add llama3 1b version | lchu-ibm | closed | 2 months ago | 0 |
| #76 | add llama3 8b config | lchu-ibm | closed | 2 months ago | 0 |
| #75 | Enable asynchronous dataloading | daviswer | closed | 1 month ago | 4 |
| #74 | Update train_specu to main | daviswer | closed | 2 months ago | 0 |
| #73 | update default configs | lchu-ibm | closed | 2 months ago | 0 |
| #72 | populate more configs and metrics to Tracker | lchu-ibm | closed | 2 months ago | 0 |
| #71 | Improve the speed of fms-hf converter | lchu-ibm | closed | 2 months ago | 0 |
| #70 | A revisit on improving the performance of Data Loader | lchu-ibm | opened | 2 months ago | 2 |
| #69 | Dataloader updates | daviswer | closed | 2 months ago | 8 |
| #68 | centralize place for defining block | lchu-ibm | closed | 3 months ago | 0 |
| #67 | Unable to Replicate MFU for 7B on 80gb A100 | jasonkrone | closed | 3 months ago | 3 |
| #66 | [speculator training] Support for loading different HF checkpoints for speculator training | pavi2707 | opened | 3 months ago | 1 |
| #65 | switch to new meta device init method | lchu-ibm | closed | 3 months ago | 5 |
| #64 | A write-up on Meta Device Init x Pretraining | lchu-ibm | opened | 3 months ago | 0 |
| #63 | maximize mistral throughput | aldopareja | opened | 3 months ago | 2 |
| #62 | [speculator training] Update benchmark_speculator_logical.py to support gpt_bigcode/granite | sahilsuneja1 | opened | 3 months ago | 9 |
| #61 | add support for converting compiled model to hf | lchu-ibm | closed | 3 months ago | 5 |
| #60 | add support for rank0 only profiler | lchu-ibm | closed | 3 months ago | 1 |
| #59 | make fms-to-hf support for "compiled" model | lchu-ibm | closed | 3 months ago | 0 |
| #58 | add Rank0-only profiler | lchu-ibm | closed | 3 months ago | 0 |
| #57 | more flexible selective ac | lchu-ibm | closed | 3 months ago | 3 |
| #56 | make selective ac more flexible. | lchu-ibm | closed | 3 months ago | 9 |
| #55 | add Aim support | lchu-ibm | closed | 3 months ago | 4 |
| #54 | fix meta device initialization for very large models | mayank31398 | closed | 3 months ago | 6 |
| #53 | revert "raise Dynamo accumulated cache size limit" | lchu-ibm | opened | 3 months ago | 0 |
| #52 | fix grow factor in hf conversion | lchu-ibm | closed | 3 months ago | 2 |
| #51 | increase accumulated_cache_size_limit to 128 to make 70b compile-able | lchu-ibm | closed | 3 months ago | 0 |
| #50 | lint | lchu-ibm | closed | 3 months ago | 0 |
| #49 | re-order compile fix | lchu-ibm | closed | 3 months ago | 0 |
| #48 | fix compile rope | lchu-ibm | closed | 3 months ago | 0 |
| #47 | add rope fix for rope to work with compile | lchu-ibm | closed | 3 months ago | 0 |