foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
https://pytorch.org/docs/stable/fsdp.html
Apache License 2.0 · 112 stars · 17 forks
Issues
| # | Title | Author | State | Age | Comments |
|-----|-------|--------|-------|-----|----------|
| #96 | fix: Correct the typo | Akash-Nayak | opened | 5 days ago | 0 |
| #95 | Re-add support for max_ckpt | daviswer | closed | 1 week ago | 1 |
| #94 | FMS-FSDP running on A100 8GPU machine failed with NCCL error messages | htang2012 | opened | 3 weeks ago | 0 |
| #93 | The default model variant is 7b but it is not supported. | htang2012 | opened | 3 weeks ago | 2 |
| #92 | Repeatability of Small Model Training Script with fixed seed(s) and same dataset | pad9153 | opened | 3 weeks ago | 1 |
| #91 | Support nested folders for datasets | thinkahead | opened | 3 weeks ago | 1 |
| #90 | Allow nested folders for datasets with arrow files | thinkahead | opened | 3 weeks ago | 0 |
| #89 | Question on 7B H100 MFU | jasonkrone | closed | 1 month ago | 2 |
| #88 | The model conversion to hf is broken with the latest Fused GatedLinearUnit Support in ibm-fms 0.0.6 | thinkahead | closed | 4 weeks ago | 2 |
| #87 | update conversion script | lchu-ibm | closed | 1 month ago | 0 |
| #86 | Remove initial -100 label from CLM targets | daviswer | closed | 1 month ago | 0 |
| #85 | Remove checkpoint dataset diag print | daviswer | closed | 1 month ago | 0 |
| #84 | More comprehensive dummy token handling | daviswer | closed | 1 month ago | 0 |
| #83 | Loader streamlining and cleaning | daviswer | closed | 1 month ago | 0 |
| #82 | Pull Llama3 token fix into loader-cleaner | daviswer | closed | 1 month ago | 0 |
| #81 | Add proper BOS support to dataloader | daviswer | closed | 1 month ago | 0 |
| #80 | Remove Llama2 drop_last_token arg | lchu-ibm | closed | 1 month ago | 0 |
| #79 | Update HF deps to keep up with FMS main | ani300 | closed | 1 month ago | 0 |
| #78 | Not Able to Reproduce Multi-Node Throughput for 7B Model on 8 Node H100 Cluster | jasonkrone | closed | 2 months ago | 3 |
| #77 | add llama3 1b version | lchu-ibm | closed | 2 months ago | 0 |
| #76 | add llama3 8b config | lchu-ibm | closed | 2 months ago | 0 |
| #75 | Enable asynchronous dataloading | daviswer | closed | 1 month ago | 4 |
| #74 | Update train_specu to main | daviswer | closed | 2 months ago | 0 |
| #73 | update default configs | lchu-ibm | closed | 2 months ago | 0 |
| #72 | populate more configs and metrics to Tracker | lchu-ibm | closed | 2 months ago | 0 |
| #71 | Improve the speed of fms-hf converter | lchu-ibm | closed | 2 months ago | 0 |
| #70 | A revisit on improving the performance of Data Loader | lchu-ibm | opened | 2 months ago | 2 |
| #69 | Dataloader updates | daviswer | closed | 2 months ago | 8 |
| #68 | centralize place for defining block | lchu-ibm | closed | 3 months ago | 0 |
| #67 | Unable to Replicate MFU for 7B on 80gb A100 | jasonkrone | closed | 3 months ago | 3 |
| #66 | [speculator training] Support for loading different HF checkpoints for speculator training | pavi2707 | opened | 3 months ago | 1 |
| #65 | switch to new meta device init method | lchu-ibm | closed | 3 months ago | 5 |
| #64 | A write-up on Meta Device Init x Pretraining | lchu-ibm | opened | 3 months ago | 0 |
| #63 | maximize mistral throughput | aldopareja | opened | 3 months ago | 2 |
| #62 | [speculator training] Update benchmark_speculator_logical.py to support gpt_bigcode/granite | sahilsuneja1 | opened | 3 months ago | 9 |
| #61 | add support for converting compiled model to hf | lchu-ibm | closed | 3 months ago | 5 |
| #60 | add support for rank0 only profiler | lchu-ibm | closed | 3 months ago | 1 |
| #59 | make fms-to-hf support for "compiled" model | lchu-ibm | closed | 3 months ago | 0 |
| #58 | add Rank0-only profiler | lchu-ibm | closed | 3 months ago | 0 |
| #57 | more flexible selective ac | lchu-ibm | closed | 3 months ago | 3 |
| #56 | make selective ac more flexible. | lchu-ibm | closed | 3 months ago | 9 |
| #55 | add Aim support | lchu-ibm | closed | 3 months ago | 4 |
| #54 | fix meta device initialization for very large models | mayank31398 | closed | 3 months ago | 6 |
| #53 | revert "raise Dynamo accumulated cache size limit" | lchu-ibm | opened | 3 months ago | 0 |
| #52 | fix grow factor in hf conversion | lchu-ibm | closed | 3 months ago | 2 |
| #51 | increase accumulated_cache_size_limit to 128 to make 70b compile-able | lchu-ibm | closed | 3 months ago | 0 |
| #50 | lint | lchu-ibm | closed | 3 months ago | 0 |
| #49 | re-order compile fix | lchu-ibm | closed | 3 months ago | 0 |
| #48 | fix compile rope | lchu-ibm | closed | 3 months ago | 0 |
| #47 | add rope fix for rope to work with compile | lchu-ibm | closed | 3 months ago | 0 |