aws / sagemaker-training-toolkit
Train machine learning models within a Docker container using Amazon SageMaker.
Apache License 2.0
496 stars
118 forks
issues
Build failure on MacOS
#225
DRKolev-code
opened
1 day ago
0
[TEST]
#224
SecurityResearcher-yoda
closed
2 months ago
0
Fix: Preserve hyperparameter order when invoking training jobs
#223
vsimkus
opened
2 months ago
0
Silent Failure if custom image puts something into /opt/ml/code
#222
njbrake
opened
2 months ago
0
SageMaker training toolkit reorders hyperparameters
#221
vsimkus
opened
2 months ago
0
feature: Add p5 as a supported NCCL instance
#220
andjsmi
closed
2 months ago
0
Add 'ml.p5.48xlarge' as a supported instance for SM_EFA_NCCL_INSTANCES.
#219
andjsmi
opened
2 months ago
0
Extend documentation regarding distributed training for own Docker containers.
#218
marseller
opened
2 months ago
1
fix: typo in the run unit tests command
#217
bhaoz
closed
3 months ago
0
fix: run unit tests in sequence order for release process as well to prevent coverage conflicting issues
#216
bhaoz
closed
3 months ago
0
chore: removing unnecessary logging information
#215
bhaoz
closed
3 months ago
0
feature: Add support for py39 and py310
#214
prtsh
closed
3 months ago
1
Validate smddprun() fails with file not found error on AL2023
#213
jimmyrigby94
closed
7 months ago
1
build test
#212
emeraldbay
opened
8 months ago
0
feature: add python module entrypoint type, add python module support…
#211
clumsy
opened
8 months ago
1
feature: add python module entrypoint type, add python module support…
#210
clumsy
closed
8 months ago
1
feature: add python module entrypoint type, add python module support…
#209
clumsy
closed
8 months ago
1
Add TFlops calculator and stuck job monitor
#208
emeraldbay
opened
9 months ago
0
Get region with ENV var
#207
austinmw
opened
9 months ago
0
Invalid dash-separated options for description-file
#206
wickeat
opened
11 months ago
0
feature: add python module entrypoint type, add python module support to torch_distributed
#205
clumsy
closed
8 months ago
5
Training Job "Successful" despite failing due to 100% disk usage
#204
david-waterworth
opened
1 year ago
0
change: update the boto deps to use latest boto
#203
mufaddal-rohawala
closed
1 year ago
0
change: bypass DNS check for studio local exec
#202
mufaddal-rohawala
closed
1 year ago
0
fix: toolkit build failure
#201
emeraldbay
closed
1 year ago
1
ModuleNotFoundError: Sagemaker only copies entry_point file to /opt/ml/code/ instead of the holy-cloned source code
#200
celsofranssa
opened
1 year ago
0
test
#199
emeraldbay
closed
1 year ago
0
fix: use smddprun only if it is installed
#198
ruhanprasad
closed
1 year ago
1
fix: Remove Python 3.7 to fix the CI
#197
emeraldbay
closed
1 year ago
0
fix: Add NCCL_PROTO=simple environment variable to handle the out-of-order…
#196
ruhanprasad
closed
1 year ago
0
fix: Test CI
#195
emeraldbay
closed
1 year ago
0
fix: SMDDP does not support P5 instances with SMP
#194
apoorvtintin
closed
1 year ago
1
Issue when training in local mode with huggingface training container
#193
ojturner
opened
1 year ago
1
fix: SMDDP does not support P5 instances with SMP
#192
apoorvtintin
closed
1 year ago
2
P5 instance support
#191
haozhx23
opened
1 year ago
0
feat: Initial change for Sagemaker provided health check
#190
emeraldbay
closed
1 year ago
0
feat: support codeartifact for installing requirements.txt packages
#189
humanzz
closed
1 year ago
2
dummy commit to test CI/CD
#188
emeraldbay
closed
1 year ago
0
feat: support codeartifact for installing requirements.txt packages
#187
humanzz
closed
1 year ago
5
Adding sys.path to PYTHONPATH breaks virtual environments
#186
pdveenstra
opened
1 year ago
0
Add SM dataparallel exception class in mpi distribution
#185
stu1130
closed
1 year ago
1
Deepspeed Launcher
#184
anupam-dewan
opened
1 year ago
0
Added supported for neuron_parallel_compile for trn1 (trainium)
#183
VijayNiles
closed
1 year ago
1
Add NCCL_ALGO env var for modelparallel jobs
#182
yongyanrao
closed
1 year ago
2
unpin sagemaker version as the credential issue fixed
#180
yl-to
closed
1 year ago
0
Testing PR for SageMaker version
#179
yl-to
closed
1 year ago
0
fix: increase worker waiting time for ORTE proc
#178
yl-to
closed
1 year ago
1
change: upagrade protobuf version for tensorflow 2.12
#177
yl-to
closed
1 year ago
0
fix: Revert SMDDP collectives feature from smdataparallel runner
#176
vishwakaria
closed
1 year ago
0
Fix: to fix SMTrainingCompilerConfigurationError handling in process.py
#175
vinayburugu
closed
1 year ago
8