aws
/
sagemaker-training-toolkit
Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
488
stars
117
forks
issues
[TEST]
#224
SecurityResearcher-yoda
closed
2 weeks ago
0
Fix: Preserve hyperparameter order when invoking training jobs
#223
vsimkus
opened
3 weeks ago
0
Silent Failure if custom image puts something into /opt/ml/code
#222
njbrake
opened
3 weeks ago
0
SageMaker training toolkit reorders hyperparameters
#221
vsimkus
opened
3 weeks ago
0
feature: Add p5 as a supported NCCL instance
#220
andjsmi
closed
3 weeks ago
0
Add 'ml.p5.48xlarge' as a supported instance for SM_EFA_NCCL_INSTANCES.
#219
andjsmi
opened
3 weeks ago
0
Extend documentation regarding distributed training for own Docker containers.
#218
marseller
opened
1 month ago
1
fix: typo in the run unit tests command
#217
bhaoz
closed
1 month ago
0
fix: run unit tests in sequence order for release process as well to prevent coverage conflicting issues
#216
bhaoz
closed
1 month ago
0
chore: removing unnecessary logging information
#215
bhaoz
closed
1 month ago
0
feature: Add support for py39 and py310
#214
prtsh
closed
1 month ago
1
Validate smddprun() fails with file not found error on AL2023
#213
jimmyrigby94
closed
6 months ago
1
build test
#212
emeraldbay
opened
6 months ago
0
feature: add python module entrypoint type, add python module support…
#211
clumsy
opened
6 months ago
1
feature: add python module entrypoint type, add python module support…
#210
clumsy
closed
6 months ago
1
feature: add python module entrypoint type, add python module support…
#209
clumsy
closed
6 months ago
1
Add TFlops calculator and stuck job monitor
#208
emeraldbay
opened
7 months ago
0
Get region with ENV var
#207
austinmw
opened
7 months ago
0
Invalid dash-separated options for description-file
#206
wickeat
opened
9 months ago
0
feature: add python module entrypoint type, add python module support to torch_distributed
#205
clumsy
closed
6 months ago
5
Training Job "Successful" despite failing due to 100% disk usage
#204
david-waterworth
opened
10 months ago
0
change: update the boto deps to use latest boto
#203
mufaddal-rohawala
closed
11 months ago
0
change: bypass DNS check for studio local exec
#202
mufaddal-rohawala
closed
11 months ago
0
fix: toolkit build failure
#201
emeraldbay
closed
11 months ago
1
ModuleNotFoundError: Sagemaker only copies entry_point file to /opt/ml/code/ instead of the holy-cloned source code
#200
celsofranssa
opened
11 months ago
0
test
#199
emeraldbay
closed
11 months ago
0
fix: use smddprun only if it is installed
#198
ruhanprasad
closed
11 months ago
1
fix: Remove Python 3.7 to fix the CI
#197
emeraldbay
closed
11 months ago
0
fix: Add NCCL_PROTO=simple environment variable to handle the out-of-order…
#196
ruhanprasad
closed
11 months ago
0
fix: Test CI
#195
emeraldbay
closed
11 months ago
0
fix: SMDDP does not support P5 instances with SMP
#194
apoorvtintin
closed
1 year ago
1
Issue when training in local mode with huggingface training container
#193
ojturner
opened
1 year ago
0
fix: SMDDP does not support P5 instances with SMP
#192
apoorvtintin
closed
11 months ago
2
P5 instance support
#191
haozhx23
opened
1 year ago
0
feat: Initial change for Sagemaker provided health check
#190
emeraldbay
closed
11 months ago
0
feat: support codeartifact for installing requirements.txt packages
#189
humanzz
closed
1 year ago
2
dummy commit to test CI/CD
#188
emeraldbay
closed
1 year ago
0
feat: support codeartifact for installing requirements.txt packages
#187
humanzz
closed
1 year ago
5
Adding sys.path to PYTHONPATH breaks virtual environments
#186
pdveenstra
opened
1 year ago
0
Add SM dataparallel exception class in mpi distribution
#185
stu1130
closed
1 year ago
1
Deepspeed Launcher
#184
anupam-dewan
opened
1 year ago
0
Added supported for neuron_parallel_compile for trn1 (trainium)
#183
VijayNiles
closed
1 year ago
1
Add NCCL_ALGO env var for modelparallel jobs
#182
yongyanrao
closed
1 year ago
2
unpin sagemaker version as the credential issue fixed
#180
yl-to
closed
1 year ago
0
Testing PR for SageMaker version
#179
yl-to
closed
1 year ago
0
fix: increase worker waiting time for ORTE proc
#178
yl-to
closed
1 year ago
1
change: upagrade protobuf version for tensorflow 2.12
#177
yl-to
closed
1 year ago
0
fix: Revert SMDDP collectives feature from smdataparallel runner
#176
vishwakaria
closed
1 year ago
0
Fix: to fix SMTrainingCompilerConfigurationError handling in process.py
#175
vinayburugu
closed
1 year ago
8
Publish wheels to PyPI
#174
hajapy
opened
1 year ago
0