aws/sagemaker-training-toolkit
Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
496 stars
118 forks
issues
Publish wheels to PyPI
#174
hajapy
opened
1 year ago
0
Passing SIGTERM to entrypoint to be able to handle SPOT failures gracefully in user-code
#173
croth1
opened
1 year ago
0
fix: SMTrainingCompilerConfigurationError takes no keyword argument
#172
ShiboXing
closed
1 year ago
0
Fix: Add SMTrainingCompilerConfigurationError to the list of registered exception classes.
#171
vinayburugu
closed
1 year ago
4
change: update libraries for SMDDP collectives validation
#170
vishwakaria
closed
1 year ago
0
Upgrade protobuf to prevent conflicts with smdebugger.
#169
josephevans
closed
1 year ago
0
Feature: To modify pytorch_xla configuration errors to SMTrainingCompilerConfigurationError
#168
vinayburugu
closed
1 year ago
0
Support CodeArtifact repositories for installing Python packages
#167
humanzz
closed
1 year ago
0
Stack based error attribution for errors arising from compiler code
#166
vinayburugu
closed
1 year ago
15
Add support for p4de instances, update when FI_EFA_USE_DEVICE_RDMA flag is set to only p4d{e} instances.
#165
josephevans
closed
1 year ago
2
Remove magic strings for attributes like instance type
#164
vishwakaria
opened
1 year ago
0
Fix: To add script to build tensorflow container for integration tests
#163
vinayburugu
closed
1 year ago
2
feature: add support for SMDDP collectives to smdataparallel runner
#162
vishwakaria
closed
1 year ago
8
Python 3.6 unsupported [bug/question]
#161
adamwrobel-ext-gd
opened
2 years ago
1
Feature: Stack trace based failure attribution for SageMaker Training Compiler
#160
vinayburugu
closed
1 year ago
6
add general exception to filter
#159
roywei
closed
2 years ago
4
Mpi mode sets all nodes to the same SM_CURRENT_HOST
#158
verdimrc
opened
2 years ago
0
Feature: Register tensorflow and xla exception classes to sagemaker-t…
#157
vinayburugu
closed
2 years ago
10
Improve coverage and fix collections DeprecationWarning
#156
satishpasumarthi
closed
2 years ago
3
CVE-2007-4559 Patch
#155
TrellixVulnTeam
opened
2 years ago
0
feature: Add torch_distributed support for Trainium instances in SageMaker
#154
satishpasumarthi
closed
2 years ago
12
feature: Add neuron cores support
#153
satishpasumarthi
closed
2 years ago
1
Feature: Add Neuron core support
#152
satishpasumarthi
closed
2 years ago
1
feature: Register tensor flow and xla exception classes with sagemaker-training-toolkit
#151
vinayburugu
closed
2 years ago
62
add tensor flow exception classes to the list of exception_classes…
#150
vinayburugu
closed
2 years ago
0
change: integrate upcoming dataparallel change to modelparallel
#149
yongyanrao
closed
2 years ago
3
Avoid deprecated import via collections.abc.Mapping
#148
lorenzwalthert
closed
2 years ago
5
Fix: Args for worker nodes in smdataparallel jobs
#147
satishpasumarthi
closed
2 years ago
1
Add debugger exception to error classes
#146
yl-to
closed
2 years ago
20
fix: Improve worker nodes waiting mechanism in MPI jobs
#145
satishpasumarthi
closed
2 years ago
15
fix: Enable PT XLA distributed training on homogeneous clusters
#144
Lokiiiiii
closed
2 years ago
2
Fix: adding EFA specific setup to distributed training runner for PT-XLA
#143
Lokiiiiii
closed
2 years ago
1
change: update num_processes_per_host for smdataparallel runner
#142
vishwakaria
closed
2 years ago
1
fix: Removed version hardcoding for sagemaker test dependency
#141
jleeleee
closed
2 years ago
1
relax exception type
#140
roywei
closed
2 years ago
8
change: update distribution_instance_group for pytorch ddp
#139
vishwakaria
closed
2 years ago
2
Specify flake8 config file explicitly
#138
nish21
closed
2 years ago
4
Feature: Create a new distribution mechanism for PT-XLA
#137
Lokiiiiii
closed
2 years ago
44
fix: handle utf-8 decoding exceptions while processing std streams
#136
vishwakaria
closed
2 years ago
1
feature: Heterogeneous cluster changes
#135
satishpasumarthi
closed
2 years ago
1
update: protobuf version to overlap with TF requirements
#134
nish21
closed
2 years ago
1
SM library telemetry improvement
#133
roywei
closed
1 year ago
2
Version 4.1.4 fails to install because of the missing protobuf dependency
#132
szafranek
closed
2 years ago
2
Fix none exception class issue for mpi
#131
haohanchen-aws
closed
2 years ago
5
Feature: Adding new parameter for TF Multi Worker Mirrored Strategy
#130
Lokiiiiii
closed
2 years ago
4
No support for Python 3.10
#129
peter-wimsey
opened
2 years ago
9
Hyperparameters not shell escaped
#128
bstriner
opened
2 years ago
0
Shlex quote
#127
bstriner
opened
2 years ago
2
Pass SIGTERM to training subprocess
#126
bstriner
opened
2 years ago
9
Pass SIGTERM to training script to stop training
#125
bstriner
opened
2 years ago
0