aws/sagemaker-training-toolkit
Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
496 stars
118 forks
issues
fix: fix flaky issue with incorrect rc being given
#124
matherit
closed
2 years ago
2
Use framework provided error class and stack trace as error message
#123
roywei
closed
2 years ago
17
fix: missing args when shell script is used
#122
satishpasumarthi
closed
2 years ago
6
add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22
#121
ydaiming
closed
2 years ago
15
feature: Add Native Pytorch DDP Support
#120
satishpasumarthi
closed
2 years ago
10
Arguments not always accessible when using bash script for training job
#119
marcelgwerder
closed
2 years ago
3
Enable custom failure logging
#118
satishpasumarthi
closed
2 years ago
5
Should to_cmd_args pass complex types through json.dumps instead of str?
#117
croth1
closed
2 years ago
1
Switch to asyncio.create_subprocess_exec
#116
unoebauer
closed
2 years ago
7
Hyperparameters and other cmd arguments are not passed to shell entrypoint in tensorflow > 2.4
#115
unoebauer
closed
2 years ago
2
WIP - Allow entrypoint definition via python
#114
ghost
closed
2 years ago
1
Add Custom_Overrides flag.
#113
mathephysicist
closed
2 years ago
1
Custom_Overrides
#112
mathephysicist
opened
2 years ago
0
Feature: Allow user script override of failure reason file
#111
athewsey
closed
2 years ago
1
Unable to install sagemaker-training on Windows
#110
martinlyra
opened
3 years ago
4
breaking: Add py38, dropped py36 and py2 support. Bump pypi to 4.0.0 …
#109
satishpasumarthi
closed
3 years ago
2
Fix logging issues
#108
satishpasumarthi
closed
3 years ago
21
SageMaker Local Mode does not Inject default environment variables
#107
edgBR
closed
2 years ago
9
Reverted -x FI_EFA_USE_DEVICE_RDMA=1 to fix a crash on PyTorch Datalo…
#106
piyushghai
closed
3 years ago
1
How does sagemaker-training-toolkit complement sagemaker-python-sdk?
#105
yanhong-zhao-ef
opened
3 years ago
0
Don't set `sagemaker_s3_output` via hyperparameter
#104
samuel-massinon
opened
3 years ago
0
change: [smdataparallel] better messages to establish the SSH connection between workers
#103
karan6181
closed
3 years ago
4
feature: configure IF_NAME for SMDATAPARALLEL
#102
yselivonchyk
closed
2 years ago
7
feature: smdataparallel enable EFA RDMA flag
#101
karan6181
closed
3 years ago
4
feature: configure IF_NAME for SMDATAPARALLEL
#100
yselivonchyk
closed
3 years ago
1
feature: smdataparallel custom mpi options support
#99
karan6181
closed
3 years ago
9
Update Dockerfile to accomomdate Rust dependency.
#98
rajanksin
closed
3 years ago
2
Pass in environment variables for Estimator training job
#97
PiercingDan
closed
3 years ago
3
Change: smdataparallel change FI_PROVIDER to efa from sockets
#96
karan6181
closed
3 years ago
5
change: set btl_vader_single_copy_mechanism to none to avoid Read -1 Warning messages
#95
karan6181
closed
3 years ago
2
Add gcc package requirment
#94
timorkal
opened
3 years ago
0
how to pass predictor script for running batch transformers.
#93
AnkurShukla85
opened
3 years ago
0
SageMaker Endpoint stuck at “Creating”
#92
vas610
opened
3 years ago
0
Update PyPI classifiers to include py38
#91
seanpmorgan
closed
2 years ago
3
change: set btl_vader_single_copy_mechanism to none
#90
metrizable
closed
3 years ago
4
fix:decode binary stderr string before dumping it out
#89
icywang86rui
closed
3 years ago
2
feature: add reinvent 2020 features
#88
ChoiByungWook
closed
3 years ago
7
infra: use ECR-hosted image for ubuntu:16.04
#87
ajaykarpur
closed
3 years ago
4
fix: workaround for printing stderr
#86
sboshin
closed
3 years ago
3
Sagemaker Fails to download code from S3
#85
uwaisiqbal
opened
3 years ago
0
Which aws service is the most suitable/used to launch a scheduled training job?
#84
david-fortini
opened
3 years ago
0
Enable functional test for mpi
#83
ChaiBapchya
opened
4 years ago
0
Model and output files do not get saved to S3 when training own model
#82
fiocam
opened
4 years ago
5
documentation: typo fix on ENVIRONMENT_VARIABLES.md
#81
pbmartins
closed
4 years ago
3
How to save non-model artifacts from a container (output_data_dir)
#80
diegodebrito
opened
4 years ago
2
fix: propagate log level to aws services
#79
chuyang-deng
closed
4 years ago
13
drop local test skip since local mode fix got merged
#78
ChaiBapchya
closed
3 years ago
8
Enhance UX for training
#77
ehsanmok
opened
4 years ago
0
fix: check for script entry point even if setup.py is present
#76
ajaykarpur
closed
4 years ago
1
Bash script mode support across all estimators
#75
ehsanmok
closed
4 years ago
1