aws/sagemaker-pytorch-training-toolkit
Toolkit for running PyTorch training scripts on SageMaker. Dockerfiles used for building SageMaker Pytorch Containers are at https://github.com/aws/deep-learning-containers.
Apache License 2.0
197 stars, 87 forks
issues
#256 change: Update sagemaker-training to 4.7.3 (staubhp, closed, 11 months ago, 1 comment)
#255 ModuleNotFoundError: Sagemaker only copies entry_point file to /opt/ml/code/ instead of the holy-cloned source code (celsofranssa, closed, 10 months ago, 3 comments)
#254 [FATAL tini (7)] exec train failed: No such file or directory (celsofranssa, opened, 11 months ago, 0 comments)
#253 "Train": executable file not found in $PATH (celsofranssa, opened, 11 months ago, 0 comments)
#252 change: bypass DNS check for studio local exec (mufaddal-rohawala, closed, 11 months ago, 12 comments)
#251 Fix: pin coverage version to fix pipeline issue (yl-to, closed, 1 year ago, 4 comments)
#250 Add PyTorch version environment variable, to facilitate SMTT (yongyanrao, closed, 1 year ago, 6 comments)
#248 feature: Add torch_distributed support for Trainium (satishpasumarthi, closed, 1 year ago, 12 comments)
#247 CVE-2007-4559 Patch (TrellixVulnTeam, opened, 1 year ago, 0 comments)
#246 documentation: update README and contributing guidelines (satishpasumarthi, closed, 2 years ago, 4 comments)
#245 Update README.rst with how it related to SMTT (gilinachum, closed, 2 years ago, 3 comments)
#244 fix: provide option to use native process launcher (satishpasumarthi, closed, 2 years ago, 30 comments)
#243 aaa (cyberitech, closed, 2 years ago, 0 comments)
#242 aaa (cyberitech, closed, 2 years ago, 0 comments)
#241 Feature: Support new distribution mechanism for PT-XLA (Lokiiiiii, closed, 2 years ago, 8 comments)
#240 Test/fix (nish2104, closed, 2 years ago, 5 comments)
#239 test: empty commit (nish21, closed, 2 years ago, 4 comments)
#238 fix: derive master node from training environment (satishpasumarthi, closed, 2 years ago, 8 comments)
#237 upodate (gijayah213, closed, 2 years ago, 4 comments)
#236 feature: add support for native PT DDP distribution (vishwakaria, closed, 2 years ago, 28 comments)
#235 feature: Add Heterogeneous Cluster support (satishpasumarthi, closed, 2 years ago, 17 comments)
#234 fix: CI changes (satishpasumarthi, closed, 2 years ago, 29 comments)
#233 empty commit to trigger ci (nish21, closed, 2 years ago, 16 comments)
#232 [bug] Torch does not find GPU on pytorch-training:1.10.0-gpu-py38 container (sergii-ivakhno-kidsloop, opened, 2 years ago, 0 comments)
#231 feature: Added Native Pytorch DDP support (satishpasumarthi, closed, 2 years ago, 8 comments)
#230 Environment variables set for NCCL and Distributed training are not passed onto the sagemaker-training entrypoint (thecooltechguy, closed, 3 years ago, 1 comment)
#229 model_fn is not recognized. Sagemaker Studio template for model building, training, and deployment (babarory, opened, 3 years ago, 1 comment)
#228 Dockerfile installation of torch and torchvision from s3, replacing original versions. (akinolawilson, opened, 3 years ago, 0 comments)
#227 Example use case (akinolawilson, opened, 3 years ago, 2 comments)
#226 Error importing torchaudio (bbalaji-ucsd, opened, 3 years ago, 2 comments)
#225 feature: add reinvent 2020 features (ChoiByungWook, closed, 3 years ago, 73 comments)
#224 fix: not raising excpetion if no image to delete (chuyang-deng, opened, 3 years ago, 4 comments)
#223 Getting cudnn error while training on ml.p2.xlarge instance (shubham-scisar, closed, 4 years ago, 2 comments)
#222 cannot recognize num_gpus for more than 1 gpu per instance (zhaoanbei, closed, 3 years ago, 4 comments)
#221 change: Update main buildspec to only perform CPU integration tests (bveeramani, closed, 4 years ago, 15 comments)
#220 change: Pin SageMaker version to less than v2 (bveeramani, closed, 4 years ago, 3 comments)
#219 docs: Fix docstring style in training.py (bveeramani, closed, 4 years ago, 6 comments)
#218 change: Add GPU and unit test buildspecs (bveeramani, closed, 4 years ago, 4 comments)
#217 feature: Use MPIRunnerType (bveeramani, closed, 4 years ago, 55 comments)
#216 feature: update pytorch vanilla version to 1.6.0 (chuyang-deng, closed, 4 years ago, 3 comments)
#215 FastAI v1.0.59 causes failed training job (ghost, closed, 4 years ago, 1 comment)
#214 infra: add issue templates (ajaykarpur, closed, 4 years ago, 4 comments)
#213 doc: remove confusing information from the Readme. (nadiaya, closed, 4 years ago, 3 comments)
#212 infra: do not duplicate test dependencies in tox.ini (nadiaya, closed, 4 years ago, 20 comments)
#211 fix: Rename buildspec files. (nadiaya, closed, 4 years ago, 4 comments)
#210 fix: bump version of sagemaker-training for script entry point fix (ajaykarpur, closed, 4 years ago, 4 comments)
#209 infra: Make docker folder read only, remove unused tests. (nadiaya, closed, 4 years ago, 6 comments)
#208 unable to build final dockerfile.cpu (Vertika09, closed, 4 years ago, 4 comments)
#207 Pytorch 1.5 build issue (dwang-sflscientific, closed, 4 years ago, 2 comments)
#206 change: install ipywidgets in 1.5.0 Python 3 Dockerfiles (laurenyu, closed, 4 years ago, 2 comments)