diux-dev/cluster
train on AWS
75 stars, 15 forks
issues
p3 instances should self-terminate after 1 hour by default
#25
yaroslavvb
closed
6 years ago
2
Have shorter "Request valid until" on spot requests
#24
yaroslavvb
closed
6 years ago
2
Fail early when volume attachment fails
#23
yaroslavvb
closed
6 years ago
0
start high-perf instances with Detailed Monitoring enabled by default
#22
yaroslavvb
closed
6 years ago
1
add gdb, nload, extra packages to default install
#21
yaroslavvb
closed
6 years ago
1
Speed up experiments by reusing root EBS volumes
#20
yaroslavvb
closed
6 years ago
10
MPI flags
#19
yaroslavvb
closed
6 years ago
0
Feature: replace all prints with "logs" that include timestamps
#18
yaroslavvb
closed
6 years ago
0
Sometimes get " torch._C._cuda_setDevice(device) RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at torch/csrc/cuda/Module.cpp:32"
#17
yaroslavvb
opened
6 years ago
0
Change multiprocessing start method to "spawn"
#16
yaroslavvb
closed
6 years ago
1
synchronizing in fp16 instead of fp32
#15
yaroslavvb
closed
6 years ago
5
overlapping transfer and computation in PyTorch all_reduce
#14
yaroslavvb
closed
6 years ago
2
distributed checkpoint saving
#13
yaroslavvb
closed
6 years ago
1
Things hang when using multiple machines + dist-url:file: init method
#12
yaroslavvb
closed
6 years ago
1
Things seemingly hang in "Creating data loaders"
#11
yaroslavvb
closed
6 years ago
0
Bug in AdamW
#10
sgugger
closed
6 years ago
1
Jobs use existing stopped instances if found
#9
bearpelican
closed
6 years ago
0
keypair creation race condition
#8
yaroslavvb
closed
6 years ago
0
Use custom resources for PS machines.
#7
robertnishihara
closed
6 years ago
0
Fix "0 subnets, but 3 zones" error
#6
yaroslavvb
closed
6 years ago
0
simplify connect2/connect
#5
yaroslavvb
opened
6 years ago
0
clarify error message when zone doesn't match
#4
yaroslavvb
closed
6 years ago
0
Fix dtype and limit memory.
#3
robertnishihara
closed
6 years ago
0
Need timeout on instances
#2
yaroslavvb
closed
6 years ago
1
Trouble running ray_integration/launch_simple.py
#1
robertnishihara
closed
7 years ago
4