-
For those struggling to replicate this, below is a Google Colab version that runs on Python 3.10 (Linux):
!git clone https://github.com/Yujun-Shi/DragDiffusion.git
%cd /conte…
-
# Update
It seems to have something to do with `--machine.num-devices 8`. Without that argument, training works as expected, at least for `nerfacto`. I will test `splatfacto` later since the …
-
**Describe the bug**
When I upgrade to DeepSpeed 0.14.3, training does not progress because all gradients and gradient norms are zero. Based on git bisect, I think the regression comes from this PR:
https://git…
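
For anyone trying to confirm the same symptom, here is a minimal sketch of an all-zero gradient check. It is plain Python with hypothetical helper names (`global_grad_norm`, `all_grads_zero`) and is not part of DeepSpeed. It assumes each parameter's gradient has already been flattened into a sequence of floats; in a real training loop those values would come from the parameters' gradient tensors.

```python
import math
from typing import Iterable, Sequence


def global_grad_norm(grads: Iterable[Sequence[float]]) -> float:
    """L2 norm over all gradient values, flattened across parameters.

    Hypothetical helper for illustration only.
    """
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def all_grads_zero(grads: Iterable[Sequence[float]], tol: float = 0.0) -> bool:
    """Return True when the global gradient norm does not exceed `tol`."""
    return global_grad_norm(grads) <= tol


# Example: per-parameter gradients captured after a backward pass (made-up values).
healthy = [[0.1, -0.2], [0.05]]
broken = [[0.0, 0.0], [0.0]]
print(all_grads_zero(healthy), all_grads_zero(broken))
```

Logging this value once per step on both DeepSpeed versions would show whether the gradients are already zero before the optimizer step.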
-
### What type of bug is this?
Unexpected error
### What subsystems are affected?
Distributed Cluster, Query Engine
### Minimal reproduce step
1. Boot GreptimeDB cluster (Minio + Disk Cache)
2. R…
-
### 🐛 Describe the bug
With the NCCL backend, `isend` blocks if there is no matching `irecv` from the peer. Running the script below with 2 workers results in rank 1 finishing while rank 0 hangs.
However, if you switch fr…
-
### 🚀 The feature, motivation and pitch
Occasionally contributing a C++ or CUDA PR can be a very daunting task because of the computing resources and time required to completely compile PyTorch from s…
-
## ❓ Questions and Help
#### What is your question?
I'm getting OOM errors while training wav2vec in a multi-GPU environment, and training appears to freeze. The problem goes away when I run on a single GPU.
NCC…
-
### 🐛 Describe the bug
# problem
When a frozen module has an unused input that requires gradients, reshard happens without unshard, leading to the runtime assertion error "Expects storage to be allocated".
* unshard won…
-
Hazelcast is continuously throwing a `java.io.IOException: Packet not send to [10.60.0.229]:5701` exception. This happens after a one-time error of `java.lang.NoClassDefFoundError: com/hazelcast/internal/n…
-
Hi APEX,
Can you please suggest how to work around the failed "c10d no_copy" assertion in
https://github.com/NVIDIA/apex/blob/master/apex/contrib/optimizers/distributed_fused_lamb.py#L140?
```
…