-
As introduced by Nitin in our meeting on 23 May, a new architecture (/protocol?) is proposed for performing MPC in distributed settings (e.g. Solid-like), called PPC evaluating. We are doing bench…
-
Hi, I was wondering if there were any efforts on great.py natively supporting Distributed Data Parallels? Currently I am doing a workaround by editing my own trainer file and saving it via torch save.…
-
The goal of this feature is to set up flowgraphs that can span multiple nodes.
A first pass at this feature, or proof of concept, I think involves the following:
- [x] networked custom buffers for p…
-
```
Executing Cell 19--------------------------------------
INFO:notebook:Training the model...
INFO:training:Using cuda:0 of 1
INFO:training:[config] ckpt_folder -> ./temp_work_dir/./models.
…
-
### Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
### Branch Name
main
### Commit ID
e442cbc (auto-test-e442cbc)
### Other Environment Inform…
-
Hi, I am trying distributed training with 2 machines. There are 4 GPUs in each machine.
On the master machine, I run:
python -u tools/run_net.py \
--cfg configs/Kinetics/SLOWFAST_8x8_R50.yaml \
--…
-
To make our system a distributed architecture (i.e. increasing elasticity, resilience, scalability...),
there are some things Soumya has taught in class:
- [ ] (1) Concurrency Asynchronicity : Mu…
-
- Books
- Courseware
-
**Describe what's wrong**
Distributed queue gets stuck when sending big files or a big batch of files if the timeout is reached.
As the exception is being generated on the client side (the timeout happen…
-
Similar to NCCL tests for Kubernetes https://github.com/aws-samples/awsome-distributed-training/tree/main/micro-benchmarks/nccl-tests/kubernetes - it would be great if there was a similar test for NCC…