-
Hi, what are the requirements for NVLink to function? I have two machines. The first has regular PCIe 3090s, 2 x cards in NVLink; it works well, and NVLink shows activity via:
`nvidia-smi nvlink -gt r`
and DGX-1…
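For a quick sanity check on either machine, two stock `nvidia-smi` commands confirm whether the links are detected at all (a minimal sketch; output formats vary by driver version):
```
# Per-link state and speed for every GPU; bridged links show as "Active"
nvidia-smi nvlink --status

# GPU interconnect topology matrix; NVLink pairs appear as NV1/NV2/...
nvidia-smi topo -m
```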
-
@pawaitemadisoncollege
I believe that learning never ends, and I have so much to learn to be a successful developer. There was a lot of information this semester, and I might not have been able to…
-
This is the script I used for fine-tuning.
```
export HF_DATASETS_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
export PDSH_RCMD_TYPE=ssh
# NCCL settings
export GLOO_SOCKET_IFNAME=bond0
export NCCL_SO…
```
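The script is cut off above. As a hedged sketch, the NCCL exports in scripts of this shape commonly continue along these lines; the interface name `bond0` is taken from the GLOO line, and everything else here is an assumption, not the original contents:
```
# Sketch only -- a plausible continuation, not the original script
export NCCL_SOCKET_IFNAME=bond0   # bind NCCL traffic to the bonded interface
export NCCL_DEBUG=INFO            # log ring/transport setup for debugging
export NCCL_IB_DISABLE=1          # assumption: force TCP if InfiniBand is absent
```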
-
I came across your [post on Medium](https://medium.com/@kuza55/transparent-multi-gpu-training-on-tensorflow-with-keras-8b0016fd9012#.q4rzb8rik) and was instantly hooked. Nice job!
I've been develop…
-
We're trying to run training for the BERT-large topology, unpadded. We set up an nvidia-docker container to run the training workload. However, we run into an error on the unpadded run. Here's an excerpt from th…
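For context, a minimal GPU container launch of the kind described might look like the following; the image tag is a placeholder, not the one actually used for this workload:
```
# Sketch: interactive GPU container (image tag is hypothetical)
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:23.05-py3
```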
-
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Current Behavior
Viewing the ds_train_finetune.sh file:
```
cat ds_train_finetune.sh
LR=1e-4
MASTER_PORT=$(shuf -…
```
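The `MASTER_PORT` line is truncated above. A common idiom in launcher scripts picks a random unprivileged port with `shuf`; this is a sketch of that idiom, not necessarily what the original file contains:
```
# Sketch: choose a random master port in the unprivileged range
MASTER_PORT=$(shuf -n 1 -i 10000-65535)
```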
-
**Is your feature request related to a problem? Please describe.**
As we continue to move toward using less paper at events, one feature that stands out is the announcer report. It would be nice t…
-
Thread: pdtkmj-Wk-p2#comment-1599
Create a filter that adds more weight to results that are from a list of authors/contributors.
-
```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3765 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3766 closing signal SIGTERM
ER…
```
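When `torch.distributed.elastic` tears workers down like this, the SIGTERM warnings are usually a symptom; the root error sits on another rank, and rerunning with more verbose logging often surfaces it. A sketch of the usual knobs (all standard PyTorch/NCCL environment variables):
```
# Enable more verbose logs before relaunching the job
export NCCL_DEBUG=INFO                 # NCCL transport/ring diagnostics
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra collective-consistency checks
export CUDA_LAUNCH_BLOCKING=1          # surface async CUDA errors at the call site
```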
-
**[Environment]** CUDA 11.7, torch 1.13.1, T4 GPUs, Ubuntu 16.04, nvidia-dali-cuda110 1.25.0, NCCL 2.14.3, gcc 7.5
**[Run command]** `python -m torch.distributed.launch --nproc_per_node=4 train.py configs/culane_res18.py`
**[Terminal…
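A side note on the run command: `torch.distributed.launch` is deprecated as of this torch version in favor of `torchrun`. A sketch of the equivalent invocation:
```
# torchrun equivalent of the launch command above (torch >= 1.10)
torchrun --nproc_per_node=4 train.py configs/culane_res18.py
```
Note that `torchrun` passes the local rank via the `LOCAL_RANK` environment variable rather than a `--local_rank` argument, so the training script may need a small adjustment.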