-
Hi,
I'm using tf_geometric in a distributed fashion for a node classification problem. My BatchGraph contains thousands of Graphs and spans several hundred gigabytes.
When using the GCN …
-
Hi, it seems that the same code works fine with the Megatron-LM that I git-cloned in April. With the latest Megatron-LM, I get the following error from the pretrain_gpt.py code. …
-
## 🐛 Bug description
Metric computation does not work properly in distributed settings when some processes do not handle any batch in the dataset. It becomes a problem when small validation or test…
linhr updated
3 years ago
-
## ❓ Questions and Help
**Description**
I built a dataset from my corpus, using each line as an Example.
It worked fine until I tried to use it for distributed training.
It seems t…
-
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=1 --master_port=10001 --master_addr = [server ip] main_pretrain.py \
--backbone 'resnet5…
-
Hi! I'm Quentin from Hugging Face :)
Congrats on this project, it has the potential to help the community so much! Especially with large-scale and multimodal datasets.
I was wondering if you…
-
After I deployed the environment as required, I encountered a problem when reproducing the VQA task. The following error occurred when running the evaluate_vqa_rad_beam_scale.sh file. I hope to get yo…
-
I encountered this error when trying to train on the GoPro dataset:
`python -m torch.distributed.launch --nproc_per_node=1 --master_port=4321 train.py -opt options/train/GoPro/NAFNet-width32.yml --launcher pyt…
-
Hi,
Is there a simple way to run this code on a webdataset?
Thanks!