-
Hello,
I'm currently training LLaMA PRO. Initially, I expanded the model from 32 layers to 40 layers and then trained only the 8 newly added layers (every fifth layer). Therefore, I froze the original 32 …
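The freezing step described above can be sketched as follows. This is a minimal toy stand-in (a plain `nn.ModuleList`, not the actual LLaMA checkpoint), assuming the 8 new blocks were interleaved as every fifth layer, i.e. indices 4, 9, …, 39:

```python
import torch.nn as nn

# Toy stand-in for the 40-layer decoder stack (not the real LLaMA weights).
layers = nn.ModuleList(nn.Linear(8, 8) for _ in range(40))

# After depth expansion, every fifth block is new: indices 4, 9, ..., 39.
new_ids = set(range(4, 40, 5))

# Freeze the 32 original layers; leave only the 8 new ones trainable.
for i, layer in enumerate(layers):
    train_this = i in new_ids
    for p in layer.parameters():
        p.requires_grad = train_this
```

With a real Hugging Face checkpoint the loop would run over the model's decoder-layer list instead of this toy `ModuleList`, but the `requires_grad` logic is the same.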
-
Hi @deJQK,
It seems you are using Apache Arrow for distributed training.
Could you explain more about how to configure the environment for pyarrow?
I cannot start training, as I always get `FileNotFou…
-
```
(nanodet) simon@Simon:~/nanodet$ python tools/train.py config/nanodet-m-416.yml
[root][07-16 11:17:37]INFO:Using Tensorboard, logs will be saved in workspace/nanodet_m_416/logs
[root][07-16 11:17:37…
```
-
I have a question about distributed training: how can I run the idm_main.py file on my single-GPU Windows computer? My problem is the error `RuntimeError: Default process group has not been initializ…
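For reference, one common way to satisfy that error on a single-GPU machine is to initialize a one-process group manually before the training code runs. A minimal sketch using the `gloo` backend (the CUDA-oriented `nccl` backend is not supported on Windows; the address and port values are arbitrary local choices):

```python
import os
import torch.distributed as dist

# Single-machine, single-process setup; any free local port works.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# world_size=1 / rank=0: a "distributed" group containing only this process.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
```

After this call, collectives such as `dist.all_reduce` operate over the single process, so scripts written for DDP can often run unchanged.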
-
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Current Behavior
```
transformers/training_args.py", line 1712, in __setattr__
    raise FrozenInstanceError(f"ca…
```
-
When parallelizing the model, you can add one extra step:
```python
# Convert BatchNorm to SyncBatchNorm.
net = nn.SyncBatchNorm.convert_sync_batchnorm(net)
```
This ensures batch norm stays in sync across all processes.
Reference:
https://theaisummer.com/distribute…
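For context, the conversion has to happen before the model is wrapped in `DistributedDataParallel`. A small self-contained sketch (the toy model here is purely illustrative):

```python
import torch.nn as nn

# Illustrative model containing an ordinary BatchNorm layer.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# Convert every BatchNorm module to SyncBatchNorm *before* wrapping in DDP.
net = nn.SyncBatchNorm.convert_sync_batchnorm(net)

# Afterwards the model would be wrapped as usual, e.g.:
# net = nn.parallel.DistributedDataParallel(net, device_ids=[local_rank])
```

`convert_sync_batchnorm` walks the module tree and replaces each `_BatchNorm` instance in place, so it works on arbitrarily nested models.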
-
Hello authors,
Thanks so much for sharing this code.
It is very useful for fine-tuning SAM on downstream tasks : )
I reduced the dataset size, adapted the code, and ran it in **Google Colab w…
-
Hello,
I noticed a deviation from the Griffin paper in your code.
The Griffin paper states in the second part of chapter 2.4:
> We initialize Λ such that a^c is uniformly distributed between 0.…
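For illustration, an initialization of that shape can be implemented by sampling the target values of a^c uniformly and inverting the parameterization. The sketch below assumes a = σ(Λ) with a fixed exponent c, and uses placeholder bounds since the quoted interval is truncated above:

```python
import torch

c = 8.0                    # fixed exponent in the recurrence (assumed value)
A_MIN, A_MAX = 0.5, 0.99   # placeholder interval; the paper's actual bounds are truncated above

# Sample the desired values of a^c uniformly, then invert a = sigmoid(Λ).
u = torch.empty(16).uniform_(A_MIN, A_MAX)  # target values of a^c
a = u.pow(1.0 / c)                          # corresponding a
lam = torch.log(a) - torch.log1p(-a)        # Λ = logit(a)
```

By construction, `torch.sigmoid(lam).pow(c)` reproduces the uniformly sampled `u`, which is the property the quoted passage asks for.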
-
Hello, a quick question: the following problem comes up during training:
`cd Chinese-CLIP/
bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}`
Then this error appears:
`root@clip-test-d9cd48656-q2zbl:~/workspace/clip/Chinese-CLIP# bash run_scripts/…