-
I ran into a quite quirky issue. I used 2 p4d.24xlarge instances (8xA100 each) in AWS to train my model. The bash script first downloads the data, and only when the data finishes downloading does the training process start by runn…
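For reference, a minimal sketch of that download-then-train gating (the paths and launch command are hypothetical stand-ins; the actual script is truncated above):

```python
# Hypothetical reconstruction of the described flow: block until the data
# download finishes, then launch training. Paths/commands are illustrative only.
import subprocess

# Step 1: download; check=True aborts before training if the download fails.
subprocess.run(["aws", "s3", "sync", "s3://my-bucket/data", "/data"], check=True)

# Step 2: training starts only after the download call returns.
subprocess.run(["torchrun", "--nproc_per_node=8", "train.py"], check=True)
```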
-
## In short
A method that applies the doc2vec approach to graphs. Just as doc2vec works on documents of different lengths, this method obtains representations even when graph sizes differ. The whole graph is treated as the document, rooted subgraphs sampled from the graph are treated as words, and the representations are updated on that basis. It is applied to malware detection on code dependency graphs.
![image](https://user-images.githubu…
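To make the document/word analogy concrete, here is a minimal sketch of the idea: each graph is a "document" whose "words" are hashes of rooted subgraphs, approximated below with Weisfeiler-Lehman subtree hashes, fed to an off-the-shelf doc2vec. The networkx/gensim calls and toy graphs are illustrative assumptions, not the paper's exact pipeline.

```python
import networkx as nx
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def wl_words(graph, iterations=2):
    """Collect rooted-subgraph hashes (one per node per WL iteration) as 'words'."""
    hashes = nx.weisfeiler_lehman_subgraph_hashes(graph, iterations=iterations)
    return [h for per_node in hashes.values() for h in per_node]

# Toy corpus: graphs of different sizes still yield fixed-length embeddings.
graphs = [nx.path_graph(5), nx.cycle_graph(8), nx.star_graph(12)]
corpus = [TaggedDocument(words=wl_words(g), tags=[i]) for i, g in enumerate(graphs)]

# PV-DBOW (dm=0), the doc2vec variant typically used for graph embeddings.
model = Doc2Vec(corpus, vector_size=64, dm=0, min_count=1, epochs=50)
embedding = model.dv[0]  # 64-dim representation of the first graph
```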
-
### Describe the bug
Running train_controlnet_flux.py with multiple GPUs results in an NCCL timeout error after N iterations of train_dataset.map(). This error can be partially solved by initializing …
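One common mitigation, sketched below under the assumption that the script uses Accelerate as the diffusers examples do (the dataset and transform here are stand-ins): run the `map()` under `main_process_first()` so rank 0 preprocesses and caches while the other ranks wait, and raise the process-group timeout so the wait itself cannot trip NCCL.

```python
from datetime import timedelta
from accelerate import Accelerator, InitProcessGroupKwargs
from datasets import Dataset

# Give slow preprocessing more headroom than NCCL's default timeout.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=3))]
)

train_dataset = Dataset.from_dict({"pixel_values": list(range(8))})  # stand-in

def preprocess(batch):
    return batch  # placeholder for the real image/text transform

with accelerator.main_process_first():
    # Rank 0 runs map() and writes the Arrow cache; the other ranks enter the
    # block only afterwards and read the cached result instead of re-mapping.
    train_dataset = train_dataset.map(preprocess, batched=True)
```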
-
**Is your feature request related to a problem? Please describe.**
Expanding Identus capabilities through seamless connectivity with decentralized web nodes, enabling a versatile and distributed se…
-
Hi @lucidrains, thanks for this implementation.
I wonder if you're using distributed training for your [experiments](https://wandb.ai/lucidrains/lion-test/reports/Lion--VmlldzozNTY0OTQ0?accessToken…
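For anyone following along, a hedged sketch of what distributed training with this optimizer could look like: `Lion` from lion-pytorch dropped into a standard torch DDP step. The model, data, and hyperparameters are placeholders, and this is not necessarily how the linked experiments were run.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from lion_pytorch import Lion

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

# The README suggests a 3-10x smaller lr (and larger weight decay) than AdamW.
optimizer = Lion(model.parameters(), lr=1e-4, weight_decay=1e-2)

x = torch.randn(32, 128, device=local_rank)
loss = model(x).pow(2).mean()
loss.backward()       # DDP all-reduces gradients across ranks here
optimizer.step()      # Lion applies its sign-based update to the synced grads
optimizer.zero_grad()
```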
-
Hi all,
I'm not sure whether this is the right way to ask a question, and this question is strictly speaking outside of the scope of the SIG as defined in the README, but I'm hoping that someone ca…
-
```
[rank1]: Traceback (most recent call last):
[rank1]:   File "/storage/garlin/deep_learning/finetune-Qwen2-VL/finetune_distributed.py", line 200, in <module>
[rank1]:     train()
[rank1]:   File "/storage/g…
```
-
Hello,
I tested llama2-70b-lora, but replaced the model with llama2-7b on a 2-GPU 4090 node.
Running log:
```
Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and…
```
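That warning comes from Accelerate: consumer RTX 4090s don't support P2P or InfiniBand transports, so NCCL must be told not to use them. `accelerate launch` sets this automatically (which is what the log line reports); when launching with torchrun or plain python, a minimal sketch of the same workaround is:

```python
# RTX 4090s lack P2P/IB support, so disable those NCCL transports explicitly.
# Must be set before any process-group initialization.
import os

os.environ["NCCL_P2P_DISABLE"] = "1"  # no peer-to-peer over PCIe/NVLink
os.environ["NCCL_IB_DISABLE"] = "1"   # no InfiniBand transport

import torch.distributed as dist
dist.init_process_group("nccl")  # NCCL now falls back to supported transports
```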
-
## Paste the link of the GitHub organisation below and submit
https://github.com/dmlc
---
###### Please subscribe to this thread to get notified when a new repository is created