-
We need to add support for distributed training; for now we can directly make use of PyTorch DDP. Let me know if anyone wants to take this up.
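A minimal single-node sketch of what this could look like with PyTorch DDP, launched via `torchrun` (the model, data, and hyperparameters below are placeholders, not part of this codebase):

```python
# Minimal single-node PyTorch DDP sketch (placeholder model/data), launched with torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 1).cuda(local_rank)       # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))  # placeholder data
    sampler = DistributedSampler(dataset)                  # shards the data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)                           # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()                                # gradients are all-reduced by DDP
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

This would be launched with something like `torchrun --nproc_per_node=<num_gpus> train_ddp.py`; DDP then averages gradients across processes during `backward()`.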
-
Hi, I am using one machine with five 2080 Ti GPUs. During training, the step counter stops increasing once it reaches 10.
![image](https://user-images.githubusercontent.com/24452502/115810088-e4882080-a41f-11eb-8b63-d25a8eaf36c7.p…
-
### Is your feature request related to a problem? Please describe.
Models larger than the available GPU memory currently cannot be run for inference, even though parallel implementations exist for training.…
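For illustration only, one way to fit such a model is naive layer-wise model parallelism, where parts of the model live on different GPUs and only the activations move between them. A rough PyTorch sketch assuming two GPUs (the layer sizes and device split are made up):

```python
# Naive layer-wise model parallelism for inference: place different parts of the
# model on different GPUs and copy activations between them (illustrative only,
# requires at least two visible GPUs).
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network on GPU 0, second half on GPU 1 (assumed split).
        self.stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    @torch.no_grad()
    def forward(self, x):
        x = self.stage0(x.to("cuda:0"))
        x = self.stage1(x.to("cuda:1"))   # activations are copied between GPUs
        return x

model = TwoStageModel().eval()
out = model(torch.randn(8, 1024))
print(out.shape, out.device)  # torch.Size([8, 1024]) cuda:1
```

No single GPU has to hold all of the weights, at the cost of devices idling while the activations pass through the other stage.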
-
### What happened?
Launching
`HYDRA_FULL_ERROR=1 ANEMOI_BASE_SEED=1 anemoi-training train --config-name happy_little_config --config-dir=/pathToConfigs/config`
for training models in a multi-…
-
### System Info
Does training support distributed setups and larger models (e.g., compared to Qwen-72B)?
### Who can help?
@morning9393
### Information
- [X] The official example scripts
- [X] My own modified scripts
### Tasks
- [x] An offic…
-
**Describe the bug**
This is an issue I am having with keras-nlp, but I am not sure if it can be solved here or should be reported under keras or tensorflow.
Currently, the batch size is not calc…
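For reference, a minimal, generic Keras sketch (unrelated to the actual keras-nlp model in question) of how the batch size is interpreted under `tf.distribute.MirroredStrategy`: the value used to batch the dataset is the global batch size, which is split evenly across replicas.

```python
# Generic sketch: global vs. per-replica batch size under MirroredStrategy.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# The dataset is batched with the *global* batch size; each replica receives
# global_batch_size // num_replicas_in_sync examples per step.
global_batch_size = 64 * strategy.num_replicas_in_sync
x = tf.random.normal((1024, 8))
y = tf.random.normal((1024, 1))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch_size)
model.fit(dataset, epochs=1)
```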
-
Looking for a way to train alignn in a distributed fashion, I stumbled upon this package.
It looks really nice, but I could not get distributed training to work on Slurm.
One issue was that the t…
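In case it is useful, here is a generic (not alignn-specific) sketch of initializing `torch.distributed` from the environment variables Slurm exports for each `srun` task; the `master_addr` argument and default port are assumptions to be filled in from the allocation:

```python
# Generic sketch: initialise torch.distributed from the variables Slurm sets
# for each task launched with `srun` (global rank, world size, local GPU index).
import os
import torch
import torch.distributed as dist

def init_from_slurm(master_addr, master_port="29500"):
    rank = int(os.environ["SLURM_PROCID"])      # global rank of this task
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of tasks
    local_rank = int(os.environ["SLURM_LOCALID"])  # task index on this node

    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", master_port)

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    return rank, world_size, local_rank
```

Typically `master_addr` is the first hostname in `$SLURM_NODELIST` (e.g., obtained via `scontrol show hostnames`), and the script is started with `srun python train.py` inside the sbatch allocation.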
-
nohup: ignoring input
/root/miniconda3/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.u…
-
Hi,
Here is my Slurm file. I allocate four A100 cards with 64 GB of RAM.
#!/bin/bash
###
#SBATCH --time=72:00:00
#SBATCH --mem=64g
#SBATCH --job-name="lisa"
#SBATCH --partition=gpu
#SBATCH --gr…
-