-
## 🚀 Feature Request
Currently, StreamingDataset handles distributed data-parallel sharding by itself. This makes it incompatible with Trainers that handle data distribution themselves, such as transformers.Tra…
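The conflict can be illustrated with a toy sketch (plain Python; `rank_shard` is a hypothetical helper, not the real StreamingDataset or Trainer API): if the dataset already shards samples by rank internally and the Trainer then applies its own distributed sampler on top, each rank sees only 1/world_size² of the data.

```python
# Hypothetical sketch of double sharding. `rank_shard` stands in for
# the rank-based partitioning that both a streaming dataset and a
# trainer-side distributed sampler might each apply independently.

def rank_shard(samples, rank, world_size):
    """Rank r keeps indices r, r + world_size, r + 2*world_size, ..."""
    return samples[rank::world_size]

samples = list(range(8))
world_size = 2

# Dataset-side sharding (what a streaming dataset does internally):
dataset_view = rank_shard(samples, rank=0, world_size=world_size)
print(dataset_view)      # rank 0 correctly sees half the data

# A Trainer that also applies its own distributed sampler shards again:
double_sharded = rank_shard(dataset_view, rank=0, world_size=world_size)
print(double_sharded)    # rank 0 now sees only a quarter of the data
```

Only one of the two layers should perform the sharding; hence the request for a way to disable it on one side.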
-
### System Info
- CPU: i9 9900k
- GPU: RTX 4090
- TensorRT-LLM Version: 0.9.0.dev2024022000
- CUDA Version: 12.3
- Driver Version: 545.29.06
- OS: Arch Linux, kernel version 6.7.5
### …
-
### 🚀 The feature, motivation and pitch
Hi PyTorch maintainers,
I am currently training multiple large language models (LLMs) sequentially on a single GPU machine, utilizing FullShard…
-
Currently, compiled programs are not async, so they cannot be served efficiently from a Python server. It would be useful to merge the PRs that aim to add async support across the dspy library.
This coul…
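Until native async support lands, one interim pattern (a sketch, not DSPy's API; `compiled_program` is a placeholder name) is to offload the synchronous compiled program to a worker thread with `asyncio.to_thread`, so it does not block the server's event loop:

```python
import asyncio

def compiled_program(question: str) -> str:
    # Placeholder for a synchronous, possibly slow compiled-program call.
    return f"answer to: {question}"

async def handle_request(question: str) -> str:
    # Run the blocking call in a worker thread, off the event loop.
    return await asyncio.to_thread(compiled_program, question)

async def main() -> list:
    # Several requests can now be served concurrently despite the sync core.
    return await asyncio.gather(*(handle_request(q) for q in ["a", "b"]))

results = asyncio.run(main())
print(results)
```

This only sidesteps event-loop blocking; true async across the library would also let concurrent LM calls overlap their network I/O.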
-
### Prerequisite
- [X] I have searched [Issues](https://github.com/open-mmlab/mmengine/issues) and [Discussions](https://github.com/open-mmlab/mmengine/discussions) but cannot get the expected help.
…
-
When training Megatron-LM with flash-ckpt, the checkpoint cannot be saved successfully when `pipeline parallel` is set. It seems that not all checkpoint shards are saved to memory.
`Skip persisting the checkpoint of step 60 because the cached …
-
Is something wrong with my LLaMA-Factory version, or did I not install it correctly?
The full error output is as follows:
`
(langchain) zeng@zeng:~/llm/medical-chatbot$ sh run_training.sh
04/29/2024 15:43:19 - WARNING - llmtuner.hparams.parser - We recommend enable mixed p…
-
Concise Description:
I'd like to use JAX for distributed training of LLMs. In addition, the new Keras release supports JAX as a backend alongside TF.
Describe the solution you'd like
I'd …
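For context, the core of JAX data-parallel training is averaging per-shard gradients; a minimal hedged sketch (toy loss and hand-rolled averaging standing in for what `pmap`/`shard_map` would do across real devices, all names illustrative):

```python
import jax
import jax.numpy as jnp

# Toy scalar-weight regression loss; jax.grad gives its derivative in w.
def loss(w, x, y):
    pred = x * w
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.grad(loss)

# Simulate two data shards (one per "device") and average their gradients,
# which is the collective step pmap's psum/pmean performs on real hardware.
shards_x = [jnp.array([1.0, 2.0]), jnp.array([3.0, 4.0])]
shards_y = [jnp.array([2.0, 4.0]), jnp.array([6.0, 8.0])]
w = 1.0
grads = [grad_fn(w, x, y) for x, y in zip(shards_x, shards_y)]
avg_grad = sum(grads) / len(grads)
print(float(avg_grad))
```

Keras 3 would sit on top of this, selecting the JAX backend via the `KERAS_BACKEND` environment variable while JAX handles the device mesh underneath.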
-
### What happened + What you expected to happen
**What happened**
Our Ray job intermittently gets stuck. The job is submitted using the RayJob CRD. We use Ray Data to read the dataset and map ba…