-
## 🚀 Feature Request
Currently, StreamingDataset handles distributed data-parallel sharding by itself. This makes it incompatible with Trainers that handle data distribution themselves, such as transformers.Tra…
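The conflict can be illustrated with a toy sketch (plain Python; `rank_shard` is a hypothetical helper, not the real StreamingDataset or Trainer API): if the dataset already shards samples by rank internally and the Trainer then applies its own distributed sampler on top, each rank sees only 1/world_size² of the data.

```python
# Hypothetical sketch of double sharding. `rank_shard` stands in for
# the rank-based partitioning that both a streaming dataset and a
# trainer-side distributed sampler might each apply independently.

def rank_shard(samples, rank, world_size):
    """Rank r keeps indices r, r + world_size, r + 2*world_size, ..."""
    return samples[rank::world_size]

samples = list(range(8))
world_size = 2

# Dataset-side sharding (what a streaming dataset does internally):
dataset_view = rank_shard(samples, rank=0, world_size=world_size)
print(dataset_view)      # rank 0 correctly sees half the data

# A Trainer that also applies its own distributed sampler shards again:
double_sharded = rank_shard(dataset_view, rank=0, world_size=world_size)
print(double_sharded)    # rank 0 now sees only a quarter of the data
```

Only one of the two layers should perform the sharding; hence the request for a way to disable it on one side.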
-
### System Info
- CPU: i9 9900k
- GPU: RTX 4090
- TensorRT-LLM Version: 0.9.0.dev2024022000
- CUDA Version: 12.3
- Driver Version: 545.29.06
- OS: Arch Linux, kernel version 6.7.5
### …
-
### 🚀 The feature, motivation and pitch
Hi PyTorch maintainers,
I am currently training multiple large language models (LLMs) sequentially on a single GPU machine, utilizing FullShard…
-
Currently, compiled programs are not async, so they cannot be served efficiently from a Python server. It would be useful to merge the PRs that aim to add async support across the dspy library.
This coul…
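Until native async support lands, one interim pattern (a sketch, not DSPy's API; `compiled_program` is a placeholder name) is to offload the synchronous compiled program to a worker thread with `asyncio.to_thread`, so it does not block the server's event loop:

```python
import asyncio

def compiled_program(question: str) -> str:
    # Placeholder for a synchronous, possibly slow compiled-program call.
    return f"answer to: {question}"

async def handle_request(question: str) -> str:
    # Run the blocking call in a worker thread, off the event loop.
    return await asyncio.to_thread(compiled_program, question)

async def main() -> list:
    # Several requests can now be served concurrently despite the sync core.
    return await asyncio.gather(*(handle_request(q) for q in ["a", "b"]))

results = asyncio.run(main())
print(results)
```

This only sidesteps event-loop blocking; true async across the library would also let concurrent LM calls overlap their network I/O.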
-
### Prerequisite
- [X] I have searched [Issues](https://github.com/open-mmlab/mmengine/issues) and [Discussions](https://github.com/open-mmlab/mmengine/discussions) but cannot get the expected help.
…
-
When training Megatron-LM with flash-ckpt, the checkpoint cannot be saved successfully when `pipeline parallel` is set. It seems that not all checkpoint shards are saved to memory.
`Skip persisting the checkpoint of step 60 because the cached …
-
Is something wrong with my LLaMA-Factory version, or did I not install it correctly?
The full error output is as follows:
`
(langchain) zeng@zeng:~/llm/medical-chatbot$ sh run_training.sh
04/29/2024 15:43:19 - WARNING - llmtuner.hparams.parser - We recommend enable mixed p…
-
Concise Description:
I'd like to use JAX for distributed training of LLMs. In addition, the new Keras release supports JAX as a backend alongside TF.
Describe the solution you'd like
I'd …
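For context, the core of JAX data-parallel training is averaging per-shard gradients; a minimal hedged sketch (toy loss and hand-rolled averaging standing in for what `pmap`/`shard_map` would do across real devices, all names illustrative):

```python
import jax
import jax.numpy as jnp

# Toy scalar-weight regression loss; jax.grad gives its derivative in w.
def loss(w, x, y):
    pred = x * w
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.grad(loss)

# Simulate two data shards (one per "device") and average their gradients,
# which is the collective step pmap's psum/pmean performs on real hardware.
shards_x = [jnp.array([1.0, 2.0]), jnp.array([3.0, 4.0])]
shards_y = [jnp.array([2.0, 4.0]), jnp.array([6.0, 8.0])]
w = 1.0
grads = [grad_fn(w, x, y) for x, y in zip(shards_x, shards_y)]
avg_grad = sum(grads) / len(grads)
print(float(avg_grad))
```

Keras 3 would sit on top of this, selecting the JAX backend via the `KERAS_BACKEND` environment variable while JAX handles the device mesh underneath.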
-
### What happened + What you expected to happen
**What happened**
Our Ray job intermittently gets stuck. The job is submitted using the RayJob CRD. We use Ray Data to read the dataset and map ba…