distributed-llm Search Results

1000+ results for distributed-llm

Best match

tensorflow/models #11142

Parameter server can not run

environment: CUDA 1.17, tensorflow 2.14; code: https://github.com/tensorflow/models/blob/master/official/recommendation/ncf_keras_main.py; command: python3 /LLM/models/official/recommendation/…

xiaobai52HZ updated 9 months ago
2
PygmalionAI/aphrodite-engine #774

[Installation]: AMD MI60 (gfx906) installation errors with R…

### Your current environment ```sh python env.py Collecting environment information... PyTorch version: 2.6.0.dev20241011+rocm6.2 Is debug build: False CUDA used to build PyTorch: N/A ROCM us…

Said-Akbar updated 3 weeks ago
10
pytorch/torchtune #1107

Save intermediate checkpoints during training

Is there any way to save intermediate checkpoints during training? Sometimes my training fails midway for external reasons, so it would be helpful to save every N steps so I can contin…

l3utterfly updated 3 months ago
11
ray-project/ray #48556

[core][compiled-graphs] A MPMD Graph controller focus on N-M…

### Description ### **Concept introduction** The fact that SPMD has no scheduling overhead gives it the best performance, but it is often not easy enough to develop complex training tasks. For exa…

MoFHeka updated 3 days ago
5
vllm-project/vllm #5664

[Usage]: Does class LLM support inference quantization on CP…

### Your current environment Hey Team, I was experimenting with class **LLM** using gptq_marlin on the GPU, and it is incredibly fast. However, when I tried running it on the CPU, it seems that …

rsong0606 updated 2 weeks ago
2
PaddlePaddle/PaddleNLP #8593

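[Question]: When running chatglm2 LoRA fine-tuning with pipeline parallel: 4, error modu…

### Please ask your question - Prerequisites: the chatglm2 LoRA fine-tuning code already runs on a single machine with a single GPU; multi-GPU pipeline-parallel (pp) training of llama also runs. - Scenario: I want to go a step further and try single-machine multi-GPU training, so I set "pipeline_parallel_degree": 4 in the /chatglm2/lora_argument.json config file and then, following the example on the official site, launched with the command line: `srun --gres=gpu:4 python3 -u …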

shanyuaa updated 3 weeks ago
5
vllm-project/vllm #5060

[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Ba…

### Your current environment docker image: vllm/vllm-openai:0.4.2 Model: https://huggingface.co/alpindale/c4ai-command-r-plus-GPTQ GPUs: RTX8000 * 2 ### 🐛 Describe the bug The model works f…

heungson updated 2 weeks ago
40
meta-llama/codellama #60

torchrun --nproc_per_node 2 example_instructions.py --ckpt_d…

WARNING:torch.distributed.run: ***************************************** Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, plea…

alvynabranches updated 1 year ago
2
flowersteam/Grounding_LLMs_with_online_RL #11

How to run the train_language_agent.py without using slurm

Hi, because I don't know how to use slurm, I tried to run train_language_agent.py directly with the command from lamorel `python -m lamorel_launcher.launch --config-path /home/yanxue/Groun…

yanxue7 updated 1 year ago
2
vllm-project/vllm #8881

[Bug]: assert len(self._async_stopped) == 0

### Your current environment The output of `python collect_env.py` ```text # For security purposes, please feel free to check the contents of collect_env.py before running it. python collect_e…

sfc-gh-zhwang updated 3 weeks ago
6
