-
### System Info
```shell
accelerate 1.1.1
neuronx-cc 2.14.227.0+2d4f85be
neuronx-distributed 0.8.0
neuronx-distributed-training 1.0.0
optimum …
-
Hi~
First of all, thank you very much for your open-source work, and apologies for still needing your help with SimCSE-related issues in 2024.
I am trying to run supervised SimCSE-BERT-base training on four 24GB 4090 GPUs.
The .sh script I used is below (I only replaced `torch.distributed.launch` with `torchrun`):
```
#!/bin/bash
# In this example,…
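# (Illustrative sketch, not part of the original script, which is truncated
#  above: the torch.distributed.launch -> torchrun swap described in the text
#  usually amounts to a line like the following; flags and paths are placeholders.)
NUM_GPU=4
torchrun --nproc_per_node $NUM_GPU train.py \
    --model_name_or_path bert-base-uncased \
    --output_dir result/my-sup-simcse-bert-base-uncased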
-
Paper link: https://ieeexplore.ieee.org/document/9378798
This paper presents distributed deep learning algorithms applied to remote-sensing image data and emphasizes the advantages that a cloud-service architecture offers over other parallel and distributed architectures (cluster computing and grid computing) for remote-sensing image data management, computation, and service provisioning.
-
### systemRole
Key Attributes:
Kernel Engineering Visionary:
Leads the development of real-time kernels, enabling systems for high-frequency trading, robotics, and mission-critical applications…
-
https://102.alibaba.com/fund/proposalTopicDetail.htm?id=1120
Topics of the Alibaba Fund
-
### System Info
- Platform: Linux-5.10.227-219.884.amzn2.x86_64-x86_64-with-glibc2.26
- Python version: 3.10.15
- PyTorch version: 2.5.1
- CUDA device(s): Tesla T4, Tesla T4, Tesla T4, Tesla T4
-…
-
When using distributed operation, I have four GPUs, each of which runs one client. During training, the GPUs differ hugely in memory usage; two of them even ran out of memory. By the way, I also found t…
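A quick way to quantify that imbalance (a hypothetical diagnostic step, not from the original report) is to watch per-GPU memory while the job runs:

```shell
# Print index, used memory, and total memory for every GPU once per second
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```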
-
### Software Environment
```Markdown
- paddlepaddle:
- paddlepaddle-gpu: 3.0.0b1
- paddlenlp: https://github.com/ZHUI/PaddleNLP/tree/sci/benchmark
```
### Duplicate Issue
- [X] I have searched the existing issues
### Error Descr…
-
-
Hello!
I am currently trying to fine-tune a Llama 3.1 70B Nemotron Instruct LLM with LoRA, tweaking the Llama 3.1 70B LoRA configs a bit.
According to the memory stats required by torchtune, …
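For reference, the config-tweaking workflow described above typically goes through the torchtune CLI; a minimal sketch is below (the recipe and config names are assumed from the stock Llama 3.1 70B LoRA setup, and the GPU count is illustrative):

```shell
# Copy the stock 70B LoRA config so it can be edited locally
tune cp llama3_1/70B_lora my_70B_nemotron_lora.yaml

# Launch the distributed LoRA recipe with the tweaked config
tune run --nproc_per_node 4 lora_finetune_distributed --config my_70B_nemotron_lora.yaml
```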