distributed-machine-learning Search Results

1000+ results
for distributed-machine-learning


huggingface/transformers #34467

Assert error in convert_llava_onevision_weights_to_hf.py

### System Info - `transformers` version: 4.46.0 - Platform: Linux-5.15.0-97-generic-x86_64-with-glibc2.35 - Python version: 3.12.3 - Huggingface_hub version: 0.26.1 - Safetensors version: 0.4.…

FuryMartin updated 4 days ago
20
huggingface/alignment-handbook #22

How to perform full parameter finetuning without A100 GPUs

Hi, thank you for your great work! I'd like to reproduce full parameter fine-tuning of dpo training. However I only have 10 * Nvidia A40 GPUs (46 GB memory each). I tried the command `CUDA_VI…

ChenDRAG updated 9 months ago
13
pytorch/pytorch #105840

[FSDP] FSDP doesn't work (random accuracy performance) when …

### 🐛 Describe the bug Currently, when using FSDP, the model is loaded for each of the N processes completely on CPU leading to huge CPU RAM usage. When training models like Falcon-40B with FSDP on…

pacman100 updated 15 hours ago
19
huggingface/accelerate #3168

dataloader doesn't load data while gpu is training

### System Info ```Shell Copy-and-paste the text below in your GitHub issue - `Accelerate` version: 1.0.0 - Platform: Linux-6.10.11-amd64-x86_64-with-glibc2.40 - `accelerate` bash location: /dis…

geekifan updated 2 weeks ago
4
DaloroAT/first_breaks_picking #35

Python > 3.10

Hi, Are there any reasons why it doesn't work with python > 3.10? For example when trying to run `pip install first-breaks-picking-gpu` I get error: ``` ERROR: Ignored the following versions tha…

kerim371 updated 1 year ago
8
opensearch-project/skills-eval #7

bert_score-0.3.13-py3-none-any.whl: 12 vulnerabilities (high…

Vulnerable Library - bert_score-0.3.13-py3-none-any.whl Path to dependency file: /packages/bert/requirements.txt Path to vulnerable library: /packages/bert/requirements.txt Found in HEAD commit:…

mend-for-github-com[bot] updated 3 months ago
1
pytorch/pytorch #121594

[DDP] Gradient Synchronization Failure Induced by model.grad…

### 🐛 Describe the bug Hello, when I am using DDP to train a model, I found that using multi-task loss and gradient checkpointing at the same time can lead to gradient synchronization failure betwe…

1azybug updated 5 months ago
1
jupyterlab/jupyterlab #7506

help-extension: Enable allow-same-origin to fix broken searc…

## Description Relatively minor, but explicitly omitting `allow-same-origin` from the help widget iframe `sandbox` attribute in packages/help-extension breaks search pages on many reference documen…

jkromwijk updated 1 year ago
3
microsoft/unilm #1180

TextDiffuser - When does the model starts to predict plausib…

**Describe** Model I am using : TextDiffuser Hi, thanks for the great work. I'm trying to train the model on the portion of Mario-Laion image dataset (~50k images). But currently the images generat…

other-ones updated 1 year ago
12
pytorch/pytorch #49440

[RFC] DataLoader architecture updates and TarDataset impleme…

# DataLoader architecture updates and TarDataset implementation # Problem statement This proposal aims to construct a modular, user-friendly, and performant toolset to address the ambiguous activi…

VitalyFedyunin updated 3 years ago
50

