-
### System Info
```Shell
accelerate 0.31.0
Ubuntu 22.04 (WSL)
python=3.10.14
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] …
-
I was trying to fine-tune Meta-Llama-3-8B-Instruct using 4 GPUs with the following command:
`torchrun --nproc_per_node 4 -m training.run --output_dir llama3test --model_name_or_path meta-llama/Met…
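For reference, a minimal sketch of what each of the four processes launched by `torchrun --nproc_per_node 4` sees; the script below is only illustrative (it is not `training.run`) and assumes the standard env-var rendezvous that torchrun provides:
```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # 0..3 with --nproc_per_node 4
    torch.cuda.set_device(local_rank)           # bind this process to one GPU
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} on cuda:{local_rank}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```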
-
![image](https://github.com/user-attachments/assets/be98d5b2-f2aa-41aa-977e-15a7436f2727)
Why did this error appear when I ran the optimize.py file? I simply want to skip distributed training.
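Since optimize.py itself is not shown, here is only a hedged sketch of the usual guard for skipping distributed setup when the script is run as a single plain Python process (the function name is hypothetical):
```python
import os
import torch.distributed as dist

def maybe_init_distributed() -> bool:
    # torchrun and similar launchers set WORLD_SIZE; a plain `python optimize.py`
    # run leaves it unset, so this guard skips distributed init entirely.
    if int(os.environ.get("WORLD_SIZE", "1")) > 1:
        dist.init_process_group(backend="nccl")
        return True
    return False
```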
-
### 🐛 Describe the bug
Hello,
I'm a new user of PyTorch and recently tried to run the Flight Recorder code provided in the tools. But I cannot get the code to execute as expected.
I use ngc 24.10…
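For what it's worth, a minimal sketch of enabling the NCCL flight recorder before the process group is created; the environment variable names are assumed from the PyTorch flight recorder documentation and may differ in the NGC 24.10 container:
```python
import os

# These must be set before torch.distributed.init_process_group creates the
# NCCL process group, otherwise the recorder captures nothing.
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")  # keep the last 2000 collectives
os.environ.setdefault("TORCH_NCCL_DUMP_ON_TIMEOUT", "1")       # dump traces on watchdog timeout

import torch.distributed as dist
dist.init_process_group(backend="nccl")
```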
-
```
Executing Cell 19--------------------------------------
INFO:notebook:Training the model...
INFO:training:Using cuda:0 of 1
INFO:training:[config] ckpt_folder -> ./temp_work_dir/./models.
…
```
-
```
RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-13-21-05 (UTC+0000), pid 2003528, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/t…
```
-
Your `build_dataloader`:
```python
if phase == 'train':
    if dist:  # distributed training
        batch_size = dataset_opt['batch_size_per_gpu']
        num_workers = dataset_opt['num_worke…
```
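For comparison, a hedged sketch of how a distributed train dataloader is usually built with `DistributedSampler`; the `dataset_opt` key names below are illustrative, not taken from your snippet:
```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_train_dataloader(dataset, dataset_opt, dist=False):
    # In distributed mode each rank reads a disjoint shard of the dataset,
    # so batch_size_per_gpu really is the per-process batch size.
    sampler = DistributedSampler(dataset, shuffle=True) if dist else None
    return DataLoader(
        dataset,
        batch_size=dataset_opt['batch_size_per_gpu'],
        shuffle=(sampler is None),
        sampler=sampler,
        num_workers=dataset_opt.get('num_workers', 4),  # illustrative key name
        pin_memory=True,
    )
```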
-
During distributed training, I encountered the following problem when compiling Triton kernels:
```
Traceback (most recent call last):
......
File "/mnt/petrelfs/caoweihan/anaconda3…
-
Hi there,
Just as TensorFlow's Python API has `tf.distribute`, what is the equivalent in the Rust version?
Thanks
-
We have a guide on doing distributed training
w/ Vast here: https://docs.google.com/document/d/1W_dN3qarCOcLRDdEZ75LBtkLGiwUziWWDtVTjd43Ad4/edit?usp=sharing . However, we have not performed full dis…