-
### 🐛 Describe the bug
torchrun multi-machine, multi-card training error.
Both Rank1 and Rank2 can train normally on their own.
The error occurs after the NCCL communication is successfully established…
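For context, a minimal sketch of the kind of setup being described, assuming a launch via torchrun's env:// rendezvous; the two-node launch command, the endpoint address, and the `all_reduce` probe below are illustrative assumptions, not from the report:
```python
# Hedged sketch, not the reporter's script. Assumes a launch like:
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<master-ip>:29500 train.py
# where <master-ip> is a placeholder.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE, so the default
    # env:// rendezvous is sufficient here.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # A single all_reduce exercises the inter-node NCCL link; errors that
    # appear right after a successful rendezvous usually surface here.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value={t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```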
-
Hello! I am training the first two knowledge distillation stages of Mamba 2 on one DGX-H100x8 node, and I am seeing training times of ~8 hours for the first stage and ~13 hours for the second stag…
-
I am experimenting with the tutorial below:
- https://github.com/FedML-AI/FedML/blob/master/fedml_experiments/distributed/fedavg/README.md
Run the following shell script and dump the arguments it receives:
```b…
-
**Is your feature request related to a problem? Please describe.**
I calculated correlation coefficients on datasets of 90-180 GB using xarray and Dask distributed, and experie…
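For reference, a minimal sketch of this kind of workflow; the file pattern, the variable names `a` and `b`, the `time` dimension, and the chunk size are all placeholder assumptions, not from the report:
```python
# Hedged sketch of a correlation computation over dask-backed data.
import xarray as xr
from dask.distributed import Client

client = Client()  # local dask-distributed cluster for testing

# chunks=... makes the variables dask-backed, so nothing loads eagerly.
ds = xr.open_mfdataset("data/*.nc", chunks={"time": 100})

# xr.corr builds a lazy Pearson-correlation graph; compute() runs it on
# the distributed cluster, which is where memory pressure would show up.
corr = xr.corr(ds["a"], ds["b"], dim="time").compute()
```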
-
**Describe the bug**
I am trying to fine-tune the `llama3-8B` model on multiple nodes, but I get an AttributeError after the mcore-format checkpoint finishes loading and dataset building starts; the error is below:
…
-
Hey,
I have seen the previous issues. Based on those, I tracked down the approximate line where the pipeline gets stuck: the setup function, where it fails to load the model. The training is n…
-
Recently I used `open_mfdataset` to open a local tar.gz archive of multiple netCDF files;
it failed to open and raised a `distributed.scheduler.KilledWorker` error and a
`TypeError: cannot serialize…
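One hedged workaround sketch (the archive and member paths below are placeholders): netCDF engines need seekable files on disk, and open file handles generally cannot be pickled to Dask workers, so extracting the archive first and pointing `open_mfdataset` at the extracted files sidesteps both problems.
```python
# Hedged workaround sketch; paths are placeholders.
import glob
import tarfile
import xarray as xr

# Extract once; netCDF readers need real, seekable files on disk.
with tarfile.open("files.tar.gz", "r:gz") as tar:
    tar.extractall("extracted")

# Open the extracted members; plain paths serialize cleanly to workers.
ds = xr.open_mfdataset(sorted(glob.glob("extracted/*.nc")),
                       combine="by_coords")
```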
-
(Raised a similar [issue](https://github.com/NVIDIA/Megatron-LM/issues/450) in the Megatron repo, but I think it might be more appropriate here, so I'm adding more details)
I am trying to run Mega…
-
### Describe the bug
I run the training but get the following error:
### Reproduction
Run `accelerate config`
```
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: '…
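# (Hedged note, not part of the reporter's config: `accelerate config`
# writes a file like the one above to its default location, and training
# is then typically launched with `accelerate launch train.py`, where
# train.py is a placeholder for the actual script.)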
-
#### Describe the functionality you would like to see.
I would like to add information to the documentation about which file loaders support the `dask-distributed` backend. Mostly just add an extra …
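As a sketch of what such a docs entry could demonstrate (the file name, variable layout, and chunk size below are illustrative assumptions): a loader "supports" the backend when the objects it returns can be scheduled through a `dask.distributed` client.
```python
# Hedged illustration; sample.nc and the chunk size are placeholders.
from dask.distributed import Client
import xarray as xr

client = Client()  # connect to, or start, a distributed scheduler

# With chunks=..., open_dataset returns dask-backed variables whose
# tasks the distributed client can execute across workers.
ds = xr.open_dataset("sample.nc", chunks={"time": 100})
result = ds.mean().compute()  # runs on the cluster, not in-process
```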