-
## 🐛 Bug
I ordered my training data in a specific manner and passed it to the DataLoader with `shuffle=False` (I use `reload_dataloaders_every_n_epochs=1` so I can control the ordering every epoch). Then, I found ou…
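For context, a minimal sketch of the setup being described, assuming a PyTorch Lightning training loop (`reload_dataloaders_every_n_epochs` is a Lightning `Trainer` argument); the datamodule, data, and batch size below are illustrative:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class OrderedData(pl.LightningDataModule):
    def train_dataloader(self):
        # Called again at each reload; the data is arranged in a deliberate
        # order and no shuffling is requested at the DataLoader level.
        ordered = TensorDataset(torch.arange(1000, dtype=torch.float32).unsqueeze(1))
        return DataLoader(ordered, batch_size=32, shuffle=False)


# Rebuild the dataloader at the start of every epoch.
trainer = pl.Trainer(max_epochs=5, reload_dataloaders_every_n_epochs=1)
# trainer.fit(model, datamodule=OrderedData())  # model: the LightningModule being trained
```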
-
# Motivation
Shuffles are an integral part of many distributed data manipulation algorithms. Common DataFrame operations that rely on shuffling include `sort`, `merge`, `set_index`, and various groupb…
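For illustration, a small sketch of Dask DataFrame operations of this kind (the frame and keys are arbitrary); each call can trigger a shuffle, i.e. an all-to-all repartitioning of rows across workers:

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(
    pd.DataFrame({"key": [3, 1, 2, 1, 3], "value": range(5)}),
    npartitions=2,
)

df.set_index("key")             # repartitions rows by the new index
df.merge(df, on="key")          # co-locates matching keys across partitions
df.groupby("key").value.mean()  # may shuffle when aggregating many groups
df.sort_values("value")         # a global sort requires a shuffle
```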
-
## Environment
- mosaicml-streaming==0.7.5
## To reproduce
Steps to reproduce the behavior:
1. Use `StreamingDataset` in distributed training with the same seed and set `replication` either …
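A hedged sketch of what step 1 might look like; the remote/local paths, seed, and `replication` value below are placeholders, since the exact settings are truncated above:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Same shuffle seed on every rank; replication makes groups of consecutive
# ranks receive the same samples (placeholder value here).
dataset = StreamingDataset(
    remote="s3://my-bucket/mds-data",  # placeholder remote path
    local="/tmp/mds-cache",            # placeholder local cache
    shuffle=True,
    shuffle_seed=1234,
    batch_size=8,
    replication=2,
)
loader = DataLoader(dataset, batch_size=8)
```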
-
Shuffle is a key workload for stressing Ray core's distributed dataplane. For large datasets, it requires all-to-all communication and spilling to disk. Thus, shuffle stresses the object transfer and …
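For illustration, a minimal Ray Data shuffle workload of this kind (dataset size is arbitrary):

```python
import ray

ray.init()

# random_shuffle() is an all-to-all operation: every output block can pull rows
# from every input block, exercising object transfer and, for large datasets,
# spilling to disk.
ds = ray.data.range(100_000_000)
shuffled = ds.random_shuffle()
shuffled.materialize()
```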
-
**When I am training on my custom datasets, I encounter an error at the beginning of epoch 72.**
nohup: ignoring input
W1111 14:07:33.188000 2366719 site-packages/torch/distributed/run.py:793]
…
-
## 🚀 Feature
**Motivation**
* To avoid pitfalls with shuffling and sharding of datapipes in distributed training environments (see the sketch after this list)
* To ensure a consistent experience of TorchData-based datasets ac…
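To make the first bullet concrete, a minimal sketch of the ordering that is usually recommended for datapipes consumed across multiple ranks/workers (the pipeline below is illustrative):

```python
from torchdata.datapipes.iter import IterableWrapper

# Shuffle before sharding so every rank draws from the same permutation and the
# resulting shards are disjoint; reversing the order is the classic pitfall.
pipe = IterableWrapper(range(1000))
pipe = pipe.shuffle()
pipe = pipe.sharding_filter()
```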
-
**Describe the bug**
The log settings defined by `logging_on()`, and therefore by any trollflow2 process, are not inherited by tasks scheduled using dask.distributed when called inside an `if __nam…
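As context, a hedged sketch of the usual workaround pattern: logging configured in the main process is not propagated to dask workers, so it has to be re-applied on each worker, here via `Client.run` (`configure_worker_logging` is a hypothetical stand-in for what `logging_on()` sets up):

```python
import logging

from dask.distributed import Client


def configure_worker_logging():
    # Hypothetical stand-in for the configuration performed in the main process.
    logging.basicConfig(level=logging.DEBUG)


if __name__ == "__main__":
    client = Client()                     # local cluster for illustration
    client.run(configure_worker_logging)  # execute the setup on every worker
```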
-
Hello,
We successfully fine-tuned the Mistral7b_v0.3 Instruct model using a single GPU, but we encountered issues when trying to utilize multiple GPUs.
The successful fine-tuning with one GPU (A…
-
The documentation for the `shuffle` parameter of the `dask.dataframe.DataFrame.set_index` method says:
- "Either 'disk' for single-node operation or 'tasks' for distributed operation. Will be inferred by your curr…
-
### Problem Description
On the Llama3 70B proxy model, training stalls and dumps GPU cores. The GPU core dumps are 41 GB per GPU, so I am unable to send them. It is probably easier for you all to reproduce this er…