-
Hello,
I've run the pipeline without HTCondor up to the `processing results` part (which I assume is not currently possible without running the pipeline in HTCondor unless I write a custom scrip…
-
Hi, I am facing the error message described below while training on my RTX 4090 GPU. I've adjusted the frame number to avoid exceeding the memory limit and left the remaining code unchanged. How…
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/OpenAccess-AI-Collective/axolotl/labels/bug) and didn't find any similar reports…
-
Reporting from the `idea-pool` channel on Slack, as discussed with @carmocca.
---
Hi there,
While trying to solve an OOM problem with dynamic batch sizes based on sequence length, I have just d…
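One common way to keep dynamic batches from going OOM is to cap the padded token count per batch rather than the sample count. A minimal sketch of that idea (the function name and the token budget are illustrative, not from any particular library):

```python
def batch_by_tokens(lengths, max_tokens):
    """Group sample indices so each batch's padded size
    (batch size * longest sequence in the batch) stays under max_tokens.

    Sorting by length first keeps sequences of similar size together,
    which minimizes wasted padding. A single sequence longer than
    max_tokens still gets its own batch.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, batch, longest = [], [], 0
    for i in order:
        longest = max(longest, lengths[i])
        if batch and (len(batch) + 1) * longest > max_tokens:
            batches.append(batch)
            batch, longest = [], lengths[i]
        batch.append(i)
    if batch:
        batches.append(batch)
    return batches

# Example: four sequences of lengths 5, 3, 9, 2 with a budget of 10 tokens.
batches = batch_by_tokens([5, 3, 9, 2], max_tokens=10)  # → [[3, 1], [0], [2]]
```

Regenerating (and optionally shuffling) the batches each epoch keeps the memory bound while avoiding a fixed batch size.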
-
/root/miniconda3/bin/python: can't open file 'main_simmim.py--cfg': [Errno 2] No such file or directory
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 19…
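The error suggests the script name and its flag were fused into one argument (`main_simmim.py--cfg`). A frequent cause is a backslash line continuation with no space before it in the launch command. A small demonstration of that shell behavior (the file names here are just for illustration):

```shell
# Without a space before the backslash, the shell joins the lines into ONE word,
# so Python looks for a file literally named "main_simmim.py--cfg":
echo main_simmim.py\
--cfg
# → main_simmim.py--cfg

# With a space before the backslash, the tokens stay separate arguments:
echo main_simmim.py \
--cfg
# → main_simmim.py --cfg
```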
-
Data is increasing dramatically. Distributed training is a trend. I wonder if there is any plan to support this.
-
As reported in #36870, master has been broken for the `USE_DISTRIBUTED=0` compile flag for a period of time. Based on feedback from offline discussions, `USE_DISTRIBUTED=0` is very useful for applicat…
-
If the datasampler could be rewritten as a standard PyTorch `DataLoader`, we could more easily integrate it with other deep learning frameworks such as PyTorch Lightning and Horovod. Both facilitate multi-gpu tr…
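One way such an integration could look: if the existing datasampler can emit lists of indices per batch, a standard `DataLoader` can consume them via its `batch_sampler` argument, which is exactly what Lightning and Horovod expect. A minimal sketch (the class and variable names are illustrative, not the project's actual API):

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler

class ToyDataset(Dataset):
    """Stand-in for the real dataset."""
    def __init__(self, n):
        self.data = torch.arange(n, dtype=torch.float32)
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        return self.data[i]

class ListBatchSampler(Sampler):
    """Wrap precomputed index batches (e.g. the output of a custom
    datasampler) so a plain DataLoader can iterate over them."""
    def __init__(self, batches):
        self.batches = batches
    def __iter__(self):
        return iter(self.batches)
    def __len__(self):
        return len(self.batches)

# Hypothetical output of the custom datasampler: two variable-size batches.
batches = [[0, 1], [2, 3, 4]]
loader = DataLoader(ToyDataset(5), batch_sampler=ListBatchSampler(batches))
sizes = [len(b) for b in loader]  # → [2, 3]
```

Because the result is an ordinary `DataLoader`, it can be returned directly from a Lightning `train_dataloader()` hook or wrapped with Horovod's distributed samplers.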
-
I have a question about training federated models.
I don't understand the difference between the Custom Dataset and distributed training for a federated model.
Am I correct in assuming that th…
-
**Is your feature request related to a problem? Please describe.**
Could Cellpose use something like the [SageMaker SDK](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html)
to…