-
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See er…
-
### What happened?
When letting PySR run indefinitely, it eventually hits an OOM error, but only when using a large dataset. I can watch its memory usage grow steadily.
I used th…
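Not part of the report above, but a stdlib-only way to confirm the steady growth numerically while a long-running process executes (note the units of `ru_maxrss` are platform-dependent: KiB on Linux, bytes on macOS):

```python
import resource

def peak_rss() -> int:
    """Peak resident set size of the current process.

    Units are platform-dependent: KiB on Linux, bytes on macOS.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
junk = [bytearray(1024) for _ in range(50_000)]  # simulate ~50 MiB of allocations
after = peak_rss()
print(f"peak RSS grew from {before} to {after}")
```

Logging this periodically from the driver process makes it easy to distinguish steady growth (a leak or unbounded cache) from a one-off allocation spike.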
-
### Context
We'd like to scale conda store workers up to allow many solves at the same time. See [here](https://github.com/nebari-dev/nebari/issues/2284) for more info. However, the conda store w…
-
**Describe the issue**:
Registering a WorkerPlugin subclass with `Client.register_plugin` raises an error. `Client.register_worker_plugin` works fine, but warns of deprecation.
My real use c…
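For comparison, a minimal sketch of the two registration paths (the `InitWorker` plugin and its contents are illustrative, not from the report; assumes a recent `distributed` release where `Client.register_plugin` exists):

```python
from dask.distributed import Client, WorkerPlugin

class InitWorker(WorkerPlugin):
    """Illustrative plugin: attach per-worker state at startup."""
    name = "init-worker"

    def setup(self, worker):
        worker.custom_state = {}  # e.g. load a model or open a connection

# client = Client()                            # connect to a cluster
# client.register_plugin(InitWorker())         # newer, generic API
# client.register_worker_plugin(InitWorker())  # older API; emits a DeprecationWarning
```

The registration calls are commented out so the sketch stands alone without a running cluster; the reported bug is that the first (non-deprecated) call fails for `WorkerPlugin` subclasses while the second still works.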
-
### 🐛 Describe the bug
I train the llama2-70b model on 4×8 H100 (80 GB) GPUs with the "gemini" plugin. Training runs normally, but an "OOM" error occurs when saving the model.
Here is the log:
`Epoch 0: 0%| …
-
With Llama 3 8B, inference works; however, the api and chat modes do not: they produce a segmentation fault.
```shell
sudo nice ./dllama chat --model models/llama3_8b_instruct_q40/dllama_model_llama3_8b_instruc…
-
## 🐛 Bug
Was trying to launch a distributed job with 2 nodes, each with 4 GPUs, using fairseq-hydra-train. Single-node multi-GPU training using fairseq-hydra-train without `torch.distributed.run` can run success…
hannw updated 3 years ago
-
**What would you like to be added**: A method to provide automatically provisioned ReadWriteMany PVs that are available on all workers.
Currently the storage provisioner that is being used can on…
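A hedged sketch of what the requested behavior would look like from the user side: a claim with `ReadWriteMany` access backed by some RWX-capable provisioner (the names and `storageClassName` here are placeholders, not existing resources):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-workspace            # illustrative name
spec:
  accessModes:
    - ReadWriteMany                 # mountable read-write by pods on all workers
  resources:
    requests:
      storage: 10Gi
  storageClassName: rwx-provisioner # placeholder: any RWX-capable class (e.g. NFS- or CephFS-backed)
```

With such a class available, any pod on any worker could mount the same claim concurrently, which is exactly what a single-node provisioner cannot offer.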
-
### 🐛 Describe the bug
Fully Sharded Data Parallel (FSDP) is a wrapper for sharding module parameters across data parallel workers. It supports various sharding strategies for distributed training …
-
Huawei’s OPEN submission utilized local SSDs on the benchmark hosts, with an aggregated capacity of 12 TiB per node.
The submission indicates that 7 accelerators (H100, 3D-UNet) were simulated per be…