-
### 🐛 Describe the bug
Here is my code:
```
....
tensor = torch.rand(1, 1, 4096, dtype=torch.float16)
torch.distributed.all_reduce(tensor)
```
When I run this code, I get an error.
```
…
-
## Intro
Modern applications are distributed systems composed of numerous services that handle high volumes of requests to the application. Oftentimes, multiple services are involved in handling a …
-
**Description**
Option to force synchronization to the storage device. Since the only guarantee from the OS when writing files is that the OS has accepted the change, it would be nice to allow forcing sy…
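
To illustrate the gap this request describes, here is a minimal Python sketch of forcing a write through to the device. It is a generic illustration and not this project's API: the helper name `write_durably` is hypothetical, and the directory sync is an assumption that only applies on POSIX systems.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Hypothetical helper: write data, then ask the OS to flush it to storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the OS cache
        os.fsync(f.fileno())  # ask the OS to push its cached data to the device

    # Assumption: on POSIX, syncing the parent directory also persists the
    # directory entry of a newly created file.
    if os.name == "posix":
        dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Without the `os.fsync` call, a successful `write` only means the OS has accepted the data, which may still sit in the page cache. Syncing costs latency, which is probably why it fits better as an opt-in option than as a default.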
-
### Describe the bug
The long-lived circuits of Blazor server make distributed tracing not work as expected.
Since each circuit is effectively a long-lived request ... a lot of *activity* (pun i…
-
I have a .NET 7 Function App (Isolated Worker) that has Application Insights set up using the same instructions [documented here](https://learn.microsoft.com/en-us/azure/azure-functions/dotnet-isolated…
-
The [Build/install](https://docs.arbor-sim.org/en/latest/install/build_install.html) page should explain how to run Python extension tests.
Running pytest in any of the directories test/, test/unit_…
-
I prepared 1000 images and ran a training test. I set the max steps to 1000 and the training finished in 6 minutes, but the result is very cool!
Do you have any tips for running training?
![ima…
-
I tried this on Ubuntu 22.04, but it gives the following error.
E0724 19:33:34.565000 128818126430656 torch/distributed/elastic/multiprocessing/api.py:702] failed (exitcode: -9) local_rank: 0 (p…
-
### Describe the bug
Thank you for your amazing work. It seems like models are not saved or loaded properly after fine-tuning with `train_custom_diffusion.py` on a new dataset. Generated validation images ar…
-
### Describe the bug
This time I set the number of steps to 2 to make sure it correctly saves the model after an hour of training, but it does not.
### Reproduction
Run `accelerate config`
```
comp…