-
## Bug Description
I am running a distributed Linear model (20 parameters) across 2 GPU nodes, each node having 2 NVIDIA H100 NVL GPUs. The model uses the DDP parallelization strategy. I am generating…
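The report above is truncated; for orientation, a minimal sketch of the described setup follows, assuming a `torchrun` launch, synthetic data, and a 20-parameter `nn.Linear` model (19 weights + 1 bias). The script name and launch flags are illustrative, not taken from the report.

```python
# Minimal sketch (assumed details: torchrun launch, synthetic data, SGD).
# Run on each of the 2 nodes, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=2 --node_rank=<0|1> \
#            --master_addr=<node0-ip> --master_port=29500 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 19 weights + 1 bias = 20 parameters.
    model = torch.nn.Linear(19, 1).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    # Each rank generates its own synthetic shard.
    x = torch.randn(32, 19, device=local_rank)
    y = torch.randn(32, 1, device=local_rank)

    for _ in range(10):
        opt.zero_grad()
        loss = loss_fn(ddp_model(x), y)
        loss.backward()  # gradients are all-reduced across the 4 ranks
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```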
-
### Describe your problem
![image](https://github.com/user-attachments/assets/b3e4d2ed-b140-44bb-92c3-479c8d78008a)
![image](https://github.com/user-attachments/assets/456426c7-5d64-48a1-969d-d9453d…
-
I have a machine with 4 NVIDIA L40 GPUs. I am trying to use the full_finetune_distributed llama3_1/8B_full recipe. My dataset configuration in the config file is given below:
dataset:
_c…
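The dataset block above is cut off; for orientation only, a typical torchtune dataset entry follows the `_component_` instantiation pattern sketched below. The builder and flag shown here are illustrative placeholders, not the reporter's actual values.

```yaml
# Illustrative sketch only -- not the reporter's actual config.
dataset:
  _component_: torchtune.datasets.alpaca_dataset
  train_on_input: True
```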
-
### Is there an existing issue for this?
- [X] I have searched the existing issues
### Is your feature request related to a problem? Please describe.
I would like to request two features that…
-
JIRA Issue: [KIEKER-564] Monitoring and analysis for large-scale distributed/cloud-based systems
Original Reporter: Andre van Hoorn
***
To be polished
Brief explanation:
(Max. 5-7 sentences)
As…
-
### Bug description
I am training a sample model that works on multiple GPUs as long as they are spread across nodes. But as soon as I allocate more than one GPU on a single node, it returns `[rank7]: torch.dist…
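The error message above is truncated; a minimal single-node repro for this kind of `torch.distributed`/NCCL failure might look like the sketch below. The launch command and rank count are assumptions based on the `[rank7]` prefix.

```python
# Hypothetical repro sketch -- assumes a torchrun launch such as:
#   torchrun --nnodes=1 --nproc_per_node=8 repro.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)  # bind each process to its own GPU on the node

t = torch.ones(1, device=local_rank)
dist.all_reduce(t)  # collectives like this are where multi-GPU-per-node errors typically surface
print(f"rank {dist.get_rank()}: {t.item()}")

dist.destroy_process_group()
```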
-
See https://github.com/eclipse-hono/hono/issues/3425
-
# Summary
Some configuration options shouldn't be centrally managed, as different user groups (linked to projects) may want to define them on their own.
# Motivation
We should make the distinc…
-
#### Describe the bug
I'm evaluating mimir-distributed in high-availability mode to determine its reliability when one of the nodes is offline. Following a series of bring-up and bring-down operations, …
-
# Module Request
Note: Please try setting up a configuration yourself before raising an issue to request a configuration: ~~https://config.getamp.sh/~~
***There is a newer beta version available …