-
I am currently using Horovod for model training, with NCCL handling the underlying gradient synchronization. Slow nodes show up during training. Is there any…
-
Starting a new issue in reference to question: (https://github.com/astooke/Synkhronos/issues/11#issuecomment-326628646)
I have not experimented with running Synkhronos multi-node. Currently it's o…
-
Hi, I want to run one LLM across multiple machines.
Within a single node, I want to use tensor parallelism to speed things up.
Across nodes, I want to use pipeline parallelism.
Is this supported? If s…
-
Machines
- dual 4090 ada
- dual A4500
- single A6000
- single A4000
- single 3500 Ada
Focus on the A6000 and A4000 with 10 Gbps networking
- https://www.tensorflow.org/guide/distributed_trai…
-
For testing purposes, I tried deploying Longhorn into a `kind` multi-node cluster.
Longhorn started crash-looping because `iscsi` isn't available.
I'm a bit confused - the docs only say:
> L…
-
Could we have @instantdb/core work on the server?
Right now, only `@instantdb/core` supports subscriptions to queries and presence. If we could run it on the server, users could subscribe to q…
-
### Target SharePoint environment
SharePoint Online
### What SharePoint development model, framework, SDK or API is this about?
other (enter in the "Additional environment details" area below)
###…
-
Given that NCCL is already supported, what is the overhead of adding multi-node GPU support for training/inference?
-
https://github.com/kohya-ss/sd-scripts/blob/2a23713f71628b2d1b88a51035b3e4ee2b5dbe46/fine_tune.py#L247
I have no idea what this line is used for, but it unwraps the DDP module so that the training …
-
#### Title
Automate primaryClusterEndPoint configuration in multicluster CIS
#### Description
In a multi-kubernetes cluster where there is no direct pod-to-pod communication between the clusters,…