-
It would be useful for the RAPIDS effort to have a multi-node join computation deployed from Kubernetes. Until UCX arrives this will likely be slow, but we can probably work on deployment and configu…
-
We have an account holder who is running out of wall clock time, which is generously set to 48 hours on our cluster, even when using a 128-core node with the multi-threading option turned on. Since th…
-
It seems that it tries to connect to the node that used to run the task in order to deregister the service from it.
What if you run a multi-master Consul cluster setup with local agents running on each …
-
## Feature Request
### Description
Load balancers are ubiquitous in cloud environments, but they are not standardized and require manual work in on-premises setups. Hence letting Talos handle this requirement inte…
-
We've been exploring the use of this library in a Kubernetes (K8s) environment, but the choice of SQLite as a back end may be preventing that use:
- Scaling in a K8s environment involves using mu…
-
Hello, thank you for reproducing the work of the paper.
I tried to launch the training of the model on the MSN-Hard dataset, but I'm unable to launch the training because a CUDA_ERROR_OUT_OF_MEMORY…
-
Hi!
Currently, I'm trying to set up this S3 driver for my volumes. To do that, I've first installed this driver through the helm chart, and then installed [this FTP server chart](https://github.com/sj…
-
### Modin version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest released version of Modin.
- [ ] I have confirmed t…
-
Jira Link: [DB-5433](https://yugabyte.atlassian.net/browse/DB-5433)
### Description
While comparing table limits between a YB-colocated database and a YB-normal database, we observed high CPU utilisation on…
-
**Describe the bug**
DeepSpeed ZeRO++ features aren't working:
1. On a single node, passing `zero_hpz_partition_size`, `zero_quantized_gradients`, `zero_quantized_weights` leads to forward pass err…