-
**Describe the issue**:
I get a `RuntimeError` with some dask-ml code that worked with dask / distributed 2024.10.0 and earlier. With 2024.11.0 and newer, it fails:
**Minimal Complete Verifiabl…
-
I ran into a rather quirky issue. I used two p4d.24xlarge instances (8xA100 each) on AWS to train my model. The bash script first downloads the data, and only once the download finishes does the training process start by runn…
-
@kaviecos and @mihow have designed & written the specifications for a new ML backend that orchestrates multiple types of models by different research teams, across multiple stages of processing, and i…
mihow updated
3 months ago
-
When running the [FSDP sample app](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/README-EKS.md) on a HyperPod EKS cluster, I got this error:
```
[W CUDAFu…
-
I wonder how much overhead soperator introduces for ML workloads compared with **native Slurm**. This is an important concern, and I would like to know whether you have any statistics.
## Some scenarios
### Sing…
-
Hi,
If in the web demo the stop button is pressed mid-generation, the app terminates.
```
*** Terminating app due to uncaught exception 'NSInternalInconsistencyException', reason: 'NSWindow sho…
-
### System Info
CUDA 11.8
2x 3090
Linux, Ubuntu 22.04 LTS
PyTorch 2.4
### Information
- [X] The official example scripts
- [X] My own modified scripts
###…
D-Mad updated
2 weeks ago
-
I am getting this error.
```
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
```
```
Traceback (most recent call last):
File "/usr/local/lib/python3.10/multiproc…
-
When executing it, I encountered the following error:
Can someone help me, please?
**Code:**
CROSS_VALIDATION_FOLDS = 10
POLYNOMIAL_FEATURES_DEGREE = 2
# Create train and test Snowpark D…