-
### 🐛 Describe the bug
We are using multi-node training with FSDP, and we got the following error during checkpointing through `torch/distributed/checkpoint/state_dict_saver.py`:
```
File "/opt/m…
-
`sklearn.model_selection.cross_validate` fits and scores several models over some CV splits of data.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
…
-
Hi,
I am trying to build OCamlKeepassHttp and am getting the following error message:
```
$ make
ocamlbuild -use-ocamlfind -X semantic -X node_modules src-ocaml/index.byte
+ ocamlfind ocamldep -modules…
-
From [The History of Standard ML](https://dl.acm.org/doi/pdf/10.1145/3386336), footnote 12:
> Polymorphism -- The ML/LCF/Hope Newsletter was self published by Cardelli and MacQueen at Bell Laboratori…
-
**Describe the bug**
While `all_reduce_grads` defined as per the documentation example
```
def all_reduce_grads(grads):
    N = mx.distributed.init().size()
    if N == 1:
        return tree_m…
-
There have been some questions as to what "SLSA for ML" looks like. This issue attempts to give a short synopsis so that we can hopefully agree and turn that into durable documentation.
First, Mach…
-
This graph is not parallel. It's an incremental, serial reduction. Each reducer requires the previous reducer to finish before it can run. I've set up the tasks so that reducers are significantly slow…
-
### Problem
Hi, everyone. I have encountered a problem with PyTorch DDP on a single node with multiple GPUs.
My setup is as follows:
```python
os.environ["MASTER_PORT"] = "9999"
os.environ["CUDA_…
-
**Build scan:**
https://gradle-enterprise.elastic.co/s/y37ifjzcebahk/tests/:x-pack:plugin:ml:internalClusterTest/org.elasticsearch.xpack.ml.integration.BasicDistributedJobsIT/testFailOverBasics
**Re…
-
### What you would like to be added?
Since @andreyvelich commented:
> Unfortunately, we don't have good docs right now about our ElasticPolicy: [https://github.com/kubeflow/training-operator/bl…