-
Division is a fundamental part of machine learning: it is used to compute some activation functions and other operations during training and prediction, so I improved the DIV ope…
-
Just like the lab in #2439, this one (https://www.cloudskillsboost.google/course_sessions/2920313/labs/325084) is also part of the recommendation course (of the Professional Machine Learning Engineer P…
-
**Describe the bug**
I'm trying to use DeepSpeed to fine-tune a BERT-based classification model, but when I try to launch multi-node training, all nodes, including localhost, get errno: 110 - Connection t…
-
-------------------------------------------------------------------------------------------------------------
#### Issue description
PennyLane Lightning encounters a "filesystem error" when tryi…
-
Followed one of the GCP courses briefly and came across a few deprecation warnings which I wouldn't mind fixing as I'll be following along some of the other courses as well.
This sample is taken f…
-
I'm not quite sure this is the intended place for what I am writing. The fact is that I have no practical experience with Python or with machine learning. Quite by chance I read about your work and…
-
**Describe the bug**
When fine-tuning my model using deepspeed==0.13.5 and the Hugging Face Trainer, loss and grad_norm become NaN at step 2.
![image](https://github.com/microsoft/DeepSpeed/assets/29994…
-
Thanks for your work. Which versions of CUDA and cuDNN are you using?
-
Hello people,
I ran into an error when testing the ES algorithm.
```
python3 -m es_distributed.main master --master_socket_path /tmp/es_redis_master.sock --algo es --exp_file configu…