-
/usr/local/cuda-11.7/bin/nvcc -I/home/project/machineLearn/CPM-Bee/venv/lib/python3.10/site-packages/torch/include -I/home/project/machineLearn/CPM-Bee/venv/lib/python3.10/site-packages/torch/include…
-
I added an op as a CUDA extension. Running it under the bmtrain framework raises an OOM error, presumably because ZeRO is not taking effect. How can I fix this?
-
Are there plans for adam and adam_offload to support bf16? Thanks.
-
![image](https://user-images.githubusercontent.com/72364657/224590366-70000f5a-4f0c-4938-914a-c94bb4295513.png)
-
https://github.com/OpenBMB/ModelCenter/blob/main/examples/cpm2/pretrain_cpm2.py#L24
Is the model initialization here executed on every GPU?
If the model is large, host memory could run out (OOM). Thank you for your answer.
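To make the concern concrete, here is a rough back-of-the-envelope sketch of the peak host memory if every rank on a node really does build its own full copy of the model on the CPU, which is the assumption the question is asking about, not something confirmed for ModelCenter. The function names, the 10-billion-parameter figure, and the 8-ranks-per-node default are all hypothetical and only there for illustration.

```python
def host_init_bytes(n_params: int, bytes_per_param: int = 4, ranks_per_node: int = 8) -> int:
    """Estimate host memory used when each rank on a node holds a full model copy.

    Assumption (hypothetical): every rank materializes all parameters on the CPU
    at the same time, and nothing is shared between the processes.
    """
    return n_params * bytes_per_param * ranks_per_node


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical example: 10 billion fp32 parameters, 8 ranks on one node.
    total = host_init_bytes(10_000_000_000, bytes_per_param=4, ranks_per_node=8)
    print(f"{total} bytes ~= {to_gib(total):.1f} GiB of host RAM")
```

If the estimate exceeds the node's RAM, the per-rank initialization would be the first thing to look at; halving `bytes_per_param` (fp16 weights) or lowering `ranks_per_node` shows how quickly the number moves.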
-
CUDA 11.6
Python 3.10
In file included from /tmp/pip-install-docs1q76/bmtrain_cd370155ab894fa5969602d5d8168e37/csrc/nccl.cpp:4:0:
/opt/conda/envs/viscpm/lib/python3.10/site-packages/torch/inc…
-
### System Info
```
python -x examples/Aquila/Aquila-chat/aquila_chat.py
[2023-06-10 13:32:17,187] [INFO] [logger.py:85:log_dist] [Rank -1] Unsupported bmtrain
args list is ['--IGNORE_INDEX', …
```
-
Hello, I installed bmtrain on Python 3.10 and the installation reports success, but `import bmtrain` fails with this error: `/bmtrain/nccl/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: ncclBroadcast`. The torch version is 2.0.0.
-
Error message: `csrc/adam_cpu.cpp:158:27: error: 'const class at::Tensor' has no member named 'is_cpu'`
-
### Description
Installing with pip install and installing from the GitHub source both fail with the same error.
![image](https://github.com/FlagAI-Open/FlagAI/assets/4593091/b9f63b16-5e09-4237-bbce-eb1854311a83)