Open hanwen-sun opened 3 months ago
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
View latest API docs preview at: https://oneflow-staging.oss-cn-beijing.aliyuncs.com/docs/Oneflow-Inc/oneflow/pr/10443/
The clip_grad-related unit tests did not pass in CI and need further debugging.
CI failed when running job: cuda-misc. PR label automerge has been removed
CI failed when running job: cuda-module. PR label automerge has been removed
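This PR still has one open problem: the 1n2d clip_grad test fails. On the same hardware (machines 26 and 28), using the same Docker image as the CI environment and the whl built from this PR, I still cannot reproduce the CI failure.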
Remove the first to_global in the clip_grad norm computation to reduce unnecessary all-gather operations in the tensor-parallel case.
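
As a rough illustration of why the gather can be skipped, here is a minimal sketch in plain Python. It is not the OneFlow implementation: ranks are simulated as lists, and the function names `norm_via_gather` and `norm_via_local_reduce` are hypothetical. It assumes an L2 norm, where each rank can reduce its own shard to a single scalar and only those scalars need to be combined.

```python
import math

def norm_via_gather(shards):
    # Baseline: concatenate every shard (a stand-in for an all-gather),
    # then take the L2 norm of the full gradient.
    full = [x for shard in shards for x in shard]
    return math.sqrt(sum(x * x for x in full))

def norm_via_local_reduce(shards):
    # Each simulated rank reduces its own shard to one scalar...
    local_sq = [sum(x * x for x in shard) for shard in shards]
    # ...and only those scalars are summed (a stand-in for an all-reduce).
    return math.sqrt(sum(local_sq))

# A gradient split across two simulated ranks.
shards = [[3.0, 0.0], [0.0, 4.0]]
gathered = norm_via_gather(shards)
reduced = norm_via_local_reduce(shards)
```

Both paths yield the same norm, but the second one communicates one scalar per rank instead of the whole tensor.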