DeepLink-org / deeplink.framework

BSD 3-Clause "New" or "Revised" License

fix async bug on muxi #916

Closed: zhaoguochun1995 closed this 2 months ago

zhaoguochun1995 commented 2 months ago

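Background: on muxi devices, models fail when run in asynchronous mode and only run correctly in synchronous mode.

The muxi cat kernel allocates a pin_memory tensor and uses the copy operator on it. If that pin_memory tensor is allocated by the DIPU allocator but the copy is performed with the CUDA backend, the tensor's lifetime is not handled correctly. While the kernel is still computing asynchronously, the pin_memory data gets modified because the memory has already been handed to another tensor. This causes errors in asynchronous runs, while synchronous runs are unaffected.

The problem disappears when DIPU_HOST_MEMCACHING_ALGORITHM=RAW is set.

Detailed notes: https://aicarrier.feishu.cn/wiki/PXdYwcsjii9TJZk5AR1cG2bxnFd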
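The failure mode described above can be illustrated with a toy model in pure Python. This is only a sketch of the general hazard, not DIPU's allocator or the muxi kernel: `CachingHostAllocator`, `AsyncStream`, and `run` are hypothetical names invented for this example, and a `bytearray` stands in for a pin_memory buffer. It assumes a caching allocator that returns a freed block to its free list immediately, without waiting for queued device work that still reads from the block.

```python
from collections import deque


class CachingHostAllocator:
    """Toy caching allocator (hypothetical): freed blocks are reused at once."""

    def __init__(self):
        self.free_blocks = []

    def allocate(self, size):
        # Reuse a cached block if one is large enough, otherwise make a new one.
        for i, block in enumerate(self.free_blocks):
            if len(block) >= size:
                return self.free_blocks.pop(i)
        return bytearray(size)

    def free(self, block):
        # No check for pending asynchronous readers: this is the hazard.
        self.free_blocks.append(block)


class AsyncStream:
    """Toy device stream (hypothetical): work runs only on synchronize()."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, fn):
        self.pending.append(fn)

    def synchronize(self):
        while self.pending:
            self.pending.popleft()()


def run(sync_before_free):
    allocator = CachingHostAllocator()
    stream = AsyncStream()
    device = {}

    # Stage data in a "pinned" host buffer and queue an asynchronous copy.
    staging = allocator.allocate(4)
    staging[:] = b"\x01\x02\x03\x04"
    stream.enqueue(lambda: device.__setitem__("out", bytes(staging)))

    if sync_before_free:
        stream.synchronize()  # synchronous mode: copy finishes before free

    # The buffer is released and immediately handed to another tensor.
    allocator.free(staging)
    other = allocator.allocate(4)
    other[:] = b"\xff\xff\xff\xff"

    stream.synchronize()
    return device["out"]


print(run(sync_before_free=True))   # copy sees the original data
print(run(sync_before_free=False))  # copy sees the other tensor's data
```

In this toy model the asynchronous run copies the overwritten bytes because the block was recycled before the queued copy executed, which mirrors the symptom reported here. An allocator that does not cache host blocks, or one that defers reuse until the pending work completes, would avoid the reuse in this sketch.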