PaddlePaddle / PaddleHelix

Bio-Computing Platform Featuring Large-Scale Representation Learning and Multi-Task Deep Learning: the “螺旋桨” (PaddleHelix) bio-computing toolkit

HelixFold inference fails on the Ultra-Long Monomer Protein Demo #294

Open zepingWww opened 3 months ago

zepingWww commented 3 months ago

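I want to run HelixFold inference with the latest paddlepaddle-gpu. I adapted the setup following README_inference.md: ppfleetx was upgraded and adapted on top of paddle 2.6.1, and all other steps match the documentation.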

Environment
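The environment details did not survive in this post. A minimal sketch for collecting the relevant versions, using only the Python standard library; the distribution names `paddlepaddle-gpu` and `ppfleetx` passed in the usage note are assumptions about how the packages were installed, and `collect_versions` is a hypothetical helper name.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return {distribution name: installed version or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed under this exact name.
            versions[name] = "not installed"
    return versions


def environment_report(packages):
    """Combine the Python version with the requested package versions."""
    report = {"python": platform.python_version()}
    report.update(collect_versions(packages))
    return report
```

Calling `environment_report(["paddlepaddle-gpu", "ppfleetx"])` and pasting the resulting dictionary, together with the GPU model and CUDA version, would probably be enough to reproduce the setup.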

Problem

While adapting HelixFold for the Ultra-Long Monomer Protein Demo, inference crashes outright, without any error message, as soon as it reaches the self.evoformer layer.

Is there a layer in paddle 2.6 with the same function that could replace DistEmbeddingsAndEvoformer, or can this layer be modified so that inference runs normally?

terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what():  

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   add_ad_func(paddle::Tensor const&, paddle::Tensor const&)
1   add_ad_func(paddle::Tensor const&, paddle::Tensor const&)
2   paddle::experimental::add(paddle::Tensor const&, paddle::Tensor const&)
3   void phi::AddRawKernel<phi::dtype::bfloat16, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, phi::DenseTensor const&, int, phi::DenseTensor*)
4   phi::dtype::bfloat16* phi::DeviceContext::Alloc<phi::dtype::bfloat16>(phi::TensorBase*, unsigned long, bool) const
5   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
6   paddle::memory::allocation::Allocator::Allocate(unsigned long)
7   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
8   paddle::memory::allocation::RetryAllocator::AllocateImpl(unsigned long)
9   paddle::memory::allocation::StreamSafeCUDAAllocator::AllocateImpl(unsigned long)
10  paddle::memory::allocation::AutoGrowthBestFitAllocator::AllocateImpl(unsigned long)
11  paddle::memory::allocation::AutoGrowthBestFitAllocator::FreeIdleChunks()
12  paddle::memory::allocation::CUDAAllocator::FreeImpl(phi::Allocation*)
13  phi::enforce::EnforceNotMet::EnforceNotMet(common::ErrorSummary const&, char const*, int)
14  phi::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ExternalError: CUDA error(700), an illegal memory access was encountered. 
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at ../paddle/fluid/platform/device/gpu/gpu_info.cc:269)
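CUDA reports illegal memory accesses asynchronously, so the frame that raises here (a bfloat16 `add` whose allocation triggers `FreeIdleChunks`) is not necessarily the kernel that performed the invalid access. One way to narrow this down is to rerun with `CUDA_LAUNCH_BLOCKING=1`, a standard CUDA environment variable that makes kernel launches synchronous. A minimal sketch follows; `blocking_env` and `run_blocking` are hypothetical helper names, and the command passed in stands for whatever inference command was actually used.

```python
import os
import subprocess


def blocking_env(base_env=None):
    """Copy an environment and force synchronous CUDA kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    # With synchronous launches, the error should surface at the faulty kernel.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(command):
    """Run `command` (a list of argv strings) and return its exit code."""
    completed = subprocess.run(command, env=blocking_env())
    return completed.returncode
```

For example, `run_blocking(["sh", "my_inference_script.sh"])`, where the script name is a placeholder. Execution will be slower, but the traceback printed at the crash should then point much closer to the operation that actually faulted.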