IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

Aborted (core dumped) #56

Open vshkurin opened 2 years ago

vshkurin commented 2 years ago

Hello! You are doing a very cool project that helps ordinary users work around the limited memory on their graphics cards. But unfortunately my neural network training ends with an error.

Sometimes this error is:

F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:136] Check failed: this->H2Dstream->ok() Aborted (core dumped)

Sometimes:

F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:175] Check failed: this->D2Hstream->ok() Aborted (core dumped)

Sometimes the errors appear immediately, sometimes in the middle of an epoch.

I am using LMS with tensorflow 2.2.0.

smatzek commented 2 years ago

Thanks for trying out LMS. LMS is not being actively maintained, hence the 2.2.0 version. It has been nearly 2 years since I ran a job with LMS, but I recall that those types of errors would come out during some types of out of memory errors.

Even with LMS enabled it is possible to run out of memory on the GPU in certain cases such as when the model requires too many active tensors, or individual operations have input and output tensors so large they cannot fit.
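To make the second case concrete, here is a rough back-of-the-envelope estimate (hypothetical layer shapes, float32): a single op's input and output tensors must both be resident on the device at the same time, no matter how aggressively other tensors are swapped to the host.

```python
# Rough estimate of the device memory one conv-style op needs at once:
# its input and output activations must both be resident, regardless of
# any host-side swapping. The shapes below are hypothetical.

def tensor_bytes(shape, dtype_bytes=4):
    """Bytes for a dense tensor of the given shape (float32 by default)."""
    n = 1
    for d in shape:
        n *= d
    return n * dtype_bytes

batch = 32
inp = (batch, 512, 512, 64)    # NHWC input activation
out = (batch, 512, 512, 128)   # output activation of the layer

needed = tensor_bytes(inp) + tensor_bytes(out)
print(f"{needed / 2**30:.2f} GiB")  # input + output must coexist on the GPU
```

With these (made-up) shapes the pair alone needs 6 GiB, which no amount of swapping can fit onto a 3 GB card.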

If you provide more error messages / output from your call I may be able to help more.

vshkurin commented 2 years ago

Thank you for the clarification! I am using a GeForce GTX 1060 3GB, which probably has very little video memory. I noticed that if I reduce the batch size, the error disappears. That is all I can tell you; beyond the above, nothing appears in the console.
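That reduce-until-it-fits approach can be automated. A minimal sketch, assuming a `fits(batch)` probe that in a real job would run one training step and catch the ResourceExhaustedError TensorFlow raises on GPU out-of-memory (the memory budget numbers below are hypothetical stand-ins):

```python
def largest_fitting_batch(fits, start=256):
    """Halve the batch size until a training step succeeds.

    In a real job, `fits` would run a single training step at the given
    batch size and return False if TensorFlow raises its GPU OOM error.
    """
    b = start
    while b >= 1:
        if fits(b):
            return b
        b //= 2  # halve and retry
    return None  # even batch size 1 does not fit

# Stand-in probe: pretend each sample costs 40 MiB and the card has 3 GiB.
budget = 3 * 2**30
per_sample = 40 * 2**20
print(largest_fitting_batch(lambda b: b * per_sample <= budget))  # prints 64
```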

HarshaanNiles010 commented 1 year ago

Were you able to figure out why the error sometimes appears at the very beginning of the process? Correct me if I am wrong: with a large batch size, computation starts on the GPU, and when a tensor is no longer in use it should be sent back to the host machine for storage. That is a lot like a paging algorithm, so handling large batch sizes should not be a problem.
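The paging intuition can be sketched as a toy LRU swapper (a hypothetical simulation, not TFLMS's actual mechanism, which rewrites the graph to insert swap-out/swap-in ops). The sketch also shows where the analogy breaks: all inputs of the currently executing op must be resident at once, so a single op can still overflow the device even with perfect swapping.

```python
from collections import OrderedDict

class DeviceMemory:
    """Toy LRU swapper: evict least-recently-used tensors to host RAM
    when the device is full. Unlike OS paging, every tensor the current
    op touches must be resident simultaneously, so one sufficiently
    large tensor (or set of inputs) can still cause an OOM."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # name -> size, in LRU order
        self.host = {}                 # tensors swapped out to host RAM

    def used(self):
        return sum(self.resident.values())

    def touch(self, name, size):
        """Bring a tensor onto the device, evicting LRU tensors as needed."""
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return
        size = self.host.pop(name, size)     # swap back in if it was evicted
        while self.resident and self.used() + size > self.capacity:
            lru, lru_size = self.resident.popitem(last=False)
            self.host[lru] = lru_size        # swap out to host RAM
        if self.used() + size > self.capacity:
            raise MemoryError(f"tensor {name} alone exceeds device memory")
        self.resident[name] = size

mem = DeviceMemory(capacity=3)
mem.touch("a", 2)
mem.touch("b", 2)            # "a" is swapped out to make room
print(sorted(mem.resident))  # ['b']
print(sorted(mem.host))      # ['a']
```

So the error appearing right at the start is consistent with smatzek's explanation: if the first ops of the graph already need more simultaneously-live tensor memory than the card has, swapping cannot help.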