Closed: AnandSingh-0619 closed this issue 4 months ago
Does the latest commit resolve this error? If so, can you mention what the issue was?
Yes. The issue was mainly related to data handling. The outputs from the detector and the segmentation model were stored as NumPy arrays, leading to frequent data transfers between the CPU and GPU. This caused both errors: 1. the EOF error and 2. the CUDA out-of-memory error.
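The actual fix is not shown in this thread; the following is only a minimal sketch of the pattern described, keeping detector/segmentation outputs as GPU tensors instead of converting them to NumPy arrays. The names `detections["boxes"]` and `masks["pred_masks"]` are hypothetical placeholders for the model outputs.

```python
import torch

def postprocess(detections, masks):
    """Keep intermediate results on the GPU instead of round-tripping
    through NumPy, which forces a device-to-host copy on every call."""
    boxes = detections["boxes"]      # torch.Tensor already on the GPU
    seg = masks["pred_masks"]        # torch.Tensor already on the GPU

    # Problematic pattern: .cpu().numpy() copies data to host memory,
    # and any later GPU op copies it back, adding transfer overhead and
    # extra allocations.
    # boxes_np = boxes.cpu().numpy()

    # Preferred pattern: stay in torch on the original device and move
    # to the CPU only once, at the very end, if NumPy is really needed.
    keep = seg.sum(dim=(-2, -1)) > 0  # all computation stays on the GPU
    return boxes[keep], seg[keep]
```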
By 1. do you mean the "task killed" error? The EOF error is just a generic error that is always printed; it doesn't tell you anything about the actual reason for the job crash.
Also, can you specify the main code changes made to work around this error?
At the beginning of the code execution:
2024-06-25 13:57:30,666 CPU usage: 20.2%
2024-06-25 13:57:30,666 RAM usage: 51.8%
2024-06-25 13:57:30,666 Disk IO: Read 7302171323904 bytes, Written 1753960457216 bytes
2024-06-25 13:57:30,666 Network IO: Sent 1655522230662969 bytes, Received 1655499428570893 bytes
2024-06-25 13:57:30,666 Open file descriptors: 2
Towards the end:
2024-06-25 14:06:30,171 CPU usage: 43.4%
2024-06-25 14:06:30,171 RAM usage: 98.8%
2024-06-25 14:06:30,171 Disk IO: Read 7304257080832 bytes, Written 1754065107968 bytes
2024-06-25 14:06:30,171 Network IO: Sent 1655524248367368 bytes, Received 1655507879648098 bytes
2024-06-25 14:06:30,171 Open file descriptors: 0
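The thread doesn't show how these snapshots are produced; a minimal sketch assuming they come from psutil, which exposes all of the quantities in the log (CPU and RAM percentages, disk and network I/O counters, open file descriptors):

```python
import logging
import psutil

logging.basicConfig(format="%(asctime)s %(message)s", level=logging.INFO)

def log_resource_snapshot():
    """Emit one resource snapshot in roughly the format shown above."""
    proc = psutil.Process()
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    logging.info("CPU usage: %.1f%%", psutil.cpu_percent(interval=None))
    logging.info("RAM usage: %.1f%%", psutil.virtual_memory().percent)
    logging.info("Disk IO: Read %d bytes, Written %d bytes",
                 disk.read_bytes, disk.write_bytes)
    logging.info("Network IO: Sent %d bytes, Received %d bytes",
                 net.bytes_sent, net.bytes_recv)
    logging.info("Open file descriptors: %d", proc.num_fds())
```

Called periodically (e.g. once a minute), this would reproduce the growth in RAM usage visible between the two snapshots above.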
RAM usage peaks at 99.9%, which suggests the system is running out of available memory. Even the benchmark code alone takes more than 50% of RAM. I was getting "EOF", "Task Cancelled", or "Killed" as the system ran out of memory. In most cases there was an existing job already running on that node which, together with my job, consumed 100% of RAM, so my job was killed or cancelled.
For the SAM test, it did not raise an out-of-memory exception, but the experiments ended with:
yolosam_fullres_100_test_nosam-ver-921037.log