Error in YOLO (SAM commented)

AnandSingh-0619 commented 4 months ago

For the SAM test, it did not raise an out-of-memory exception, but the experiments ended with:

srun: error: sonny: task 1: Killed
srun: Terminating StepId=921037.0
slurmstepd: error: *** STEP 921037.0 ON sonny CANCELLED AT 2024-06-15T15:38:26 ***

yolosam_fullres_100_test_nosam-ver-921037.log

yusufali98 commented 4 months ago

Does the latest commit resolve this error? If yes, can you mention what the issue was?

AnandSingh-0619 commented 4 months ago

Yes. The issue was mainly related to data handling. The detector and segmentation outputs were stored as NumPy arrays, leading to frequent data transfers between the CPU and GPU. This caused both errors: 1. the EOF error and 2. the CUDA out-of-memory error.

yusufali98 commented 4 months ago

By 1. you mean the "task killed" error, right? The EOF error is just a generic error that is always printed, and it doesn't tell you anything about the actual reason for the job crash.

Also, can you specify the main changes in the code that work around this error?

AnandSingh-0619 commented 4 months ago

At the beginning of the code execution:

2024-06-25 13:57:30,666 CPU usage: 20.2%
2024-06-25 13:57:30,666 RAM usage: 51.8%
2024-06-25 13:57:30,666 Disk IO: Read 7302171323904 bytes, Written 1753960457216 bytes
2024-06-25 13:57:30,666 Network IO: Sent 1655522230662969 bytes, Received 1655499428570893 bytes
2024-06-25 13:57:30,666 Open file descriptors: 2

Towards the end:

2024-06-25 14:06:30,171 CPU usage: 43.4%
2024-06-25 14:06:30,171 RAM usage: 98.8%
2024-06-25 14:06:30,171 Disk IO: Read 7304257080832 bytes, Written 1754065107968 bytes
2024-06-25 14:06:30,171 Network IO: Sent 1655524248367368 bytes, Received 1655507879648098 bytes
2024-06-25 14:06:30,171 Open file descriptors: 0

RAM usage peaks at 99.9%, which suggests the system is running out of available memory. Even the benchmark code takes more than 50% of RAM. I was getting the EOF error, with the task cancelled or killed, because the system ran out of memory. In most cases another job was already running on that node, and together with my job it consumed 100% of RAM, so my job was killed or cancelled.
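
To compare the two snapshots above, here is a minimal sketch that parses the `RAM usage` entries and estimates how fast memory grew between them. The helper names `parse_ram_samples` and `ram_growth_per_minute` are hypothetical and not part of home-robot, and the sketch assumes the log format matches the lines quoted in this comment.

```python
import re
from datetime import datetime

# Matches entries such as "2024-06-25 13:57:30,666 RAM usage: 51.8%".
RAM_PATTERN = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) RAM usage: ([\d.]+)%"
)


def parse_ram_samples(log_text):
    """Return a list of (timestamp, ram_percent) tuples found in log_text."""
    samples = []
    for stamp, millis, percent in RAM_PATTERN.findall(log_text):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        when = when.replace(microsecond=int(millis) * 1000)
        samples.append((when, float(percent)))
    return samples


def ram_growth_per_minute(samples):
    """Average RAM change, in percentage points per minute, from first to last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate growth")
    (start, first), (end, last) = samples[0], samples[-1]
    minutes = (end - start).total_seconds() / 60.0
    if minutes <= 0:
        raise ValueError("samples must span a positive time interval")
    return (last - first) / minutes


# The two RAM readings quoted in this comment.
log_text = """
2024-06-25 13:57:30,666 RAM usage: 51.8%
2024-06-25 14:06:30,171 RAM usage: 98.8%
"""

samples = parse_ram_samples(log_text)
rate = ram_growth_per_minute(samples)
print(f"RAM grew by about {rate:.2f} percentage points per minute")
```

A steady climb like this over roughly nine minutes would be consistent with memory accumulating during the run, but the readings are system-wide, so they cannot separate this job's usage from that of other jobs on the node.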

AnandSingh-0619 / home-robot

Error in YOLO (SAM commented) #12