Closed janamach closed 3 years ago
Matlab segmentation faults are a pain to debug :(
As it happens only on a specific partition, with the same data and code, I would say something in the machines on that partition causes them to not play nicely with matlab. It also happens mid-tracking, so my guess is a memory issue of some sort, although anTraX memory consumption is fixed during execution and doesn't "explode". There is not much I can do to debug this, and honestly I would say to go with the partition you know to work...
And yes, your settings are overkill. The tracking step is a two-thread program: one thread reads frames from the video, and the other analyzes them. Therefore, it never uses more than 2 cpus. Usually one thread needs to wait for the other, so you'll see usage between 100% and 200% cpu. Memory usage is more variable and depends on factors such as frame size, number of blobs detected, etc. 4GB per task is usually more than enough. If you want to optimize, you can check the actual memory usage (either on your local machine or your cluster) and adapt your settings.
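To make the "check the actual usage and adapt" step concrete, here is a rough sketch of the arithmetic. It is not part of anTraX: `right_size`, the 25% headroom, and the 3000 MB peak are all made-up placeholders, and you would plug in whatever peak memory you actually measure for one tracking task.

```python
import math

# From the explanation above: one reader thread plus one analysis thread.
TRACKING_THREADS = 2


def right_size(requested_cpus, mem_per_cpu_mb, peak_mem_mb, headroom=1.25):
    """Suggest a smaller per-task request from observed usage.

    peak_mem_mb is a value you measure yourself; headroom is an arbitrary
    safety margin (an assumption, not an anTraX recommendation).
    """
    # Requesting more cpus than the tracker has threads only leaves them idle.
    cpus = min(requested_cpus, TRACKING_THREADS)
    needed_mb = peak_mem_mb * headroom
    # Split the total over the cpus and round up to the next 100 MB.
    mem_per_cpu = math.ceil(needed_mb / cpus / 100) * 100
    return {
        "cpus": cpus,
        "mem_per_cpu_mb": mem_per_cpu,
        "idle_cpus": requested_cpus - cpus,
        "requested_total_mb": requested_cpus * mem_per_cpu_mb,
        "suggested_total_mb": cpus * mem_per_cpu,
    }


# The failing job's settings (cpus=10, mem-per-cpu=5000) with a
# hypothetical measured peak of 3000 MB per task.
suggestion = right_size(requested_cpus=10, mem_per_cpu_mb=5000, peak_mem_mb=3000)
print(suggestion)
```

With these placeholder numbers, 8 of the 10 requested cpus would sit idle, and the 50000 MB requested per task shrinks to a few GB, which is in line with the "4GB per task" rule of thumb.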
I agree that this error is too rare to be worth the time to debug, but it would be good to have an idea of what not to do in order to avoid it. I used that same partition again for tracking, but had fewer tasks running in parallel, and the segmentation error did not occur. I wonder if I created a bottleneck somewhere by running too many tasks at the same time.
I remember you mentioned before that tracking was a two-thread thing, and I somehow ignored it :-/ I guess I left many cpus idle for no reason, not good. :-/
On the upside, I did manage to go through all the steps on HPC, from installing all dependencies locally to generating CSVs from my own data. This is a big achievement for me, as the whole concept of HPC was largely unknown to me up until now. Thank you again for all the help and for being so patient with me :)
I ran into an error while running tracking on multiple files on HPC and I was able to reproduce the same error by starting the job with an identical command, which was:
antrax track H1CN0304/ --movlist 1-5,13,14,21-39,43,45,49-59 --hpc --hpc-options partition=single,email=jana.mach@bio.uni-freiburg.de,cpus=10,mem-per-cpu=5000,time=09:00:00
All instances failed with a similar error (see below) twice. However, with a slightly different command all movies were tracked successfully:
antrax track H1CN0304/ --movlist 1-5,13,14,21-39,43,45,49-59 --hpc --hpc-options partition=fat,email=jana.mach@bio.uni-freiburg.de,cpus=8,mem-per-cpu=5000,time=11:00:00
The difference between the two commands was the partition used (single vs fat) and the number of cpus (10 vs 8). The partition called single is usually less busy and by default I can run many more tasks in parallel. I was wondering if there was an explanation for the error with the first command? Am I overdoing it with the number of CPUs? Still trying to get a feeling for this :-)