Social-Evolution-and-Behavior / anTraX

anTraX: high throughput tracking of color-tagged insects
https://antrax.readthedocs.io/
GNU General Public License v3.0

Running antrax on HPC / `MATLAB:badsubscript` and related errors during solve #20

Closed janamach closed 3 years ago

janamach commented 3 years ago

Hi :-)

The HPC server I am using has per-user limits (100 scheduled jobs, 3 days max per job). The solve step in my case generated more than 100 jobs, which caused some of them to be cancelled. Since solve is a three-step process, I figured I could start each step manually, e.g. `$ sbatch path/to/hpc_solve1.sh`. Is this a reasonable way to do it? Is there a different way to get around the maximum-jobs-per-user limit?
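One possible way to run the solve steps in order while staying under a per-user job cap is to submit one step, wait until the queue is empty, and only then submit the next. The sketch below is an illustration, not part of anTraX. It assumes a Slurm cluster (`sbatch`, `squeue`), that the user has no other jobs queued, and that each step has its own script. Only `hpc_solve1.sh` is named in this thread; the other script names in the usage comment are guesses.

```python
import subprocess
import time


def count_queued(user, run=subprocess.run):
    """Count the user's pending and running jobs, one array task per line."""
    result = run(["squeue", "-h", "-r", "-u", user],
                 capture_output=True, text=True, check=True)
    return sum(1 for line in result.stdout.splitlines() if line.strip())


def run_steps_in_order(scripts, user, poll_seconds=60,
                       run=subprocess.run, sleep=time.sleep):
    """Submit each step script, then wait until none of the user's jobs remain."""
    for script in scripts:
        run(["sbatch", script], check=True)
        while count_queued(user, run) > 0:
            sleep(poll_seconds)


# Hypothetical usage; script names other than hpc_solve1.sh are assumptions:
# run_steps_in_order(
#     ["path/to/hpc_solve1.sh", "path/to/hpc_solve2.sh", "path/to/hpc_solve3.sh"],
#     user="your_username",
# )
```

This only keeps the steps from piling up in the queue together. It does not check whether the jobs of a step succeeded, and it does not help if a single step alone exceeds the cap.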

asafgal commented 3 years ago

My bad, the syntax should be solve_across_movies(Trck, 3);

On 18 Apr 2021, at 22:29, Jana Mach @.***> wrote:

Sorry, that was silly of me :-/ With the dataset that had the problem:

Trck = trhandles('.');
21:25:09 -I- Loading tracking session from expdir
21:25:17 -I- Reading video information from file
solve_across_movies(Trck, 'g', 3);
Error using solve_across_movies (line 11)
Expected a string scalar or character vector for the parameter name.


janamach commented 3 years ago

>> solve_across_movies(Trck, 3);
21:45:08 -I- solving graph from movies 25-36
21:45:08 -I- Loading trgraph from antrax/graphs/graph_25_25.mat
21:45:37 -I- Loading trgraph from antrax/graphs/graph_26_26.mat
21:45:55 -I- Loading trgraph from antrax/graphs/graph_27_27.mat
21:46:11 -I- Loading trgraph from antrax/graphs/graph_28_28.mat
21:46:24 -I- Loading trgraph from antrax/graphs/graph_29_29.mat
21:46:31 -I- Loading trgraph from antrax/graphs/graph_30_30.mat
21:46:37 -I- Loading trgraph from antrax/graphs/graph_31_31.mat
21:46:42 -I- Loading trgraph from antrax/graphs/graph_32_32.mat
21:46:49 -I- Loading trgraph from antrax/graphs/graph_33_33.mat
21:46:57 -I- Loading trgraph from antrax/graphs/graph_34_34.mat
21:47:02 -I- Loading trgraph from antrax/graphs/graph_35_35.mat
21:47:06 -I- Loading trgraph from antrax/graphs/graph_36_36.mat
21:47:08 -I- Finished loading trgraph with 80451 tracklets
21:47:12 -I- Loading ids
21:47:31 -I- Finding single ant nodes
21:47:33 -I- Some preperations
21:47:38 -I- Filtering out tracklets identified as non-ant
21:47:38 -I- ...1082 tracklets classified as no-ant were filtered
21:47:39 -I- ...13588 short, unconnected and unidentified tracklets were filtered
21:47:41 -I- Propagating ids from src tracklets
21:47:45 -I-     ...finished 1000/25235
21:47:45 -I-     ...finished 2000/25235
21:47:45 -I-     ...finished 3000/25235
21:47:45 -I-     ...finished 4000/25235
21:47:45 -I-     ...finished 5000/25235
21:47:45 -I-     ...finished 6000/25235
21:47:45 -I-     ...finished 7000/25235
21:47:45 -I-     ...finished 8000/25235
21:47:45 -I-     ...finished 9000/25235
21:47:45 -I-     ...finished 10000/25235
21:47:45 -I-     ...finished 11000/25235
21:47:45 -I-     ...finished 12000/25235
21:47:45 -I-     ...finished 13000/25235
21:47:45 -I-     ...finished 14000/25235
21:47:45 -I-     ...finished 15000/25235
21:47:45 -I-     ...finished 16000/25235
21:47:45 -I-     ...finished 17000/25235
21:47:45 -I-     ...finished 18000/25235
21:47:45 -I-     ...finished 19000/25235
21:47:45 -I-     ...finished 20000/25235
21:47:45 -I-     ...finished 21000/25235
21:47:46 -I-     ...finished 22000/25235
21:47:46 -I-     ...finished 23000/25235
21:47:46 -I-     ...finished 24000/25235
21:47:46 -I-     ...finished 25000/25235
21:47:46 -I- Propagation loops
Index in position 1 exceeds array bounds.

Error in trgraph/solve>propagate_all (line 522)
            score = G.assignment_scores(assigned_nodes(i),idix(j));

Error in trgraph/solve (line 150)
propagate_all(G);

Error in solve_across_movies (line 72)
solve(G,false,true);

>>

asafgal commented 3 years ago

You don't seem to have my code changes. Did you pull my latest commit?

asafgal commented 3 years ago

BTW, did you re-track your videos with different parameters? You have 80K tracklets in the latest run, while in the previous run you had 17K for the same videos.

janamach commented 3 years ago

hmm, I was on the wrong branch. Sorry :-/

> BTW, did you re-track your videos with different parameters?

I have two datasets, both containing 60 hours of video, but with very different tracklet counts. Both datasets had issues with step 2; their parameters are not identical, and their classifiers are different.

Dataset-1:

22:08:20 -I- solving graph from movies 25-36
22:08:20 -I- Loading trgraph from antrax/graphs/graph_25_25.mat
22:08:50 -I- Loading trgraph from antrax/graphs/graph_26_26.mat
22:09:09 -I- Loading trgraph from antrax/graphs/graph_27_27.mat
22:09:24 -I- Loading trgraph from antrax/graphs/graph_28_28.mat
22:09:37 -I- Loading trgraph from antrax/graphs/graph_29_29.mat
22:09:45 -I- Loading trgraph from antrax/graphs/graph_30_30.mat
22:09:50 -I- Loading trgraph from antrax/graphs/graph_31_31.mat
22:09:56 -I- Loading trgraph from antrax/graphs/graph_32_32.mat
22:10:02 -I- Loading trgraph from antrax/graphs/graph_33_33.mat
22:10:10 -I- Loading trgraph from antrax/graphs/graph_34_34.mat
22:10:15 -I- Loading trgraph from antrax/graphs/graph_35_35.mat
22:10:19 -I- Loading trgraph from antrax/graphs/graph_36_36.mat
22:10:21 -I- Finished loading trgraph with 80451 tracklets
22:10:25 -I- Loading ids
22:10:45 -I- Finding single ant nodes
22:10:47 -I- Some preperations
22:10:51 -I- Filtering out tracklets identified as non-ant
22:10:51 -I- ...1082 tracklets classified as no-ant were filtered
22:10:52 -I- ...13588 short, unconnected and unidentified tracklets were filtered
22:10:54 -I- Propagating ids from src tracklets
22:10:58 -I-     ...finished 1000/25235
22:10:58 -I-     ...finished 2000/25235
22:10:58 -I-     ...finished 3000/25235
22:10:58 -I-     ...finished 4000/25235
22:10:58 -I-     ...finished 5000/25235
22:10:58 -I-     ...finished 6000/25235
22:10:58 -I-     ...finished 7000/25235
22:10:58 -I-     ...finished 8000/25235
22:10:58 -I-     ...finished 9000/25235
22:10:58 -I-     ...finished 10000/25235
22:10:58 -I-     ...finished 11000/25235
22:10:58 -I-     ...finished 12000/25235
22:10:58 -I-     ...finished 13000/25235
22:10:58 -I-     ...finished 14000/25235
22:10:58 -I-     ...finished 15000/25235
22:10:58 -I-     ...finished 16000/25235
22:10:58 -I-     ...finished 17000/25235
22:10:58 -I-     ...finished 18000/25235
22:10:58 -I-     ...finished 19000/25235
22:10:58 -I-     ...finished 20000/25235
22:10:59 -I-     ...finished 21000/25235
22:10:59 -I-     ...finished 22000/25235
22:10:59 -I-     ...finished 23000/25235
22:10:59 -I-     ...finished 24000/25235
22:10:59 -I-     ...finished 25000/25235
22:10:59 -I- Propagation loops
22:10:59 -E- error in propagate_all
22:10:59 -I- node is 1
22:10:59 -I- size of assignment scores is 0  0
22:10:59 -I- size of assignment ids is 80451     76
Index in position 1 exceeds array bounds.

Error in trgraph/solve>propagate_all (line 523)
                score = G.assignment_scores(assigned_nodes(i),idix(j));

Error in trgraph/solve (line 150)
propagate_all(G);

Error in solve_across_movies (line 72)
solve(G,false,true);

>> 

Dataset-2:

>> solve_across_movies(Trck, 3);
22:15:29 -I- solving graph from movies 25-36
22:15:29 -I- Loading trgraph from antrax/graphs/graph_25_25.mat
22:15:37 -I- Loading trgraph from antrax/graphs/graph_26_26.mat
22:15:39 -I- Loading trgraph from antrax/graphs/graph_27_27.mat
22:15:40 -I- Loading trgraph from antrax/graphs/graph_28_28.mat
22:15:42 -I- Loading trgraph from antrax/graphs/graph_29_29.mat
22:15:42 -I- Loading trgraph from antrax/graphs/graph_30_30.mat
22:15:43 -I- Loading trgraph from antrax/graphs/graph_31_31.mat
22:15:43 -I- Loading trgraph from antrax/graphs/graph_32_32.mat
22:15:43 -I- Loading trgraph from antrax/graphs/graph_33_33.mat
22:15:44 -I- Loading trgraph from antrax/graphs/graph_34_34.mat
22:15:45 -I- Loading trgraph from antrax/graphs/graph_35_35.mat
22:15:50 -I- Loading trgraph from antrax/graphs/graph_36_36.mat
22:15:56 -I- Finished loading trgraph with 17914 tracklets
22:15:57 -I- Loading ids
22:16:00 -I- Finding single ant nodes
22:16:00 -I- Some preperations
22:16:01 -I- Filtering out tracklets identified as non-ant
22:16:01 -I- ...8544 tracklets classified as no-ant were filtered
22:16:01 -I- ...6359 short, unconnected and unidentified tracklets were filtered
22:16:02 -I- Propagating ids from src tracklets
22:16:03 -I-     ...finished 1000/7421
22:16:03 -I-     ...finished 2000/7421
22:16:03 -I-     ...finished 3000/7421
22:16:03 -I-     ...finished 4000/7421
22:16:03 -I-     ...finished 5000/7421
22:16:03 -I-     ...finished 6000/7421
22:16:03 -I-     ...finished 7000/7421
22:16:03 -I- Propagation loops
22:16:03 -E- error in propagate_all
22:16:03 -I- node is 2
22:16:03 -I- size of assignment scores is 0  0
22:16:03 -I- size of assignment ids is 17914     76
Index in position 1 exceeds array bounds.

Error in trgraph/solve>propagate_all (line 523)
                score = G.assignment_scores(assigned_nodes(i),idix(j));

Error in trgraph/solve (line 150)
propagate_all(G);

Error in solve_across_movies (line 72)
solve(G,false,true);

>>

janamach commented 3 years ago

What does this error mean exactly? I am wondering if I am causing it by using the software not as intended...

In my assay, the ants are free to enter and leave the frame. They all have access to the nest, and inevitably some unmarked ants come in. I trained the classifier to recognize unmarked ants as any_ant just to separate them from the marked ants I am interested in. But any_ant is expected to be in one place at a time, while in reality there can be more than one unmarked ant in the frame. Hence, it's very possible that at the intersection between movies the data does not look as expected.

I guess if this is the problem, I should either classify unmarked ants as NoAnt to remove them from my data or skip step 2. What would you recommend?

asafgal commented 3 years ago

No, it was a real bug, and I was able to reproduce it.

There was a variable that was assigned during solve1 but cleared after solve2. So when you try to run solve2 again without running solve1 first, it complains. However, I think this small bug only masks the problem from above, which involves the same variable. So try running all solve steps again.
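The failure mode described here matches the logs above, where the assignment scores have size 0 0 while the assignment ids still have one row per tracklet. A toy sketch of that pattern, and of a guard that turns the out-of-bounds index into a readable message, could look like this. The `Graph` class and the two step functions are hypothetical stand-ins written for illustration; they are not the anTraX `trgraph` code.

```python
class Graph:
    """Toy stand-in for a tracklet graph; not the anTraX trgraph class."""

    def __init__(self, n_nodes, n_ids):
        self.n_nodes = n_nodes
        self.n_ids = n_ids
        self.assignment_scores = []  # empty until step 1 fills it


def solve_step1(graph):
    # Step 1 allocates one row of scores per node.
    graph.assignment_scores = [[0.0] * graph.n_ids for _ in range(graph.n_nodes)]


def solve_step2(graph):
    # Guard: fail with a clear message instead of an out-of-bounds index error.
    if len(graph.assignment_scores) != graph.n_nodes:
        raise RuntimeError("assignment_scores is empty or stale; rerun step 1 first")
    total = sum(row[0] for row in graph.assignment_scores)
    graph.assignment_scores = []  # cleared after use, as described above
    return total
```

Calling `solve_step2` twice in a row on the same graph raises the `RuntimeError`, which is the toy equivalent of rerunning solve2 without solve1.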

Your any_ant solution is fine but, as you understand, not ideal. Do you have good separation between this class and the Unknown class (a marked ant that cannot be identified because of posture, bad image, etc.)? If so, you can try defining the any_ant class in the "NoAnt" category. That sounds weird, but all it does is tell the algorithm that this is a category that cannot be individually tracked. I use it for larvae, food items, etc.

janamach commented 3 years ago

> So try running all solve steps again.

It worked locally with MATLAB with the dataset that has fewer tracklets! Will also test it on the other dataset on HPC, but it will probably take very long.

> Do you have good separation between this class and the Unknown class (marked ant that cannot be identified because of posture, bad image etc)?

It's probably not very good right now because I did not consider this at all when I was selecting the examples. I think I will stick with the any_ant class solution for these two datasets and try the alternative in the next experiments.

asafgal commented 3 years ago

Great, let me know.

> It's probably not very good right now because I did not consider this at all when I was selecting the examples. I think I will stick with the any_ant class solution for these two datasets and try the alternative in the next experiments.

Sounds good. You want to get to the point where the tracking pipeline works, and the performance can be easily estimated. You can then tune the pipeline accordingly, depending on what you actually need in order to do your science.

janamach commented 3 years ago

With the first dataset (many tracklets!) I had another problem during step 2: the graphs that did not show the error would fail because the job ran out of memory. 2 CPUs / 4 GB per CPU was plenty for the second dataset with fewer tracklets, while this one would fail with 10 GB per CPU. I increased it to 24 GB per CPU and it then ran for two days; I ended up cancelling the job because I wasn't sure whether it was stuck in an infinite loop. I tried running it locally too, and it did use a lot of memory (over 30 GB of RAM a couple of hours into the job on a machine with 64 GB of RAM).

Is that something you would expect with a dataset with a very high number of tracklets?

janamach commented 3 years ago

It's probably worth mentioning that one of the classify jobs from that "heavy" dataset takes about 60 hours to finish with 2 CPUs / 2 GB, and many others take more than a day. Hmmm....

asafgal commented 3 years ago

Why does your dataset have so many tracklets? What is your video duration and frame rate? I was under the impression that your tracking is expected to be very sparse. Do you have many detections of non-ant blobs?

Generally though, the classify step can benefit from more CPUs, especially if the average tracklet length is longer than a few seconds. The best way to choose the resources per job is to run one task locally and see what the typical CPU/memory consumption is.
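As a rough illustration of measuring one task locally, the sketch below runs a single command and reports its wall time, CPU time, and peak memory using Python's standard `resource` module. It is a generic helper, not an anTraX tool, and the command passed to it is up to you. It assumes a Unix-like system, since `resource` is not available on Windows.

```python
import resource
import subprocess
import time


def measure(cmd):
    """Run one command and report its wall time, CPU time and peak memory."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    wall = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    # ru_maxrss is in kilobytes on Linux (bytes on macOS) and is the largest
    # peak among all child processes waited for so far.
    return {"wall_s": wall, "cpu_s": cpu, "peak_rss": after.ru_maxrss}


# Hypothetical usage, with a placeholder command for one classify task:
# print(measure(["your_command", "arg1", "arg2"]))
```

A ratio of `cpu_s / wall_s` close to 1 suggests the task used a single core, while a larger ratio suggests it benefited from several. The peak memory figure gives a starting point for the per-job memory request.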

The solve step is usually not resource heavy, neither in CPU nor in memory. 30 GB is something I have never encountered. What is the typical file size in session/graphs/*.mat?

janamach commented 3 years ago

> What is your video duration and frame rate?

60 minutes, 12 videos per subdir, @ 25fps. Cataglyphis are fast...

> I was under the impression that your tracking is expected to be very sparse.

Not always.

> Do you have many detections of non-ant blobs?

Some, but not many. The vast majority of detections are ants.

> What is the typical file size in session/graphs/*.mat?

Highly variable, from 1 MB to 70+ MB.

I will write you an email later; it will make much more sense if I explain the experiment to you.
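To put the tracklet counts in context, the numbers given in this thread (60-minute videos at 25 fps, 12 videos per subdir, and the tracklet counts logged for movies 25-36) can be combined as below. This is plain arithmetic on figures quoted above. The "frames per tracklet" value is only a rough density measure, not a tracklet length, since several ants can be in the frame at once.

```python
FPS = 25
VIDEO_MINUTES = 60
VIDEOS_PER_SUBDIR = 12

frames_per_video = VIDEO_MINUTES * 60 * FPS                 # 90000
frames_per_subdir = frames_per_video * VIDEOS_PER_SUBDIR    # 1080000

# Tracklet counts for movies 25-36, taken from the logs earlier in this thread.
tracklets = {"Dataset-1": 80451, "Dataset-2": 17914}

for name, count in tracklets.items():
    per_video = count / VIDEOS_PER_SUBDIR
    frames_per_tracklet = frames_per_subdir / count
    print(f"{name}: about {per_video:.0f} tracklets per video, "
          f"one tracklet per {frames_per_tracklet:.1f} video frames")
```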

lizimai commented 2 years ago

The above was partially solved by re-running the track step for movies 25, 28, 35, and 36 on HPC. Step 2 showed the MATLAB:badsubscript error in all logs (5 graphs in total), but step 3 finished successfully and the missing mat/csv files have been generated.

$ cat matlab_solve_g_3.log 
09:26:30 -I- Reading video information from file
09:26:36 -I- solving graph from movies 25-36
09:26:36 -I- Loading trgraph from antrax/graphs/graph_25_25.mat
09:26:48 -I- Loading trgraph from antrax/graphs/graph_26_26.mat
09:26:52 -I- Loading trgraph from antrax/graphs/graph_27_27.mat
09:26:54 -I- Loading trgraph from antrax/graphs/graph_28_28.mat
09:26:57 -I- Loading trgraph from antrax/graphs/graph_29_29.mat
09:26:57 -I- Loading trgraph from antrax/graphs/graph_30_30.mat
09:26:58 -I- Loading trgraph from antrax/graphs/graph_31_31.mat
09:26:59 -I- Loading trgraph from antrax/graphs/graph_32_32.mat
09:26:59 -I- Loading trgraph from antrax/graphs/graph_33_33.mat
09:27:00 -I- Loading trgraph from antrax/graphs/graph_34_34.mat
09:27:01 -I- Loading trgraph from antrax/graphs/graph_35_35.mat
09:27:09 -I- Loading trgraph from antrax/graphs/graph_36_36.mat
09:27:19 -I- Finished loading trgraph with 17914 tracklets
09:27:21 -I- Loading ids
09:27:25 -I- Finding single ant nodes
09:27:26 -I- Some preperations
09:27:28 -I- Filtering out tracklets identified as non-ant
09:27:28 -I- ...8544 tracklets classified as no-ant were filtered
09:27:28 -I- ...6359 short, unconnected and unidentified tracklets were filtered
09:27:29 -I- Propagating ids from src tracklets
09:27:31 -I-     ...finished 1000/7421
09:27:31 -I-     ...finished 2000/7421
09:27:31 -I-     ...finished 3000/7421
09:27:31 -I-     ...finished 4000/7421
09:27:31 -I-     ...finished 5000/7421
09:27:31 -I-     ...finished 6000/7421
09:27:31 -I-     ...finished 7000/7421
09:27:31 -I- Propagation loops
Index in position 1 exceeds array bounds (must not exceed 14008).
Error in trgraph/solve>propagate_all (line 522)

Error in trgraph/solve (line 150)

Error in solve_across_movies (line 72)

Error in antrax_mcr_interface (line 53)
MATLAB:badsubscript

Hi @janamach, I encountered similar issues to yours, but on a much larger scale. I have 6 colonies per video, and in 211 out of 334 videos at least one colony failed at solve --step 1. I am wondering whether rerunning track is the only solution you have discovered so far?

janamach commented 2 years ago

Hi @lizimai . What happens when you run solve for one colony that failed during the solve step by adding the --clist option? What do the logs say for track and solve for the affected videos?

In my experience, many problems in the later steps are caused by issues during the track step, and re-running track can sometimes help. Another issue that I found on my side was corrupt video files: the track step would almost finish and then exit with an error. In such cases I would re-encode the videos (or trim them if possible). Unfortunately, I am not aware of alternative solutions that do not involve rerunning track.
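One common way to look for corrupt video files before re-running track is to let ffmpeg decode the whole file and collect whatever it reports as an error. The sketch below wraps that idiom. It assumes `ffmpeg` is installed and on the PATH, it is not part of anTraX, and the file name in the usage comment is a placeholder.

```python
import subprocess


def decode_errors(video_path, run=subprocess.run):
    """Decode the whole file with ffmpeg and return any error messages."""
    # "-f null -" discards the decoded output; "-v error" prints only errors.
    cmd = ["ffmpeg", "-v", "error", "-i", str(video_path), "-f", "null", "-"]
    result = run(cmd, capture_output=True, text=True)
    return [line for line in result.stderr.splitlines() if line.strip()]


# Hypothetical usage:
# for message in decode_errors("path/to/video.mp4"):
#     print(message)
```

An empty list suggests the file decodes cleanly. It does not guarantee that the track step will accept the file, so the track log remains the thing to check.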

P.S. are you using the latest anTraX version? The problems I described in this issue were solved in 417223f70bc66e16fd231b6294bfb976ccadfc75 as far as I remember...

asafgal commented 2 years ago

Zimai, Jana is right - the first thing you should do is make sure the tracking step finished ok for the failed cases by looking at the corresponding logs. Can you post the errors you see in a new issue thread? This one is closed and actually very convoluted with what turned out to be many small problems.

lizimai commented 2 years ago

Hi both, thanks for the suggestion and sorry for bringing up this thread again. I will redo the track and open a new thread.

asafgal commented 2 years ago

You don't need to rerun track for now: look at the logs of the track step for a video that failed in solve and see if it finished OK. If you see an error there, paste it in the new thread. If it finished successfully, paste the error you see in the solve step log.
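With several hundred videos, checking the logs by hand gets tedious. A small script along these lines could list the log files that contain error-looking lines. The patterns are taken from the log excerpts in this thread (`-E-` lines, `Error in ...`, `MATLAB:...` identifiers). The log directory and the `*.log` file pattern are assumptions to adapt to your own session layout.

```python
import re
from pathlib import Path

# Patterns based on the log excerpts in this thread: "-E-" marks an error
# line, and MATLAB failures show "Error in/using ..." or "MATLAB:<identifier>".
ERROR_PATTERN = re.compile(
    r" -E- |^Error (in|using) |^MATLAB:\w+|exceeds array bounds"
)


def find_errors(log_dir, pattern="*.log"):
    """Map each log file name to the error-looking lines found in it."""
    report = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        lines = path.read_text(errors="replace").splitlines()
        hits = [line.strip() for line in lines if ERROR_PATTERN.search(line.strip())]
        if hits:
            report[path.name] = hits
    return report


# Hypothetical usage; the directory is a placeholder:
# for name, hits in find_errors("path/to/logs").items():
#     print(name, hits[-1])
```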

On 21 Feb 2022, at 10:03, Zimai Li @.***> wrote:

> Hi both, thanks for the suggestion and sorry for bringing up this thread again. I will redo the track and open a new thread.