Social-Evolution-and-Behavior / anTraX

anTraX: high throughput tracking of color-tagged insects
https://antrax.readthedocs.io/
GNU General Public License v3.0

Running antrax on HPC / `MATLAB:badsubscript` and related errors during solve #20

Closed janamach closed 3 years ago

janamach commented 3 years ago

Hi :-)

The HPC server I am using has certain limits per user (100 scheduled jobs, 3 days max per job). The solve step in my case generated more than 100 jobs, causing some of the jobs to get cancelled. Since the solve step is a three-step process, I figured I can start each step manually, e.g.:

$ sbatch path/to/hpc_solve1.sh

Is this a reasonable way to do it? Is there a different way to get around the max-jobs-per-user limit?

asafgal commented 3 years ago

This is a weird way to manage priorities. Usually HPCs will limit the number of running jobs, or the resources used, and not the number of queued jobs.

Anyhow, you can run a single step of the solve by using --step 1 (or 2 or 3). Note that you have to wait for one step to fully finish before running the next one.
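The requirement that one step fully finish before the next is submitted can be sketched as a small driver around the generated job files. This is a minimal sketch and not part of anTraX: `submit`, `still_queued` and `run_in_order` are hypothetical helper names, and it assumes Slurm's `sbatch --parsable` and `squeue -h -j` are available on the login node. A job leaving the queue does not mean it succeeded, so the logs still need checking between steps.

```python
import subprocess
import time


def submit(jobfile):
    """Submit a Slurm job file and return the job id as a string."""
    # --parsable makes sbatch print "jobid" or "jobid;cluster" only.
    result = subprocess.run(
        ["sbatch", "--parsable", jobfile],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    )
    return result.stdout.strip().split(";")[0]


def still_queued(jobid):
    """Return True while squeue still lists the job (pending or running)."""
    # -h drops the header line, -j restricts the listing to this job id.
    result = subprocess.run(
        ["squeue", "-h", "-j", jobid],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True,
    )
    return result.returncode == 0 and bool(result.stdout.strip())


def run_in_order(jobfiles, submit=submit, still_queued=still_queued, poll_seconds=60):
    """Submit job files one at a time, waiting for each to leave the queue."""
    submitted = []
    for jobfile in jobfiles:
        jobid = submit(jobfile)
        submitted.append(jobid)
        while still_queued(jobid):
            time.sleep(poll_seconds)
    return submitted
```

Called with the three job files of a session (for example `hpc_solve1.sh`, `hpc_solve2.sh`, `hpc_solve3.sh` under the session's logs directory), this keeps only one step in the queue at a time, which also helps with a per-user limit on submitted jobs.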

Another way is to run by ‘graph’ number. A ‘graph’ in this context is a group of videos that are solved together. If you divided your experiment into a few such graphs, you can use the --glist option, which accepts an integer enumeration of these graphs. The default is to group by video subdirectories, but you can configure it as you like.

janamach commented 3 years ago

The HPC I am using is very easy to get access to, maybe its primary purpose is training new users. I asked if they can increase my queued job quota.

Anyhow, you can run a single step of the solve by using --step 1 (or 2 or 3). Note that you have to wait for one step to fully finish before running the next one.

I tried that, it somehow didn't work:

(antrax) [fr_jm1121@uc2n994 ~]$ antrax solve H1CN0304/ --hpc --step 1 --hpc-options partition=single,email=janajg@gmail.com,cpus=4,mem-per-cpu=4000,time=24:00:00

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

Jobfile created in H1CN0304/antrax/logs/hpc_solve1.sh

Job number 19452619 was submitted

Jobfile created in H1CN0304/antrax/logs/hpc_solve2.sh

Job number 19452620 was submitted

Jobfile created in H1CN0304/antrax/logs/hpc_solve3.sh

sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/bin/antrax", line 8, in <module>
    sys.exit(main())
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 651, in main
    """)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/sigtools/modifiers.py", line 158, in __call__
    return self.func(*args, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 363, in run
    ret = cli(*args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 262, in _cli
    return func('{0} {1}'.format(name, command), *args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 297, in solve
    jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=3)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/hpc.py", line 258, in antrax_hpc_job
    jid = submit_slurm_job_file(jobfile, waitfor=waitfor)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/hpc.py", line 80, in submit_slurm_job_file
    jid = out.split()[-1]
IndexError: list index out of range

If I add --dry, I get a different error:

$ antrax solve H1CN0304/ --step 2 --hpc --dry --hpc-options partition=single,email=janajg@gmail.com,cpus=4,mem-per-cpu=4000,time=24:00:00

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

Jobfile created in H1CN0304/antrax/logs/hpc_solve1.sh

Dry run, no job submitted.

Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/bin/antrax", line 8, in <module>
    sys.exit(main())
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 651, in main
    """)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/sigtools/modifiers.py", line 158, in __call__
    return self.func(*args, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 363, in run
    ret = cli(*args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 262, in _cli
    return func('{0} {1}'.format(name, command), *args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 293, in solve
    jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=1)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/hpc.py", line 265, in antrax_hpc_job
    return jid
UnboundLocalError: local variable 'jid' referenced before assignment

But the .sh file it generated has --step 1 in it, although I asked for --step 2

As far as I understand, sbatch path/to/hpc_solve1.sh in this case should be equivalent to starting the jobs through antrax interface with --step 1, is that right?

asafgal commented 3 years ago

You’re right, there is a small bug in the interface (when running in hpc mode, anTraX ignores the step option).

There is also a small bug with the dry option in the solve step. The two bugs together explain why you see --step 1 in the job file.
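Both tracebacks in this thread come from the job id not being available: `out.split()[-1]` fails when a rejected submission leaves nothing to split, and `jid` is never assigned on a dry run. A minimal sketch of a more defensive submission helper follows. It is not the anTraX code; `parse_job_id` and `submit_job_file` are hypothetical names, and it relies only on the standard `Submitted batch job <id>` line that `sbatch` prints on success.

```python
import re
import subprocess


def parse_job_id(sbatch_stdout):
    """Extract the job id from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise RuntimeError(
            "sbatch did not report a job id; output was: %r" % sbatch_stdout
        )
    return match.group(1)


def submit_job_file(jobfile, dry=False):
    """Submit jobfile with sbatch; return the job id, or None on a dry run."""
    if dry:
        # Always return a defined value, so callers never see an unbound name.
        print("Dry run, no job submitted.")
        return None
    result = subprocess.run(
        ["sbatch", jobfile],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True,
    )
    if result.returncode != 0:
        # Surface the scheduler's own message (e.g. a submit-limit violation).
        raise RuntimeError("sbatch failed: " + result.stderr.strip())
    return parse_job_id(result.stdout)
```

With this shape, a rejected submission raises an error that carries the scheduler's message instead of an `IndexError`, and a dry run returns `None`.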

Yes, you can submit the job file yourself, just update the step option.
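Updating the step option in a generated job file can be sketched as a text substitution. This is a minimal sketch under one assumption taken from the report above: the job file contains the option in the form `--step N`. The function name `set_step` is hypothetical, and the rest of the job file is left untouched.

```python
import re


def set_step(jobfile_text, step):
    """Return jobfile_text with every '--step N' replaced by '--step <step>'."""
    if step not in (1, 2, 3):
        raise ValueError("step must be 1, 2 or 3")
    new_text, count = re.subn(r"--step[ =]\d+", "--step %d" % step, jobfile_text)
    if count == 0:
        # Refuse to return an unchanged file silently.
        raise ValueError("no --step option found in job file")
    return new_text
```

The result can be written to a copy of the job file and submitted with `sbatch`, keeping the original for comparison.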

janamach commented 3 years ago

One of the 60 jobs in step 1 is failing consistently, while all other 59 finished successfully. The log says:

============================= JOB FEEDBACK =============================

NodeName=uc2n405
Job ID: 19453304
Array Job ID: 19453240_50
Cluster: uc2
User/Group: fr_jm1121/fr_fr
State: FAILED (exit code 1)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:01:04 core-walltime
Job Wall-clock time: 00:00:16
Memory Utilized: 1.02 MB
Memory Efficiency: 0.01% of 15.62 GB

What could be a possible reason? Is there a way to "rescue" this?

asafgal commented 3 years ago

Can you look at the corresponding anTraX-generated logs? These will be session/logs/hpc_solve1_50.log and session/logs/matlab_solve_m_50.log

janamach commented 3 years ago

The text above is from hpc_solve1_50.log, the corresponding matlab_solve_m_50.log has not been generated.

While looking at the matlab_solve_m_*.log files, I found more problems that were not reflected in hpc_solve1_*.log. I looked for logs that did not contain the word "Done" with:

$ grep -rHnoL "Done" matlab_solve_m*
matlab_solve_m_21.log
matlab_solve_m_25.log
matlab_solve_m_44.log
matlab_solve_m_54.log
matlab_solve_m_59.log
matlab_solve_m_60.log
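The same check can be scripted, which is handy when it has to be repeated after every step. A minimal sketch, assuming (as the grep above does) that a finished MATLAB log contains the word "Done"; `logs_without` is a hypothetical helper name.

```python
from pathlib import Path


def logs_without(logdir, marker="Done", pattern="matlab_solve_m*"):
    """Return sorted names of files matching pattern that lack marker."""
    missing = []
    for path in sorted(Path(logdir).glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes.
        if marker not in path.read_text(errors="replace"):
            missing.append(path.name)
    return missing
```

Run against the session's logs directory, it returns the same kind of list as `grep -L`, ready to be turned into a list of videos to rerun.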

All had the same UnrecognizedVarName error:

$ cat matlab_solve_m_59.log 
18:22:16 -I- Reading video information from file
18:22:20 -I- Loading trgraph from antrax/graphs/graph_59_59.mat
Error using tracklet/load_ids (line 746)
Unrecognized table variable name 'tracklet'.
Error in trgraph/load_ids (line 667)

Error in trgraph.load (line 891)

Error in trhandles/loaddata (line 607)

Error in solve_single_movie (line 52)

Error in antrax_mcr_interface (line 30)
MATLAB:table:UnrecognizedVarName

asafgal commented 3 years ago

The UnrecognizedVarName error seems to be caused by there being no classified tracklets in the video (check whether antrax/labels/autoids_59.csv is indeed empty). This is probably because you either didn't have any detections in those videos, or had only multi-ant detections. Either way, I'll need to patch this. I guess I never tested the software with such a sparse tracking problem. You might be able to ignore this issue for now and continue to the next steps, but it is also possible that the next steps will complain as well.
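Checking whether an autoids file is empty, or contains only unclassified tracklets, can be sketched with the standard `csv` module. This is a minimal sketch that assumes the header `tracklet,label,score,best_frame` seen in the autoids files quoted in this thread; `label_counts` is a hypothetical helper name.

```python
import csv
from collections import Counter


def label_counts(csv_path):
    """Count rows per label in an autoids CSV with a 'label' column."""
    with open(csv_path, newline="") as handle:
        # The delimiter is given explicitly rather than auto-detected.
        reader = csv.DictReader(handle, delimiter=",")
        return Counter(row["label"] for row in reader)
```

An empty result means the file has no data rows. A result whose only key is `Unknown` means no tracklet in that video was assigned an ID.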

As for the error in video #50, I'm not sure. It seems the crash happened before matlab was even started, which is weird. Can you verify that the data files exist? These should be:

antrax/graphs/graph_50_50.mat
antrax/tracklets/trdata_50_50.mat
antrax/images/images_50_50.mat
antrax/labels/autoids_50_50.mat

Also try to take a look in the logs of the previous steps, maybe there will be some clues there.

janamach commented 3 years ago

check to see if antrax/labels/autoids_59.csv is indeed empty

No, none of the ones that showed the UnrecognizedVarName error are empty, they look pretty normal to me:

$ head autoids_59.csv 
tracklet,label,score,best_frame
trj_id10_ti59_13365_tf59_13365,Unknown,0,0
trj_id10_ti59_13372_tf59_13372,GGY,0.9986485838890076,1
trj_id10_ti59_13373_tf59_13373,Unknown,0,0
trj_id10_ti59_13375_tf59_13375,Unknown,0,0

Can you verify that the data files exist? These should be:

antrax/graphs/graph_50_50.mat

Exists!

antrax/tracklets/trdata_50_50.mat

Did you mean trdata_50.mat? That exists.

antrax/images/images_50_50.mat

Did you mean images_50.mat? That exists too.

antrax/labels/autoids_50_50.mat

This doesn't exist. If you meant csv, then there is a file for each video.
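The existence check above can be scripted for any video index. A minimal sketch using the file names as they were found on disk in this reply (`graph_50_50.mat`, `trdata_50.mat`, `images_50.mat`, `autoids_50.csv`); whether these patterns hold for other session layouts is an assumption, and `missing_inputs` is a hypothetical helper name.

```python
from pathlib import Path


def missing_inputs(session_dir, movie):
    """Return the expected per-video files that are absent under session_dir."""
    m = int(movie)
    expected = [
        "graphs/graph_%d_%d.mat" % (m, m),
        "tracklets/trdata_%d.mat" % m,
        "images/images_%d.mat" % m,
        "labels/autoids_%d.csv" % m,
    ]
    root = Path(session_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Here `session_dir` would be the `antrax` directory inside the experiment folder; an empty list means all four files are present.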

asafgal commented 3 years ago

ok, weird.

This will need to be debugged on a local machine. Can you sync your data back?

Try to run solve step 1 for video 50 and see whether it crashes, and why.

For the other error, try loading the data in an interactive matlab session with:

Trck = trhandles(uigetdir);
G = Trck.loaddata(59);

janamach commented 3 years ago

To keep it simple, I will compare 59 (that failed above) to 58 (completed successfully).

Running solve with either MCR or MATLAB 2019a gives the MATLAB:table:UnrecognizedVarName error in the log, but not in terminal:

$ antrax solve --step 1 --movlist 59 H1CN0304/

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

07/04/21 16:14:39 -I- Starting 2 workers
07/04/21 16:14:39 -I- Started solve movie 59
07/04/21 16:14:39 -D- running matlab mcr 
07/04/21 16:14:39 -D- command is: /home/jana/src/anTraX/bin/antrax_glnxa64_mcr_interface solve_single_movie H1CN0304/ 59 trackingdirname antrax
07/04/21 16:14:39 -D- matlab app exited with code None
07/04/21 16:15:29 -I- Finished solve movie 59
07/04/21 16:15:29 -I- Workers closed

Log with MCR:

$ cat H1CN0304/antrax/logs/matlab_solve_m_59.log 
16:14:53 -D- initializing expreader object
16:14:53 -I- Reading video information from file
16:14:57 -I- Loading trgraph from antrax/graphs/graph_59_59.mat
Error using tracklet/load_ids (line 746)
Unrecognized table variable name 'tracklet'.
Error in trgraph/load_ids (line 667)

Error in trgraph.load (line 891)

Error in trhandles/loaddata (line 607)

Error in solve_single_movie (line 52)

Error in antrax_mcr_interface (line 30)
MATLAB:table:UnrecognizedVarName

Log with MATLAB:

$ cat H1CN0304/antrax/logs/matlab_solve_m_59.log
16:46:49 -D- initializing expreader object
16:46:50 -I- Reading video information from file
16:46:54 -I- Loading trgraph from antrax/graphs/graph_59_59.mat
Error using tracklet/load_ids (line 746)
Unrecognized table variable name 'tracklet'.

Error in trgraph/load_ids (line 675)
                G.trjs.load_ids;

Error in trgraph.load (line 899)
                G.load_ids;

Error in trhandles/loaddata (line 607)
                GS = trgraph.load(Trck,movlist);

Error in solve_single_movie (line 52)
G = Trck.loaddata(m,colony);

Doing the same with 58 gives the same output in terminal, but a different looking log:

$ antrax solve --step 1 --movlist 58 H1CN0304/

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

07/04/21 16:18:31 -I- Starting 2 workers
07/04/21 16:18:31 -I- Started solve movie 58
07/04/21 16:18:31 -D- running matlab mcr 
07/04/21 16:18:31 -D- command is: /home/jana/src/anTraX/bin/antrax_glnxa64_mcr_interface solve_single_movie H1CN0304/ 58 trackingdirname antrax
07/04/21 16:18:31 -D- matlab app exited with code None
07/04/21 16:42:02 -I- Finished solve movie 58
07/04/21 16:42:02 -I- Workers closed
$ cat H1CN0304/antrax/logs/matlab_solve_m_58.log
16:18:43 -D- initializing expreader object
16:18:43 -I- Reading video information from file
16:18:47 -I- Loading trgraph from antrax/graphs/graph_58_58.mat
16:19:41 -I- Finished loading trgraph with 16476 tracklets
16:19:42 -I- Loading ids
16:19:52 -I- Finding single ant nodes
16:19:54 -I- Some preperations
16:19:56 -I- Resetting graph id assigments
16:19:56 -I- Filtering out tracklets identified as non-ant
16:19:56 -I- ...18 tracklets classified as no-ant were filtered
16:19:56 -I- ...8727 short, unconnected and unidentified tracklets were filtered
16:19:56 -I- Propagating ids from src tracklets
16:19:59 -I-     ...finished 1000/3377
16:19:59 -I-     ...finished 2000/3377
16:19:59 -I-     ...finished 3000/3377
16:19:59 -I- Propagation loops

...

16:39:59 -I- ...working on any_ant
16:40:00 -I- ......found 288 cc's 
16:40:00 -I- ......filtered 1 cc's
16:40:02 -I- ......pruned 18 nodes
16:40:02 -I- Propagation loops
16:40:03 -I-     ...assigned 0 tracklets
16:40:03 -I- Biconnected components condition (positive)
16:40:09 -I-     ...assigned 0 tracklets
16:40:09 -I- Assigning ids to tracklets
16:40:09 -I- Saving
16:41:56 -G- Done

For the interactive matlab session (59 vs 58):

>> G = Trck.loaddata(59);
16:20:11 -I- Loading trgraph from antrax/graphs/graph_59_59.mat
Error using tracklet/load_ids (line 746)
Unrecognized table variable name 'tracklet'.

Error in trgraph/load_ids (line 675)
                G.trjs.load_ids;

Error in trgraph.load (line 899)
                G.load_ids;

Error in trhandles/loaddata (line 607)
                GS = trgraph.load(Trck,movlist);
>> G = Trck.loaddata(58);
16:23:17 -I- Loading trgraph from antrax/graphs/graph_58_58.mat
16:24:21 -I- Finished loading trgraph with 16476 tracklets

asafgal commented 3 years ago

In the matlab command line, try loading the problematic autoids file and display the generated table:

f = 'antrax/labels/autoids_59_59.csv';
T = readtable(f);
head(T)

Also, run locally solve on video 50, which had a different issue.

janamach commented 3 years ago

Hmmmm....

>> f = 'antrax/labels/autoids_59.csv';
>> T = readtable(f);
>> head(T)

ans =

  8×6 table

    Var1      Var2      Var3     Var4      Var5                   Var6              
    _____    ______    ______    _____    ______    ________________________________

    'trj'    'id10'    'ti59'    13365    'tf59'    '13365,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13372    'tf59'    '13372,GGY,0.9986485838890076,1'
    'trj'    'id10'    'ti59'    13373    'tf59'    '13373,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13375    'tf59'    '13375,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13381    'tf59'    '13381,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13385    'tf59'    '13385,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13391    'tf59'    '13391,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13393    'tf59'    '13393,Unknown,0,0'             

58 looks different:

>> f = 'antrax/labels/autoids_58.csv';
>> T = readtable(f);
>> head(T)

ans =

  8×4 table

                tracklet                  label       score     best_frame
    ________________________________    _________    _______    __________

    'trj_id10_ti58_10117_tf58_10117'    'GGY'        0.99987        1     
    'trj_id10_ti58_1139_tf58_1139'      'Unknown'          0        0     
    'trj_id10_ti58_1364_tf58_1364'      'Unknown'          0        0     
    'trj_id10_ti58_1372_tf58_1372'      'Unknown'          0        0     
    'trj_id10_ti58_1389_tf58_1389'      'Unknown'          0        0     
    'trj_id10_ti58_1395_tf58_1395'      'GGY'        0.99884        1     
    'trj_id10_ti58_1401_tf58_1401'      'Unknown'          0        0     
    'trj_id10_ti58_1405_tf58_1405'      'GGY'        0.99956        1     

Looks like underscores were turned into commas in 59... In bash these two files look very similar:

$ head autoids_59.csv 
tracklet,label,score,best_frame
trj_id10_ti59_13365_tf59_13365,Unknown,0,0
trj_id10_ti59_13372_tf59_13372,GGY,0.9986485838890076,1
trj_id10_ti59_13373_tf59_13373,Unknown,0,0
trj_id10_ti59_13375_tf59_13375,Unknown,0,0
trj_id10_ti59_13381_tf59_13381,Unknown,0,0
trj_id10_ti59_13385_tf59_13385,Unknown,0,0
trj_id10_ti59_13391_tf59_13391,Unknown,0,0
trj_id10_ti59_13393_tf59_13393,Unknown,0,0
trj_id10_ti59_13396_tf59_13396,Unknown,0,0

$ head autoids_58.csv 
tracklet,label,score,best_frame
trj_id10_ti58_10117_tf58_10117,GGY,0.9998655319213867,1
trj_id10_ti58_1139_tf58_1139,Unknown,0,0
trj_id10_ti58_1364_tf58_1364,Unknown,0,0
trj_id10_ti58_1372_tf58_1372,Unknown,0,0
trj_id10_ti58_1389_tf58_1389,Unknown,0,0
trj_id10_ti58_1395_tf58_1395,GGY,0.9988380074501038,1
trj_id10_ti58_1401_tf58_1401,Unknown,0,0
trj_id10_ti58_1405_tf58_1405,GGY,0.9995608925819397,1
trj_id10_ti58_1409_tf58_1409,Unknown,0,0
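One reading of the 8×6 table for video 59 is that the rows were split on underscores rather than commas, with everything after the last underscore kept as a single field. The sketch below only illustrates that reading on one row copied from the file; it does not explain why MATLAB would treat file 59 differently from file 58.

```python
import csv
import io

# One data row copied from autoids_59.csv as shown above.
ROW = "trj_id10_ti59_13372_tf59_13372,GGY,0.9986485838890076,1"

# Splitting on "_" gives six fields, the last one still containing commas.
by_underscore = ROW.split("_")

# Reading with an explicit comma delimiter gives the intended four fields.
by_comma = next(csv.reader(io.StringIO(ROW), delimiter=","))

print(len(by_underscore), by_underscore)
print(len(by_comma), by_comma)
```

The six underscore-separated fields line up with Var1 to Var6 in the table for 59, and the four comma-separated fields line up with the tracklet, label, score and best_frame columns in the table for 58.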

Also, run locally solve on video 50, which had a different issue.

Running. This one should take longer.

asafgal commented 3 years ago

That's odd. Try giving an explicit delimiter:

f = 'antrax/labels/autoids_59_59.csv';
T = readtable(f, 'Delimiter', ',');
head(T)

janamach commented 3 years ago

Forcing it worked:

>> f = 'antrax/labels/autoids_59.csv';
>> T = readtable(f);   
>> head(T)

ans =

  8x6 table

    Var1      Var2      Var3     Var4      Var5                   Var6              
    _____    ______    ______    _____    ______    ________________________________

    'trj'    'id10'    'ti59'    13365    'tf59'    '13365,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13372    'tf59'    '13372,GGY,0.9986485838890076,1'
    'trj'    'id10'    'ti59'    13373    'tf59'    '13373,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13375    'tf59'    '13375,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13381    'tf59'    '13381,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13385    'tf59'    '13385,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13391    'tf59'    '13391,Unknown,0,0'             
    'trj'    'id10'    'ti59'    13393    'tf59'    '13393,Unknown,0,0'             

>> T = readtable(f, 'Delimiter', ',');
>> head(T)

ans =

  8x4 table

                tracklet                  label       score     best_frame
    ________________________________    _________    _______    __________

    'trj_id10_ti59_13365_tf59_13365'    'Unknown'          0        0     
    'trj_id10_ti59_13372_tf59_13372'    'GGY'        0.99865        1     
    'trj_id10_ti59_13373_tf59_13373'    'Unknown'          0        0     
    'trj_id10_ti59_13375_tf59_13375'    'Unknown'          0        0     
    'trj_id10_ti59_13381_tf59_13381'    'Unknown'          0        0     
    'trj_id10_ti59_13385_tf59_13385'    'Unknown'          0        0     
    'trj_id10_ti59_13391_tf59_13391'    'Unknown'          0        0     
    'trj_id10_ti59_13393_tf59_13393'    'Unknown'          0        0     

asafgal commented 3 years ago

I have no explanation for this behavior...

Anyhow, I tried to patch the issue on the debug-jana branch; see if it works. It also fixes the other small issues we had in this thread and the previous one. I haven't tested it, so issues might pop up.

janamach commented 3 years ago

You are very efficient, thank you!

The readtable thing worked locally with $ antrax solve H1CN0304/ --step 1 --movlist 59:

Before pull:

$ cat matlab_solve_m_59.log 
08:56:02 -D- initializing expreader object
08:56:02 -I- Reading video information from file          
08:56:06 -I- Loading trgraph from antrax/graphs/graph_59_59.mat
Error using tracklet/load_ids (line 746)                                              
Unrecognized table variable name 'tracklet'.   
Error in trgraph/load_ids (line 667)  

Error in trgraph.load (line 891)

Error in trhandles/loaddata (line 607)  

Error in solve_single_movie (line 52)                                                 

Error in antrax_mcr_interface (line 30)  
MATLAB:table:UnrecognizedVarName        

After pull:

$ head matlab_solve_m_59.log 
08:57:32 -D- initializing expreader object
08:57:32 -I- Reading video information from file
08:57:36 -I- Loading trgraph from antrax/graphs/graph_59_59.mat
08:58:02 -I- Finished loading trgraph with 9369 tracklets
08:58:03 -I- Loading ids
08:58:06 -I- Finding single ant nodes
08:58:07 -I- Some preperations
08:58:08 -I- Looking for bottleneck pairs
08:58:09 -I- done distance mat
09:00:59 -I- Resetting graph id assigments

$ tail matlab_solve_m_59.log 
09:14:30 -I- ......found 359 cc's 
09:14:30 -I- ......filtered 0 cc's
09:14:32 -I- ......pruned 0 nodes
09:14:32 -I- Propagation loops
09:14:32 -I-     ...assigned 0 tracklets
09:14:32 -I- Biconnected components condition (positive)
09:14:35 -I-     ...assigned 0 tracklets
09:14:35 -I- Assigning ids to tracklets
09:14:35 -I- Saving
09:15:33 -G- Done

There's another twist: I ran the solve step on a local computer with MATLAB, and all files (including 50) were processed successfully and the xy csv files were generated for each video. It took more than a day to finish; I saw the result just now.

I am now processing another experiment on the HPC starting with tracking. I got to the solve step yesterday, but it failed as multiple jobs ran into the readtable weirdness. I will let you know how it goes :-)

janamach commented 3 years ago

Looks like https://github.com/Social-Evolution-and-Behavior/anTraX/commit/3ce63fdedd34dcc62adea5aafdc6329f370e026a worked: I ran solve for 90 videos and none of them ran into that strange readtable problem in step 1. The last one, 90, showed MATLAB:badsubscript as it barely had any tracklets, I hope it doesn't affect the further steps.

Commit https://github.com/Social-Evolution-and-Behavior/anTraX/commit/5f0cb61fdc27a470b7885a4c8b6364dee013b79b didn't seem to help though, the step option is still being ignored:

$ antrax solve CN0402/ --step 3 --hpc --hpc-options partition=single,cpus=2,mem-per-cpu=2000,time=72:00:00

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

Jobfile created in CN0402/antrax/logs/hpc_solve1.sh

Job number 19458033 was submitted

Jobfile created in CN0402/antrax/logs/hpc_solve2.sh

Job number 19458034 was submitted

Jobfile created in CN0402/antrax/logs/hpc_solve3.sh

sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/bin/antrax", line 8, in <module>
    sys.exit(main())
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 651, in main
    """)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/sigtools/modifiers.py", line 158, in __call__
    return self.func(*args, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 363, in run
    ret = cli(*args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 262, in _cli
    return func('{0} {1}'.format(name, command), *args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 297, in solve
    jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=3)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/hpc.py", line 258, in antrax_hpc_job
    jid = submit_slurm_job_file(jobfile, waitfor=waitfor)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/hpc.py", line 80, in submit_slurm_job_file
    jid = out.split()[-1]
IndexError: list index out of range

Also with --dry:

$ antrax solve CN0402/ --step 2 --dry --hpc --hpc-options partition=single,cpus=2,mem-per-cpu=2000,time=72:00:00

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

Jobfile created in CN0402/antrax/logs/hpc_solve1.sh

Dry run, no job submitted.

Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/bin/antrax", line 8, in <module>
    sys.exit(main())
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 651, in main
    """)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/sigtools/modifiers.py", line 158, in __call__
    return self.func(*args, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 363, in run
    ret = cli(*args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 262, in _cli
    return func('{0} {1}'.format(name, command), *args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 293, in solve
    jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=1)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/hpc.py", line 265, in antrax_hpc_job
    return jid
UnboundLocalError: local variable 'jid' referenced before assignment
asafgal commented 3 years ago

I fixed the dry run issue.
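
For readers hitting the same two tracebacks: both are consistent with a submit helper that assumes `sbatch` always prints a job ID and that a job is always submitted. The following is a minimal sketch of a more defensive version, not the actual code in anTraX's hpc.py. The names `parse_sbatch_job_id` and `submit_or_skip` are hypothetical, and it assumes sbatch's usual success message (`Submitted batch job <id>`) and that a rejected submission leaves stdout empty.

```python
from typing import Callable, Optional


def parse_sbatch_job_id(out: str) -> str:
    """Return the job ID from sbatch's stdout, or raise a readable error."""
    tokens = out.split()
    # An empty stdout is what makes `out.split()[-1]` raise IndexError.
    if not tokens or not tokens[-1].isdigit():
        raise RuntimeError("sbatch did not report a job ID; output was: %r" % out)
    return tokens[-1]


def submit_or_skip(jobfile: str, dry: bool,
                   run_sbatch: Callable[[str], str]) -> Optional[str]:
    """Submit `jobfile` unless `dry` is set; always return a defined value."""
    jid = None  # assigned up front so the dry-run path cannot hit UnboundLocalError
    if dry:
        print("Dry run, no job submitted.")
    else:
        # `run_sbatch` is a stand-in for whatever actually calls sbatch
        jid = parse_sbatch_job_id(run_sbatch(jobfile))
    return jid
```

With this shape, a dry run returns `None` instead of failing, and a submission rejected by the scheduler produces an error message that names the real problem.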

As for the single step run - can you verify that you are on the debug branch on the HPC? If you indeed are, can you paste here the "solve" function in the cli.py file?

janamach commented 3 years ago

Thank you for fixing all these things. I just finished processing the new experiment with 90 videos; I did not run into any serious errors, and the files in antdata were generated.

For the single step issue:

$ git branch
* debug-jana
  master

$ less antrax/cli.py
def solve(explist, *, glist: parse_movlist=None, movlist: parse_movlist=None, clist: parse_movlist=None, mcr=False,
          nw=2, hpc=False, hpc_options: parse_hpc_options={}, missing=False, session=None, dry=False, step=0):
    """Run propagation step"""

    explist = parse_explist(explist, session)
    mcr = mcr or ANTRAX_USE_MCR
    hpc = hpc or ANTRAX_HPC

    if hpc:

        for e in explist:

            eglist = glist if glist is not None else e.glist
            emlist = [e.ggroups[g - 1] for g in eglist]
            emlist = [m for grp in emlist for m in grp]

            hpc_options['dry'] = dry
            hpc_options['classifier'] = classifier
            hpc_options['missing'] = missing
            hpc_options['glist'] = eglist
            hpc_options['movlist'] = emlist

            if e.prmtrs['geometry_multi_colony']:
                eclist = clist if clist is not None else e.clist
                for c in eclist:
                    hpc_options['c'] = c
                    hpc_options['waitfor'] = None
                    if step == 0 or step == 1:
                        jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=1)
                        hpc_options['waitfor'] = jid
                    if step == 0 or step == 2:
                        jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=2)
                        hpc_options['waitfor'] = jid
                    if step == 0 or step == 3:
                        jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=3)
            else:
                hpc_options['c'] = None
                hpc_options['waitfor'] = None
                if step == 0 or step == 1:
                    jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=1)
                    hpc_options['waitfor'] = jid
                if step == 0 or step == 2:
                    jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=2)
                    hpc_options['waitfor'] = jid
                if step == 0 or step == 3:
                    jid = antrax_hpc_job(e, 'solve', opts=hpc_options, solve_step=3)
    else:

        Q = MatlabQueue(nw=nw, mcr=mcr)

        for e in explist:

            eglist = glist if glist is not None else e.glist
            eclist = clist if clist is not None else e.clist
            emlist = [e.ggroups[g - 1] for g in eglist]
            emlist = [m for grp in emlist for m in grp]
            if movlist is not None:
                emlist = [m for m in emlist if m in movlist]

            if step == 0 or step == 1:
                if e.prmtrs['geometry_multi_colony']:
                    for c in eclist:
                        for m in emlist:
                            w = {'fun': 'solve_single_movie'}
                            w['args'] = [e.expdir, m, 'trackingdirname', e.session, 'colony', c]
                            w['diary'] = join(e.logsdir, 'matlab_solve_m_' + str(m) + '_c_' + str(c) + '.log')
                            w['str'] = 'solve colony ' + str(c) + ' movie ' + str(m)
                            Q.put(w)
                else:
                    for m in emlist:
                        w = {'fun': 'solve_single_movie'}
                        w['args'] = [e.expdir, m, 'trackingdirname', e.session]
                        w['diary'] = join(e.logsdir, 'matlab_solve_m_' + str(m) + '.log')
                        w['str'] = 'solve movie ' + str(m)
                        Q.put(w)

                # wait for single movie tasks to complete
                Q.join()

            # stitch
            if step == 0 or step == 2:
                if e.prmtrs['geometry_multi_colony']:
                    for c in eclist:
                        for g in eglist:
                            w = {'fun': 'solve_across_movies'}
                            w['args'] = [e.expdir, g, 'trackingdirname', e.session, 'colony', c]
                            w['diary'] = join(e.logsdir, 'matlab_solve_g_' + str(g) + '_c_' + str(c) + '.log')
                            w['str'] = 'solve stitch colony ' + str(c) + ' graph ' + str(g)
                            Q.put(w)
                else:
                    for g in eglist:
                        w = {'fun': 'solve_across_movies'}
                        w['args'] = [e.expdir, g, 'trackingdirname', e.session]
                        w['diary'] = join(e.logsdir, 'matlab_solve_g_' + str(g) + '.log')
                        w['str'] = 'solve stitch graph ' + str(g)
                        Q.put(w)

                # wait for stitch to finish
                Q.join()

            if step == 0 or step == 3:
                if e.prmtrs['geometry_multi_colony']:
                    for c in eclist:
                        for m in emlist:
                            w = {'fun': 'export_single_movie'}
                            w['args'] = [e.expdir, m, 'trackingdirname', e.session, 'colony', c]
                            w['diary'] = join(e.logsdir, 'matlab_export_m_' + str(m) + '_c_' + str(c) + '.log')
                            w['str'] = 'export colony ' + str(c) + ' movie ' + str(m)
                            Q.put(w)
                else:
                    for m in emlist:
                        w = {'fun': 'export_single_movie'}
                        w['args'] = [e.expdir, m, 'trackingdirname', e.session]
                        w['diary'] = join(e.logsdir, 'matlab_export_m_' + str(m) + '.log')
                        w['str'] = 'export movie ' + str(m)
                        Q.put(w)

                # wait for stitch to finish
                Q.join()

        # close
        Q.stop_workers()
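
The three `if step == 0 or step == N` blocks in the HPC branch of the function above repeat one pattern: submit the step, then make the next submitted step wait for it. A compact sketch of that chaining is given here for illustration only; it is not the project's code. `submit_steps` is a hypothetical name, and the `submit` callback stands in for `antrax_hpc_job`.

```python
from typing import Callable, List, Optional


def submit_steps(step: int,
                 submit: Callable[[int, Optional[str]], Optional[str]]
                 ) -> List[Optional[str]]:
    """Submit solve steps 1..3 (all of them when step == 0).

    Each submitted step is told to wait for the previously submitted one.
    """
    waitfor = None
    job_ids = []
    for s in (1, 2, 3):
        if step in (0, s):
            jid = submit(s, waitfor)
            job_ids.append(jid)
            waitfor = jid  # the next submitted step depends on this job
    return job_ids
```

When a single step is requested, nothing is chained, which matches the advice in this thread to wait for one step to finish before starting the next by hand.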
janamach commented 3 years ago

P.S. All this was now done on HPC

janamach commented 3 years ago

Unfortunately, there are more issues with that dataset, even though it seemed to complete successfully.

  1. Some csv files have not been generated even though the videos were not empty. For all of the missing csv files, the corresponding matlab_export_m_*.log showed a MATLAB:badsubscript error:
$ grep -rHno "MATLAB:badsubscript" matlab_export_m_* | sort
matlab_export_m_16.log:11:MATLAB:badsubscript
matlab_export_m_30.log:11:MATLAB:badsubscript
matlab_export_m_45.log:11:MATLAB:badsubscript
matlab_export_m_48.log:11:MATLAB:badsubscript
matlab_export_m_49.log:11:MATLAB:badsubscript
matlab_export_m_52.log:11:MATLAB:badsubscript
matlab_export_m_59.log:11:MATLAB:badsubscript
matlab_export_m_62.log:11:MATLAB:badsubscript
matlab_export_m_64.log:11:MATLAB:badsubscript
matlab_export_m_65.log:11:MATLAB:badsubscript
matlab_export_m_66.log:11:MATLAB:badsubscript
matlab_export_m_68.log:11:MATLAB:badsubscript
matlab_export_m_70.log:11:MATLAB:badsubscript
matlab_export_m_71.log:11:MATLAB:badsubscript
matlab_export_m_72.log:11:MATLAB:badsubscript
matlab_export_m_78.log:11:MATLAB:badsubscript
matlab_export_m_80.log:11:MATLAB:badsubscript
matlab_export_m_81.log:11:MATLAB:badsubscript
matlab_export_m_82.log:11:MATLAB:badsubscript
matlab_export_m_83.log:11:MATLAB:badsubscript
matlab_export_m_85.log:11:MATLAB:badsubscript
matlab_export_m_90.log:11:MATLAB:badsubscript

$ for i in {1..90}; do if [ -f ../antdata/xy_${i}_${i}.csv ]; then : ; else echo "Missing: ${i}" ; fi; done
Missing: 16
Missing: 30
Missing: 45
Missing: 48
Missing: 49
Missing: 52
Missing: 59
Missing: 62
Missing: 64
Missing: 65
Missing: 66
Missing: 68
Missing: 70
Missing: 71
Missing: 72
Missing: 78
Missing: 80
Missing: 81
Missing: 82
Missing: 83
Missing: 85
Missing: 90

Possibly related: MATLAB:UndefinedFunction and MATLAB:badsubscript errors popped up throughout the whole process:

$ grep -rHno "MATLAB:UndefinedFunction" | sort
matlab_solve_m_16.log:17:MATLAB:UndefinedFunction
matlab_solve_m_45.log:17:MATLAB:UndefinedFunction
matlab_solve_m_48.log:17:MATLAB:UndefinedFunction
matlab_solve_m_49.log:17:MATLAB:UndefinedFunction
matlab_solve_m_52.log:17:MATLAB:UndefinedFunction
matlab_solve_m_59.log:17:MATLAB:UndefinedFunction
matlab_solve_m_62.log:17:MATLAB:UndefinedFunction
matlab_solve_m_65.log:17:MATLAB:UndefinedFunction
matlab_solve_m_68.log:17:MATLAB:UndefinedFunction
matlab_solve_m_70.log:17:MATLAB:UndefinedFunction
matlab_solve_m_71.log:17:MATLAB:UndefinedFunction
matlab_solve_m_78.log:17:MATLAB:UndefinedFunction
matlab_solve_m_80.log:17:MATLAB:UndefinedFunction
matlab_solve_m_81.log:17:MATLAB:UndefinedFunction
matlab_solve_m_85.log:17:MATLAB:UndefinedFunction
matlab_track_m_16.log:77:MATLAB:UndefinedFunction
matlab_track_m_45.log:77:MATLAB:UndefinedFunction
matlab_track_m_48.log:86:MATLAB:UndefinedFunction
matlab_track_m_49.log:77:MATLAB:UndefinedFunction
matlab_track_m_52.log:77:MATLAB:UndefinedFunction
matlab_track_m_59.log:77:MATLAB:UndefinedFunction
matlab_track_m_62.log:77:MATLAB:UndefinedFunction
matlab_track_m_65.log:77:MATLAB:UndefinedFunction
matlab_track_m_68.log:77:MATLAB:UndefinedFunction
matlab_track_m_70.log:77:MATLAB:UndefinedFunction
matlab_track_m_71.log:77:MATLAB:UndefinedFunction
matlab_track_m_78.log:77:MATLAB:UndefinedFunction
matlab_track_m_80.log:77:MATLAB:UndefinedFunction
matlab_track_m_81.log:77:MATLAB:UndefinedFunction
matlab_track_m_85.log:77:MATLAB:UndefinedFunction

$ grep -rHno "MATLAB:badsubscript" | sort
matlab_export_m_16.log:11:MATLAB:badsubscript
matlab_export_m_30.log:11:MATLAB:badsubscript
matlab_export_m_45.log:11:MATLAB:badsubscript
matlab_export_m_48.log:11:MATLAB:badsubscript
matlab_export_m_49.log:11:MATLAB:badsubscript
matlab_export_m_52.log:11:MATLAB:badsubscript
matlab_export_m_59.log:11:MATLAB:badsubscript
matlab_export_m_62.log:11:MATLAB:badsubscript
matlab_export_m_64.log:11:MATLAB:badsubscript
matlab_export_m_65.log:11:MATLAB:badsubscript
matlab_export_m_66.log:11:MATLAB:badsubscript
matlab_export_m_68.log:11:MATLAB:badsubscript
matlab_export_m_70.log:11:MATLAB:badsubscript
matlab_export_m_71.log:11:MATLAB:badsubscript
matlab_export_m_72.log:11:MATLAB:badsubscript
matlab_export_m_78.log:11:MATLAB:badsubscript
matlab_export_m_80.log:11:MATLAB:badsubscript
matlab_export_m_81.log:11:MATLAB:badsubscript
matlab_export_m_82.log:11:MATLAB:badsubscript
matlab_export_m_83.log:11:MATLAB:badsubscript
matlab_export_m_85.log:11:MATLAB:badsubscript
matlab_export_m_90.log:11:MATLAB:badsubscript
matlab_solve_g_2.log:37:MATLAB:badsubscript
matlab_solve_g_3.log:37:MATLAB:badsubscript
matlab_solve_g_4.log:37:MATLAB:badsubscript
matlab_solve_g_5.log:37:MATLAB:badsubscript
matlab_solve_m_30.log:25:MATLAB:badsubscript
matlab_solve_m_64.log:25:MATLAB:badsubscript
matlab_solve_m_66.log:25:MATLAB:badsubscript
matlab_solve_m_72.log:25:MATLAB:badsubscript
matlab_solve_m_82.log:26:MATLAB:badsubscript
matlab_solve_m_83.log:25:MATLAB:badsubscript
matlab_solve_m_90.log:25:MATLAB:badsubscript
  2. The second problem is that validate does not work on this dataset, but works on other datasets. I tried it with both MCR and MATLAB, and on both the debug-jana and master branches:
$ antrax validate CN0402/

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

11:43:15 -D- initializing expreader object
11:43:15 -I- Reading video information from file
Subscripted assignment between dissimilar structures.

Error in trhandles/loadxy (line 514)
                    xy(i) = load([xydir,xyfiles{i}]);

Error in validate_tracking/set_experiment (line 266)
            [app.XY,frames] = app.Trck.loadxy('movlist',app.ti.m:app.tf.m,'type',app.type);

Error in validate_tracking/startupFcn (line 441)
            set_experiment(app, Trck, p.Results.session)

Error in validate_tracking (line 659)
            runStartupFcn(app, @(app)startupFcn(app, varargin{:}))

Traceback (most recent call last):
  File "/home/jana/anaconda3/envs/antrax/bin/antrax", line 8, in <module>
    sys.exit(main())
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 651, in main
    """)
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/sigtools/modifiers.py", line 158, in __call__
    return self.func(*args, **kwargs)
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 363, in run
    ret = cli(*args)
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 262, in _cli
    return func('{0} {1}'.format(name, command), *args)
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 149, in validate
    launch_matlab_app('validate_tracking', args, mcr=mcr)
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/matlab.py", line 204, in launch_matlab_app
    app = eval('eng.' + appname + '(' + ','.join([str(a) for a in args]) + ')')
  File "<string>", line 1, in <module>
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/matlabengine.py", line 71, in __call__
    _stderr, feval=True).result()
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/futureresult.py", line 67, in result
    return self.__future.result(timeout)
  File "/home/jana/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/fevalfuture.py", line 82, in result
    self._result = pythonengine.getFEvalResult(self._future,self._nargout, None, out=self._out, err=self._err)
matlab.engine.MatlabExecutionError: 
  File /home/jana/src/anTraX/matlab/@trhandles/trhandles.m, line 514, in trhandles.loadxy

  File /home/jana/src/anTraX/matlab/apps/validate_tracking.mlapp, line 266, in validate_tracking.set_experiment

  File /home/jana/src/anTraX/matlab/apps/validate_tracking.mlapp, line 441, in validate_tracking.startupFcn

  File /home/jana/src/anTraX/matlab/apps/validate_tracking.mlapp, line 659, in validate_tracking.validate_tracking
Subscripted assignment between dissimilar structures.

$ antrax validate CN0402/

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

09/04/21 11:41:37 -D- running matlab mcr 
09/04/21 11:41:37 -D- command is: /home/jana/src/anTraX/bin/antrax_glnxa64_mcr_interface validate_tracking CN0402/
11:41:46 -D- initializing expreader object
11:41:46 -I- Reading video information from file
Subscripted assignment between dissimilar structures.
Error in trhandles/loadxy (line 514)

Error in validate_tracking/set_experiment (line 254)

Error in validate_tracking/startupFcn (line 429)

Error in appdesigner.internal.service.AppManagementService/tryCallback (line 336)

Error in matlab.apps.AppBase/runStartupFcn (line 41)

Error in validate_tracking (line 640)

Error in antrax_mcr_interface (line 20)
MATLAB:heterogeneousStrucAssignment
09/04/21 11:41:55 -D- matlab app exited with code 249

Maybe it's trying to load a non-existing file? The error inside of one of those logs looks like this:

$ cat matlab_export_m_70.log 
09:23:13 -I- Reading video information from file
09:23:17 -I- Loading trgraph from antrax/graphs/graph_70_70.mat
09:23:18 -I- Finished loading trgraph with 200 tracklets
09:23:18 -I- Loading tracklet data for movie 70
Index in position 2 exceeds array bounds.
Error in trgraph/export_xy (line 82)

Error in export_single_movie (line 52)

Error in antrax_mcr_interface (line 42)
MATLAB:badsubscript

Loading extract-trainset worked and it showed that most blobs were identified as RBR, which is wrong. Could that have contributed to the export error?

asafgal commented 3 years ago

The validate command fails because there is something wrong with the xy files, so let's try to figure that one out first.

The extract-trainset command shows you the results of the blob classifier, so if it is completely off, you should try and understand why. However, it should not cause any program crash downstream, just very bad tracking results.

The error in the export log suggests that the solve step fails on that video. Can you see if there is something weird in the corresponding solve logs?

asafgal commented 3 years ago

btw, the single step solve on hpc works properly for me. Did you remember to do pip install (this needs to be done for python code changes, but not for matlab code)?

janamach commented 3 years ago

Did you remember to do pip install

Oops :-(

The error in the export log suggests that the solve step fails on that video. Can you see if there is something weird in the corresponding solve logs?

The problems seem to start at step 1 of solve. E.g.:

$ cat matlab_solve_m_16.log 
07:36:14 -I- Reading video information from file
07:36:22 -I- Loading trgraph from antrax/graphs/graph_16_16.mat
07:36:23 -I- Finished loading trgraph with 166 tracklets
07:36:23 -I- Loading ids
07:36:23 -I- Finding single ant nodes
07:36:23 -I- Some preperations
07:36:23 -I- Looking for bottleneck pairs
07:36:23 -I- done distance mat
Undefined function or variable 'pairs'.
Error in trgraph/get_bottleneck_pairs (line 523)

Error in trgraph/solve (line 28)

Error in solve_single_movie (line 54)

Error in antrax_mcr_interface (line 30)
MATLAB:UndefinedFunction

$ cat matlab_solve_m_30.log 
07:36:14 -I- Reading video information from file
07:36:22 -I- Loading trgraph from antrax/graphs/graph_30_30.mat
07:36:24 -I- Finished loading trgraph with 374 tracklets
07:36:24 -I- Loading ids
07:36:25 -I- Finding single ant nodes
07:36:25 -I- Some preperations
07:36:25 -I- Looking for bottleneck pairs
07:36:25 -I- done distance mat
07:36:25 -I- Resetting graph id assigments
07:36:25 -I- Filtering out tracklets identified as non-ant
07:36:25 -I- ...0 tracklets classified as no-ant were filtered
07:36:25 -I- ...7 short, unconnected and unidentified tracklets were filtered
07:36:25 -I- Propagating ids from src tracklets
07:36:26 -I- Propagation loops
07:36:26 -I-     ...assigned 0 tracklets
07:36:26 -I- Biconnected components condition (positive)
Index in position 2 exceeds array bounds.
Error in trgraph/solve>propagate_all (line 536)

Error in trgraph/solve (line 150)

Error in solve_single_movie (line 54)

Error in antrax_mcr_interface (line 30)
MATLAB:badsubscript

In this case, 16 had the MATLAB:UndefinedFunction during tracking, while 30 finished properly. The classify step finished normally in both cases.

janamach commented 3 years ago

All files that experienced MATLAB:UndefinedFunction during track also failed during solve; maybe the fix in #17 will help. Other ones (like 30, see above) had a different error during solve -- MATLAB:badsubscript.
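
The grep comparison above can also be done in one pass. The sketch below is a hypothetical helper (`errors_by_movie` is not part of anTraX) that assumes the single-colony log naming seen in this thread, `matlab_<stage>_m_<movie>.log`, and collects the `MATLAB:` error identifiers found in each movie's logs, so that track, solve and export failures can be compared per movie.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, Set, Tuple

# Assumed log naming, taken from the file names shown in this thread.
LOG_NAME = re.compile(r"matlab_(?P<stage>[a-z]+)_m_(?P<movie>\d+)\.log$")
# Error identifiers such as MATLAB:badsubscript or MATLAB:load:unableToReadMatFile.
ERROR_ID = re.compile(r"\bMATLAB:\w+(?::\w+)*")


def errors_by_movie(logsdir: str) -> Dict[int, Set[Tuple[str, str]]]:
    """Map movie index -> set of (stage, error identifier) found in its logs."""
    found = defaultdict(set)
    for log in sorted(Path(logsdir).glob("matlab_*_m_*.log")):
        match = LOG_NAME.search(log.name)
        if match is None:
            continue  # e.g. multi-colony logs, which use a longer name
        text = log.read_text(errors="replace")
        for error_id in ERROR_ID.findall(text):
            found[int(match.group("movie"))].add((match.group("stage"), error_id))
    return dict(found)
```

Movies whose logs contain no error identifier do not appear in the result, so the returned keys are the movies worth inspecting.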

asafgal commented 3 years ago

Yes, all these errors seem related to the degenerate graph case. Let me know how the latest version does.

About the pip install, I like to use pip install -e <path> for packages under development, as it creates a link to the working directory of the package instead of copying the files, so you don't need to install again after every change or branch switch.
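
A quick way to check which copy of the code Python will actually run, for example after switching branches, is to ask the import system where the package is loaded from. This sketch uses only the standard library. `package_location` is a hypothetical helper, and the interpretation (a path inside the git checkout suggests an editable install, a path under `site-packages` suggests a copied one) is a rule of thumb rather than a guarantee.

```python
import importlib.util
from typing import Optional


def package_location(name: str) -> Optional[str]:
    """Return the file a top-level package would be imported from.

    Returns None if the package is not installed in this environment.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin
```

For example, `print(package_location("antrax"))` run inside the antrax conda environment shows whether the installed package points at the working copy or at a copy under `site-packages`.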

janamach commented 3 years ago

Thank you for the pip tip, I was unaware of it :-) The solve thing with the --step option works for me now too, thank you for fixing it!

I got to the solve step with the problematic datasets, here's what I got:

$ cat matlab_solve_m_52.log 
21:33:26 -I- Reading video information from file
21:33:32 -I- Loading trgraph from antrax/graphs/graph_52_52.mat
21:34:03 -I- Finished loading trgraph with 11734 tracklets
21:34:04 -I- Loading ids
21:34:09 -I- Finding single ant nodes
21:34:09 -I- Some preperations
21:34:10 -I- Looking for bottleneck pairs
21:34:13 -I- done distance mat
21:34:13 -I- Resetting graph id assigments
21:34:13 -I- Filtering out tracklets identified as non-ant
21:34:13 -I- ...10530 tracklets classified as no-ant were filtered
21:34:13 -I- ...2013 short, unconnected and unidentified tracklets were filtered
21:34:14 -I- Propagating ids from src tracklets
21:34:14 -I- Propagation loops
21:34:14 -I-     ...assigned 0 tracklets
21:34:14 -I- Biconnected components condition (positive)
Index in position 2 exceeds array bounds.
Error in trgraph/solve>propagate_all (line 536)

Error in trgraph/solve (line 150)

Error in solve_single_movie (line 54)

Error in antrax_mcr_interface (line 30)
MATLAB:badsubscript

This dataset has 90 videos of 40 min each. I am processing another dataset that has 60 videos of one hour each; that one takes longer to process, and I haven't reached the solve step yet. If that dataset gets through the solve step properly, I will re-slice the videos for this experiment. I will also run the solve step overnight with MATLAB on a local machine to see if this error only occurs with MCR.

janamach commented 3 years ago

On a local machine with MATLAB, solve also failed at the same spots. The error looks like this:

$ cat matlab_solve_m_52.log                                                                                          
22:31:17 -D- initializing expreader object
22:31:17 -I- Reading video information from file
22:31:19 -I- Loading trgraph from antrax/graphs/graph_52_52.mat
22:31:48 -I- Finished loading trgraph with 11734 tracklets
22:31:48 -I- Loading ids
22:31:52 -I- Finding single ant nodes
22:31:53 -I- Some preperations
22:31:53 -I- Looking for bottleneck pairs
22:31:55 -I- done distance mat
22:31:55 -I- Resetting graph id assigments
22:31:55 -I- Filtering out tracklets identified as non-ant
22:31:55 -I- ...10530 tracklets classified as no-ant were filtered
22:31:55 -I- ...2013 short, unconnected and unidentified tracklets were filtered
22:31:55 -I- Propagating ids from src tracklets
22:31:56 -I- Propagation loops
22:31:56 -I-     ...assigned 0 tracklets
22:31:56 -I- Biconnected components condition (positive)
Index in position 2 exceeds array bounds.

Error in trgraph/solve>propagate_all (line 536)
G.pairs = G.pairs(argsort(G.pairs(:,3)),:);

Error in trgraph/solve (line 150)
propagate_all(G);

Error in solve_single_movie (line 54)
solve(G,false,false);

I guess the dataset is not good then?

janamach commented 3 years ago

Hi,

Is --movlist supposed to work during the solve step 1 on HPC? It seems to be ignored:

$ antrax solve H1CN0304/ --step 1 --movlist 50 --hpc --hpc-options partition=single,,cpus=3,mem-per-cpu=3000,time=72:00:00

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

Jobfile created in H1CN0304/antrax/logs/hpc_solve1.sh

Job number 19464706 was submitted

$ squeue -l
Tue Apr 13 11:29:34 2021
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON) 
19464706_[20-60%60    single slv1:H1C fr_jm112  PENDING       0:00 3-00:00:00      1 (Resources) 
        19464706_1    single slv1:H1C fr_jm112  RUNNING       0:02 3-00:00:00      1 uc2n421 
        19464706_2    single slv1:H1C fr_jm112  RUNNING       0:02 3-00:00:00      1 uc2n421 
        19464706_3    single slv1:H1C fr_jm112  RUNNING       0:02 3-00:00:00      1 uc2n370 
        [...]
asafgal commented 3 years ago

Once again you are right - I fixed the movlist issue.

Also made a new fix to the MATLAB:badsubscript issue. Try it now... It's a program bug, not an issue with your dataset. I just need to catch all the spots that reference the problematic variable. It's hard without being able to replicate the error on my side.

janamach commented 3 years ago

Thank you for fixing these things :-) I am never really sure if I am right about anything.

Also made a new fix to the MATLAB:badsubscript issue. Try it now... It's a program bug, not an issue with your dataset.

I don't seem to be getting these errors with a different dataset... Or did you fix this days ago? I changed the dataset that was causing all these problems by re-slicing the videos into 1 hour pieces. I also finally figured out that I need to use a far larger number of epochs during the training step than the default 5. In my case I need more than 20 (45 seems like a good number when running from scratch on a good set of examples) to get loss and accuracy values closer to 0.5 and 0.95, respectively.

And what does --missing do in the solve context?

$ antrax solve --help

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

Usage: antrax solve [OPTIONS] explist

Run propagation step

Arguments:
  explist

Options:
  --clist=PARSE_MOVLIST
  --dry
  --glist=PARSE_MOVLIST
  --hpc
  --hpc-options=PARSE_HPC_OPTIONS    (default: {})
  --mcr
  --missing
  --movlist=PARSE_MOVLIST
  --nw=INT                           (default: 2)
  --session=STR
  --step=INT                         (default: 0)

Other actions:
  -h, --help                        Show the help

I had some jobs fail because I did not allocate enough memory for them. And some jobs seem to fail repeatedly for no obvious reason, but that can be fixed if I remove the hpc_solve1_*.log for that job. Weird.

janamach commented 3 years ago

Once again you are right - I fixed the movlist issue.

Works beautifully!

$ antrax solve JS16/ --step 1 --movlist 2-4 --dry  --hpc --hpc-options partition=single,cpus=3,mem-per-cpu=3000,time=72:00:00

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

Jobfile created in JS16/antrax_demo/logs/hpc_solve1.sh

Dry run, no job submitted.

$ cat JS16/antrax_demo/logs/hpc_solve1.sh
#!/bin/bash
#SBATCH --job-name=slv1:JS16
#SBATCH --output=JS16/antrax_demo/logs/hpc_solve1_%a.log
#SBATCH --partition=single
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --time=72:00:00
#SBATCH --mem-per-cpu=3000
#SBATCH --array=2-4%3
#SBATCH --mail-type=ALL
#SBATCH --mail-user=None

srun -N1 antrax solve JS16/ --session antrax_demo --movlist $SLURM_ARRAY_TASK_ID  --nw 1  --step 1 --mcr

I used pip install -e ., very handy. Incidentally, it also doesn't prompt the strange HPC permission error I described in #13 as it did with plain pip install ..
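
The `--array=2-4%3` line in the job file is how the movie list reaches SLURM: the part before `%` lists the array task indices, and the number after it caps how many array tasks run at once. As a rough illustration of that mapping (the helper name `slurm_array_spec` is made up for this example and is not anTraX's code), a movie list can be compressed into such a spec like this:

```python
from typing import Iterable, Optional


def slurm_array_spec(indices: Iterable[int],
                     max_parallel: Optional[int] = None) -> str:
    """Build a value for `#SBATCH --array=` from a collection of task indices."""
    ordered = sorted(set(indices))
    if not ordered:
        raise ValueError("at least one task index is required")

    def fmt(start: int, end: int) -> str:
        # A run of length one is written as a single index, otherwise as start-end.
        return str(start) if start == end else "%d-%d" % (start, end)

    parts = []
    start = prev = ordered[0]
    for i in ordered[1:]:
        if i == prev + 1:
            prev = i  # still inside a consecutive run
            continue
        parts.append(fmt(start, prev))
        start = prev = i
    parts.append(fmt(start, prev))

    spec = ",".join(parts)
    if max_parallel is not None:
        spec += "%%%d" % max_parallel  # e.g. "%3": at most 3 tasks at a time
    return spec
```

Calling `slurm_array_spec([2, 3, 4], 3)` gives the same string as in the job file above.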

asafgal commented 3 years ago

Using --missing with solve will run solve on videos that do not have an xy file, which is the only output file of the step. It is useful if some jobs failed, and you want to run only those. If you don't specify the step, it will run step 1 on the missing videos, then step 2 on all graphs, and then step 3 again on the missing videos.
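
The selection described here can be approximated outside anTraX in the same way as the shell loop used earlier in this thread. The sketch below assumes one `xy_<m>_<m>.csv` file per movie in the `antdata` directory, as in that loop. `missing_movies` is a hypothetical helper, not the anTraX implementation of `--missing`.

```python
from pathlib import Path
from typing import Iterable, List


def missing_movies(antdata_dir: str, movies: Iterable[int]) -> List[int]:
    """Return the movie indices that have no xy_<m>_<m>.csv file in antdata_dir."""
    root = Path(antdata_dir)
    return [m for m in movies
            if not (root / ("xy_%d_%d.csv" % (m, m))).is_file()]
```

The returned list can be used to decide which movies to pass to `--movlist` when rerunning a step by hand.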

The MATLAB:badsubscript happens on a very specific case, where the program did not find any topologically equivalent node pairs (see the paper) in the video. I never encountered such a case in my experiments, so it is very likely that you see it only in this specific dataset. Anyhow it is a good idea to patch it, even if you found a workaround, so let me know if it happens again. The fix was in my last commit, not days ago.

Regarding the classifier: definitely! Usually 50-100 epochs are needed, depending on the complexity of the problem (number of classes, image resolution, etc.). I usually recommend aiming for at least 0.95 accuracy.

I understand that you already completed tracking of a few datasets, and ran the validation procedure? What accuracy do you see?

janamach commented 3 years ago

No, I am actually slower than it may seem :-/ With small test datasets it worked out really well, but with large ones (e.g., 60 hours) I kept making different silly mistakes that hindered my progress. For example, I realized only yesterday that I need to run the training step much longer. Hopefully I will get to the point where I will run validation on one of the large experiments sometime this week.

asafgal commented 3 years ago

ok, hopefully the effort will pay off!

janamach commented 3 years ago

I think --missing might not be working... One xy file of 60 was not generated, but this restarted all jobs:

$ for i in {1..60}; do if [ -f ~/H2CN0402/antrax/antdata/xy_${i}_${i}.csv ]; then : ; else echo "Missing: ${i}" ; fi; done
Missing: 2
$ antrax solve H2CN0402/ --missing --hpc --hpc-options partition=single,cpus=2,mem-per-cpu=2000,time=72:00:00

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

Jobfile created in H2CN0402/antrax/logs/hpc_solve1.sh

Job number 19468563 was submitted

Jobfile created in H2CN0402/antrax/logs/hpc_solve2.sh

Job number 19468564 was submitted

Jobfile created in H2CN0402/antrax/logs/hpc_solve3.sh
$ cat H2CN0402/antrax/logs/hpc_solve1.sh
#!/bin/bash
#SBATCH --job-name=slv1:H2CN0402
#SBATCH --output=H2CN0402/antrax/logs/hpc_solve1_%a.log
#SBATCH --partition=single
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=72:00:00
#SBATCH --mem-per-cpu=2000
#SBATCH --array=1-60%60
#SBATCH --mail-type=ALL

srun -N1 antrax solve H2CN0402/ --session antrax --movlist $SLURM_ARRAY_TASK_ID  --nw 1  --step 1 --mcr

$ cat H2CN0402/antrax/logs/hpc_solve2.sh
#!/bin/bash
#SBATCH --job-name=slv2:H2CN0402
#SBATCH --output=H2CN0402/antrax/logs/hpc_solve2_%a.log
#SBATCH --partition=single
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=72:00:00
#SBATCH --mem-per-cpu=2000
#SBATCH --array=1-5%5
#SBATCH --mail-type=ALL

srun -N1 antrax solve H2CN0402/ --session antrax --movlist $SLURM_ARRAY_TASK_ID  --nw 1  --step 2 --mcr

$ cat H2CN0402/antrax/logs/hpc_solve3.sh
#!/bin/bash
#SBATCH --job-name=slv3:H2CN0402
#SBATCH --output=H2CN0402/antrax/logs/hpc_solve3_%a.log
#SBATCH --partition=single
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=72:00:00
#SBATCH --mem-per-cpu=2000
#SBATCH --array=1-60%60
#SBATCH --mail-type=ALL

srun -N1 antrax solve H2CN0402/ --session antrax --movlist $SLURM_ARRAY_TASK_ID  --nw 1  --step 3 --mcr
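
The existence check at the top of this comment can be wrapped in a small function so that it is easy to rerun after each attempt. This is only a sketch: the function name `missing_xy` is made up for illustration, and it assumes the `xy_<i>_<i>.csv` naming seen in the `antdata` directory above.

```shell
#!/usr/bin/env bash
# Print the indices 1..n for which antdata/xy_<i>_<i>.csv does not exist.
# missing_xy is a hypothetical helper, not part of anTraX.
missing_xy() {
    local antdata=$1 n=$2 i=1
    while [ "$i" -le "$n" ]; do
        [ -f "$antdata/xy_${i}_${i}.csv" ] || printf '%s\n' "$i"
        i=$((i + 1))
    done
}
```

For the experiment above this would be called as `missing_xy ~/H2CN0402/antrax/antdata 60`.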
janamach commented 3 years ago

The log of the missing file is complaining about a possibly corrupt MAT file. The file is physically there, what do you think could have caused the problem?

$ cat matlab_solve_m_2.log 
08:07:21 -I- Reading video information from file
08:07:27 -I- Loading trgraph from antrax/graphs/graph_2_2.mat
Error using load
Unable to read MAT-file /pfs/data5/home/fr/fr_fr/fr_jm1121/H2CN0402/antrax/graphs/graph_2_2_trjs.mat. File might be corrupt.
Error in trgraph.load (line 886)

Error in trhandles/loaddata (line 607)

Error in solve_single_movie (line 52)

Error in antrax_mcr_interface (line 30)
MATLAB:load:unableToReadMatFile
asafgal commented 3 years ago

Can you try to load the file in MATLAB using the load command? If it's indeed corrupted, it's possible that something interrupted the writing of the file, so it might be just a random thing. Did the track step for this video finish properly? Try re-running track for that video.

I'll take a look at the --missing issue tomorrow.
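
Before opening MATLAB, a quick header check can flag obviously damaged graph files. The sketch below is a heuristic only: it assumes the files are v5 or v7.3 MAT-files, which begin with the text `MATLAB`, and the function name `suspect_mat_files` is made up for illustration. A file that was truncated after its header will still pass, so `load` in MATLAB remains the real test.

```shell
#!/usr/bin/env bash
# List .mat files in a directory that are empty or that do not start with
# the "MATLAB" text header (assumption: v5 / v7.3 MAT-file format).
# suspect_mat_files is a hypothetical helper, not part of anTraX.
suspect_mat_files() {
    local dir=$1 f head6
    for f in "$dir"/*.mat; do
        [ -e "$f" ] || continue                    # glob matched nothing
        head6=$(head -c 6 "$f" | tr -d '\0')       # drop NULs before comparing
        if [ ! -s "$f" ] || [ "$head6" != "MATLAB" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

For this experiment it would be called as `suspect_mat_files H2CN0402/antrax/graphs`.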

janamach commented 3 years ago

Matlab has the same complaint:

>> addpath(genpath(['.','/matlab']));
>> load antrax/graphs/graph_2_2_trjs.mat
Error using load
Unable to read MAT-file /media/jana/HDD/bw/H2CN0402/antrax/graphs/graph_2_2_trjs.mat. File might be corrupt.

I think I know what I did wrong: I might have started the next step before the previous one finished. On the upside, it was otherwise a very smooth process, from track to solve.
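
One way to avoid starting a step too early is to let Slurm enforce the order with job dependencies. The sketch below submits the three jobfiles shown earlier, each one waiting for the previous one to succeed. `--parsable` and `--dependency=afterok:` are standard `sbatch` options; the function name and the `SBATCH_CMD` override (useful for a dry run) are made up for illustration.

```shell
#!/usr/bin/env bash
# Submit hpc_solve1/2/3.sh so that each step starts only after the
# previous one has finished successfully. Prints the three job ids.
# submit_solve_chain and SBATCH_CMD are hypothetical, not part of anTraX.
submit_solve_chain() {
    local logdir=$1 submit=${SBATCH_CMD:-sbatch}
    local j1 j2 j3
    j1=$("$submit" --parsable "$logdir/hpc_solve1.sh") || return 1
    j1=${j1%%;*}                                   # strip ";cluster" if present
    j2=$("$submit" --parsable --dependency=afterok:"$j1" "$logdir/hpc_solve2.sh") || return 1
    j2=${j2%%;*}
    j3=$("$submit" --parsable --dependency=afterok:"$j2" "$logdir/hpc_solve3.sh") || return 1
    j3=${j3%%;*}
    printf '%s %s %s\n' "$j1" "$j2" "$j3"
}
```

It would be called as `submit_solve_chain H2CN0402/antrax/logs`, from the directory in which the jobfiles were generated, because they use relative paths. All three arrays are queued at once, so this does not help with a cap on queued jobs; in that case the steps have to be run one at a time with --step. If any array task fails, the later steps will not start.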

janamach commented 3 years ago

No, something is still wrong. After re-slicing the videos and starting everything from scratch, I had errors during solve steps 2 and 3.

In step 2 it was either MATLAB:badsubscript or MATLAB:load:cantReadFile (?):

$ grep -rHnoL "Done" matlab_solve_g_*
matlab_solve_g_3.log
matlab_solve_g_4.log
matlab_solve_g_5.log
$ cat matlab_solve_g_3.log
00:43:23 -I- Reading video information from file
00:43:32 -I- solving graph from movies 25-36
00:43:32 -I- Loading trgraph from antrax/graphs/graph_25_25.mat
Error using load
Cannot read file /pfs/data5/home/fr/fr_fr/fr_jm1121/H2CN0402/antrax/graphs/graph_25_25.mat.
Error in trgraph.load (line 879)

Error in trhandles/loaddata (line 607)

Error in solve_across_movies (line 70)

Error in antrax_mcr_interface (line 53)
MATLAB:load:cantReadFile
$ cat matlab_solve_g_4.log 
00:43:51 -I- Reading video information from file
00:43:58 -I- solving graph from movies 37-48
00:43:58 -I- Loading trgraph from antrax/graphs/graph_37_37.mat
00:44:10 -I- Loading trgraph from antrax/graphs/graph_38_38.mat
00:44:14 -I- Loading trgraph from antrax/graphs/graph_39_39.mat
00:44:16 -I- Loading trgraph from antrax/graphs/graph_40_40.mat
00:44:17 -I- Loading trgraph from antrax/graphs/graph_41_41.mat
00:44:19 -I- Loading trgraph from antrax/graphs/graph_42_42.mat
00:44:21 -I- Loading trgraph from antrax/graphs/graph_43_43.mat
00:44:22 -I- Loading trgraph from antrax/graphs/graph_44_44.mat
00:44:24 -I- Loading trgraph from antrax/graphs/graph_45_45.mat
00:44:25 -I- Loading trgraph from antrax/graphs/graph_46_46.mat
00:44:27 -I- Loading trgraph from antrax/graphs/graph_47_47.mat
00:44:28 -I- Loading trgraph from antrax/graphs/graph_48_48.mat
00:44:28 -I- Finished loading trgraph with 10016 tracklets
00:44:30 -I- Loading ids
00:44:33 -I- Finding single ant nodes
00:44:33 -I- Some preperations
00:44:34 -I- Filtering out tracklets identified as non-ant
00:44:34 -I- ...690 tracklets classified as no-ant were filtered
00:44:34 -I- ...729 short, unconnected and unidentified tracklets were filtered
00:44:35 -I- Propagating ids from src tracklets
00:44:36 -I-     ...finished 1000/7355
00:44:36 -I-     ...finished 2000/7355
00:44:36 -I-     ...finished 3000/7355
00:44:36 -I-     ...finished 4000/7355
00:44:36 -I-     ...finished 5000/7355
00:44:36 -I-     ...finished 6000/7355
00:44:36 -I-     ...finished 7000/7355
00:44:36 -I- Propagation loops
Index in position 1 exceeds array bounds.
Error in trgraph/solve>propagate_all (line 522)

Error in trgraph/solve (line 150)

Error in solve_across_movies (line 72)

Error in antrax_mcr_interface (line 53)
MATLAB:badsubscript

In step 3:

$ grep -rHnoL "Done" matlab_export_m_*
matlab_export_m_25.log
matlab_export_m_28.log
matlab_export_m_35.log
matlab_export_m_36.log

$ cat matlab_export_m_25.log
00:58:53 -I- Reading video information from file
00:58:58 -I- Loading trgraph from antrax/graphs/graph_25_25.mat
Error using load
Cannot read file /pfs/data5/home/fr/fr_fr/fr_jm1121/H2CN0402/antrax/graphs/graph_25_25.mat.
Error in trgraph.load (line 879)

Error in trhandles/loaddata (line 607)

Error in export_single_movie (line 51)

Error in antrax_mcr_interface (line 42)
MATLAB:load:cantReadFile

None of the previous logs showed the errors.
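
The grep calls above can be turned into a comma-separated list of affected indices, which is handy when deciding what to rerun. This is a sketch: it assumes, as the checks above do, that a log of a finished run contains the word "Done", and the function name `unfinished_indices` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the numeric suffixes of <prefix>_<n>.log files that do not contain
# "Done", sorted and joined with commas.
# unfinished_indices is a hypothetical helper, not part of anTraX.
unfinished_indices() {
    local logdir=$1 prefix=$2
    grep -L "Done" "$logdir/$prefix"_*.log 2>/dev/null \
        | sed -E 's/.*_([0-9]+)\.log$/\1/' \
        | sort -n \
        | paste -sd, -
}
```

For example, `unfinished_indices H2CN0402/antrax/logs matlab_export_m` lists the movies whose export log never reached "Done".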

janamach commented 3 years ago

But running solve on a local machine with MATLAB already showed errors in step 1:

$ grep -rHinoL "Done" matlab_solve_m_* 
matlab_solve_m_25.log
matlab_solve_m_28.log
matlab_solve_m_35.log
matlab_solve_m_36.log

All of those logs show the same error:

$ cat matlab_solve_m_25.log 
09:24:23 -D- initializing expreader object
09:24:23 -I- Reading video information from file
09:24:26 -I- Loading trgraph from antrax/graphs/graph_25_25.mat
Error using load
Cannot read file /media/jana/HDD/H2CN0402/antrax/graphs/graph_25_25.mat.

Error in trgraph.load (line 879)
                load(fname,'G');

Error in trhandles/loaddata (line 607)
                GS = trgraph.load(Trck,movlist);

Error in solve_single_movie (line 52)
G = Trck.loaddata(m,colony);

Errors also appeared during step 2:

$ cat matlab_solve_g_3.log 
09:38:37 -D- initializing expreader object
09:38:37 -I- Reading video information from file
09:38:41 -I- solving graph from movies 25-36
09:38:41 -I- Loading trgraph from antrax/graphs/graph_25_25.mat
Error using load
Cannot read file /media/jana/HDD/H2CN0402/antrax/graphs/graph_25_25.mat.

Error in trgraph.load (line 879)
                load(fname,'G');

Error in trhandles/loaddata (line 607)
                GS = trgraph.load(Trck,movlist);

Error in solve_across_movies (line 70)
G = Trck.loaddata(movlist,colony);

But matlab seems to be able to load the file:

>> load antrax/graphs/graph_25_25.mat     
Warning: Variable 'G' originally saved as a trgraph cannot be instantiated as an object and will be read in as a uint32. 

And in step 3 it was quite expected:

$ grep -rHinoL "Done" matlab_export_m_*
matlab_export_m_25.log
matlab_export_m_28.log
matlab_export_m_35.log
matlab_export_m_36.log
$ cat matlab_export_m_36.log 
10:27:21 -D- initializing expreader object
10:27:21 -I- Reading video information from file
10:27:24 -I- Loading trgraph from antrax/graphs/graph_36_36.mat
Error using load
Cannot read file /media/jana/HDD/H2CN0402/antrax/graphs/graph_36_36.mat.

Error in trgraph.load (line 879)
                load(fname,'G');

Error in trhandles/loaddata (line 607)
                GS = trgraph.load(Trck,movlist);

Error in export_single_movie (line 51)
G = Trck.loaddata(m,colony);
janamach commented 3 years ago

The above was partially solved by re-running the track step for movies 25,28,35,36 on HPC. Step 2 showed the MATLAB:badsubscript error in all logs (5 graphs in total), but step 3 finished successfully and the missing mat/csv files have been generated.

$ cat matlab_solve_g_3.log 
09:26:30 -I- Reading video information from file
09:26:36 -I- solving graph from movies 25-36
09:26:36 -I- Loading trgraph from antrax/graphs/graph_25_25.mat
09:26:48 -I- Loading trgraph from antrax/graphs/graph_26_26.mat
09:26:52 -I- Loading trgraph from antrax/graphs/graph_27_27.mat
09:26:54 -I- Loading trgraph from antrax/graphs/graph_28_28.mat
09:26:57 -I- Loading trgraph from antrax/graphs/graph_29_29.mat
09:26:57 -I- Loading trgraph from antrax/graphs/graph_30_30.mat
09:26:58 -I- Loading trgraph from antrax/graphs/graph_31_31.mat
09:26:59 -I- Loading trgraph from antrax/graphs/graph_32_32.mat
09:26:59 -I- Loading trgraph from antrax/graphs/graph_33_33.mat
09:27:00 -I- Loading trgraph from antrax/graphs/graph_34_34.mat
09:27:01 -I- Loading trgraph from antrax/graphs/graph_35_35.mat
09:27:09 -I- Loading trgraph from antrax/graphs/graph_36_36.mat
09:27:19 -I- Finished loading trgraph with 17914 tracklets
09:27:21 -I- Loading ids
09:27:25 -I- Finding single ant nodes
09:27:26 -I- Some preperations
09:27:28 -I- Filtering out tracklets identified as non-ant
09:27:28 -I- ...8544 tracklets classified as no-ant were filtered
09:27:28 -I- ...6359 short, unconnected and unidentified tracklets were filtered
09:27:29 -I- Propagating ids from src tracklets
09:27:31 -I-     ...finished 1000/7421
09:27:31 -I-     ...finished 2000/7421
09:27:31 -I-     ...finished 3000/7421
09:27:31 -I-     ...finished 4000/7421
09:27:31 -I-     ...finished 5000/7421
09:27:31 -I-     ...finished 6000/7421
09:27:31 -I-     ...finished 7000/7421
09:27:31 -I- Propagation loops
Index in position 1 exceeds array bounds (must not exceed 14008).
Error in trgraph/solve>propagate_all (line 522)

Error in trgraph/solve (line 150)

Error in solve_across_movies (line 72)

Error in antrax_mcr_interface (line 53)
MATLAB:badsubscript
asafgal commented 3 years ago

So, if I understand correctly, the corrupted file issue was solved by the rerun?

Regarding the new MATLAB:badsubscript error, it is different than the previous one we had above. I'm not sure what's going on there. After you tracked some of the videos again, did you also run the classify and solve1?

Step 2 actually "stitches" the graphs of individual videos and propagates information from one video to another. In practice, it is not strictly required, which is why step 3 is able to finish properly. Without it, the tracking might be suboptimal at the interface between the videos.

janamach commented 3 years ago

> So, if I understand correctly, the corrupted file issue was solved by the rerun?

Yes. It looks like there was some strange error happening that was not reflected in the logs, but produced some corrupt graph MAT files during track. At least that's my best explanation.

> After you tracked some of the videos again, did you also run the classify and solve1?

I tried both actually, both worked. But I went with the latter one. What consequences would re-running track and then going directly to solve have on detections?

asafgal commented 3 years ago

Theoretically, the algorithm is completely deterministic, so the two runs should have the same tracklet graph and tracklet names. However, there are occasionally some small misalignments between runs that I cannot explain. Also, when you run track, it cleans some of the data generated by later steps, so it is better to rerun the downstream steps as well.

I'm not sure what you mean by "both worked". Was the latest MATLAB:badsubscript in step 2 solved?

janamach commented 3 years ago

Sorry, I made it too confusing. It looks like I've been dealing with two separate problems (they just looked like one at first): xy files not being generated after step 3 and step 2 showing different errors (either MATLAB:badsubscript with MCR or Index in position 1 exceeds array bounds with MATLAB). With "both worked" I was referring to the first problem that was caused by the corrupt graph files generated during track and fixed by re-running either just track and then solve, or track, classify, and solve.

> Was the latest MATLAB:badsubscript in step 2 solved?

No, it is still happening.

asafgal commented 3 years ago

ok, so let's try to understand this new MATLAB:badsubscript better (it's the same error on MCR/MATLAB, just reported differently). As I said, it's a different one from the one we had earlier in this thread. We'll have to do it the painful way, as I can't reproduce it on my side.

I've added a few lines of code to report some info at the problematic place. Run it in an interactive MATLAB session:

Trck = trhandles(uigetdir);
solve_across_movies(Trck, 'g', 3);
janamach commented 3 years ago

Hmm, maybe I am doing something wrong here:

>> addpath(genpath(['.','/matlab']));
>> Trck = trhandles(uigetdir);       
Warning: uigetdir is no longer supported when MATLAB is started with the -nodisplay or -noFigureWindows option or there is no display. For more information, see "Changes to
-nodisplay and -noFigureWindows Startup Options" in the MATLAB Release Notes. To view the release note in your system browser, run
web('www.mathworks.com/help/matlab/release-notes.html#br5ktrh-3', '-browser') 
> In warnfiguredialog (line 21)
  In uigetdir (line 60) 
Error using javaObjectEDT
Scalar input must be a java object

Error in matlab.ui.internal.dialog.Dialog/getParentFrame (line 46)
               obj.ParentFrame = javaObjectEDT(com.mathworks.hg.peer.utils.DialogUtilities.createParentWindow);

Error in matlab.ui.internal.dialog.FileSystemChooser/getParentFrame (line 129)
                parframe = getParentFrame@matlab.ui.internal.dialog.Dialog(obj);

Error in matlab.ui.internal.dialog.FolderChooser/doShowDialog (line 70)
            javaMethodEDT('showOpenDialog', obj.Peer, getParentFrame(obj));

Error in matlab.ui.internal.dialog.FolderChooser/show (line 48)
            doShowDialog(obj)

Error in uigetdir_helper (line 32)
    dirdlg.show();

Error in uigetdir (line 61)
[directoryname] = uigetdir_helper(varargin{:});

>> Exception in thread "AWT-EventQueue-0" java.awt.HeadlessException
    at java.awt.GraphicsEnvironment.checkHeadless(GraphicsEnvironment.java:204)
    at java.awt.Window.<init>(Window.java:536)
    at java.awt.Frame.<init>(Frame.java:420)
    at javax.swing.JFrame.<init>(JFrame.java:233)
    at com.mathworks.mwswing.MJFrame.<init>(MJFrame.java:108)
    at com.mathworks.mwswing.MJFrame.<init>(MJFrame.java:101)
    at com.mathworks.hg.peer.utils.DialogUtilities$1.runWithOutput(DialogUtilities.java:56)
    at com.mathworks.jmi.AWTUtilities$Invoker$2.watchedRun(AWTUtilities.java:475)
    at com.mathworks.jmi.AWTUtilities$WatchedRunnable.run(AWTUtilities.java:436)
    at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:311)
    at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:758)
    at java.awt.EventQueue.access$500(EventQueue.java:97)
    at java.awt.EventQueue$3.run(EventQueue.java:709)
    at java.awt.EventQueue$3.run(EventQueue.java:703)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:74)
    at java.awt.EventQueue.dispatchEvent(EventQueue.java:728)
    at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:205)
    at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:116)
    at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:105)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
    at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:93)
    at java.awt.EventDispatchThread.run(EventDispatchThread.java:82)
asafgal commented 3 years ago

uigetdir does not work when you run MATLAB without a display (e.g., with the -nodisplay option, or through ssh). You can just pass the path to the expdir as a string instead: Trck = trhandles('path/to/expdir');
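
When MATLAB is started from a shell script on a machine without a display, the experiment path can be filled in on the shell side. The sketch below only builds the MATLAB statement; the function name `trhandles_stmt` is made up for illustration. Single quotes in the path are doubled, since that is how MATLAB escapes them inside a character vector.

```shell
#!/usr/bin/env bash
# Build the MATLAB statement that opens an experiment directory without
# going through uigetdir. trhandles_stmt is a hypothetical helper.
trhandles_stmt() {
    local expdir=$1 escaped
    escaped=$(printf '%s' "$expdir" | sed "s/'/''/g")   # ' -> '' for MATLAB
    printf "Trck = trhandles('%s');\n" "$escaped"
}

# Example (not run here), assuming the current directory is the anTraX
# repository root so that ./matlab is on the path:
# matlab -nodisplay -r "addpath(genpath('./matlab')); $(trhandles_stmt /path/to/expdir)"
```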

janamach commented 3 years ago

Sorry, that was silly of me :-/ With the dataset that had the problem:

>> Trck = trhandles('.');
21:25:09 -I- Loading tracking session from expdir
21:25:17 -I- Reading video information from file
>> solve_across_movies(Trck, 'g', 3);
Error using solve_across_movies (line 11)
Expected a string scalar or character vector for the parameter name.

>>