Social-Evolution-and-Behavior / anTraX

anTraX: high throughput tracking of color-tagged insects
https://antrax.readthedocs.io/
GNU General Public License v3.0

Potential graph corruption at solve 2 resulting in three types of error messages in matlab_export_*.log #42

Open lizimai opened 1 year ago

lizimai commented 1 year ago

Hi! I wonder if anyone could help me fix a problem I encountered when running anTraX on an HPC cluster managed by SLURM. It occurs while processing a large continuous experiment (~288 videos, each containing 6000 frames).

Although none of the previous steps produce any errors, three types of error appear for some videos at the last step of propagation. Some of them seem to be related to graph corruption at solve step 2, since the graphs of the problematic videos from before step 2 seem fine when I open them in MATLAB on my local machine.

Note that the problem is not present in all videos, only in a subset. After rerunning all the previous steps (tracking, classification, etc.) on the problematic videos, the problem sometimes resolves itself. However, this means running all the steps twice (sometimes even more), which is quite time-consuming.

Here are the three types of error messages that I get:

  1. Variable 'trjs' not found
    17:25:39 -I- Loading trgraph from antrax/graphs/C4/graph_10_10.mat
    Warning: Variable 'trjs' not found.
    > In trgraph.load (line 887)
      In trhandles/loaddata (line 609)
      In export_single_movie (line 59)
      In antrax_mcr_interface (line 42)
    Undefined function or variable 'trjs'.
    Error in trgraph.load (line 888)
    Error in trhandles/loaddata (line 609)
    Error in export_single_movie (line 59)
    Error in antrax_mcr_interface (line 42)
    MATLAB:UndefinedFunction
  2. Warning: Variable 'G' not found
    14:32:26 -I- Loading trgraph from antrax/graphs/C1/graph_1_1.mat
    Warning: Variable 'G' not found.
    > In trgraph.load (line 880)
      In trhandles/loaddata (line 609)
      In export_single_movie (line 59)
      In antrax_mcr_interface (line 42)
    Reference to non-existent field 'trjs'.
    Error in trgraph.load (line 886)
    Error in trhandles/loaddata (line 609)
    Error in export_single_movie (line 59)
    Error in antrax_mcr_interface (line 42)
    MATLAB:nonExistentField
  3. File might be corrupt.
    14:59:40 -I- Loading trgraph from antrax/graphs/C2/graph_101_101.mat
    Error using load
    Unable to read MAT-file /scratch1/users/zimai.li/idol/idol_cam2/antrax/graphs/C2/graph_101_101.mat. File might be corrupt.
    Error in trgraph.load (line 880)
    Error in trhandles/loaddata (line 609)
    Error in export_single_movie (line 59)
    Error in antrax_mcr_interface (line 42)
    MATLAB:load:unableToReadMatFile
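
To decide which videos need a rerun, the three signatures above can be searched for in the export logs. The following is only a rough sketch, not part of anTraX: the function names and the `log_dir` argument are hypothetical, the `matlab_export_*.log` pattern is taken from the issue title, and the regular expressions assume the logs contain the messages exactly as quoted above.

```python
import re
from collections import defaultdict
from pathlib import Path

# Signatures copied from the three log excerpts in this issue.
ERROR_PATTERNS = {
    "missing_trjs": re.compile(r"Variable 'trjs' not found"),
    "missing_G": re.compile(r"Variable 'G' not found"),
    "corrupt_file": re.compile(r"File might be corrupt"),
}
# The graph file being loaded is announced just before the error.
GRAPH_PATH = re.compile(r"Loading trgraph from (\S+\.mat)")


def classify_log(text):
    """Return a list of (error_type, graph_path) pairs found in one log."""
    hits = []
    current_graph = None
    for line in text.splitlines():
        match = GRAPH_PATH.search(line)
        if match:
            current_graph = match.group(1)
        for name, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                hits.append((name, current_graph))
    return hits


def scan_logs(log_dir, glob="matlab_export_*.log"):
    """Group the affected graph files by error type over all matching logs.

    `log_dir` is a hypothetical location; point it at wherever the
    export logs are written on your cluster.
    """
    failures = defaultdict(set)
    for log_file in sorted(Path(log_dir).glob(glob)):
        text = log_file.read_text(errors="replace")
        for name, graph in classify_log(text):
            failures[name].add(graph)
    return failures
```

Calling `scan_logs` on the log directory returns a dictionary keyed by error type, so only the graphs listed there would need to be regenerated instead of the whole experiment.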


Thanks a lot in advance!

janamach commented 1 year ago

Did you try opening these files with MATLAB on your local machine? Does the file size of the corrupt files seem reasonable (e.g., compared to "good" .mat files)?

Are you running the latest antrax version on all machines?
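
A quick way to run the size comparison over a whole graph directory, without opening MATLAB, could look like the sketch below. It is not an anTraX tool: the function names, the `graph_dir` argument and the `min_ratio` threshold are hypothetical, and the `graph_*.mat` pattern is inferred from the file names in the logs. It only checks that each file starts with a MAT-file header (for v7.3 files, that the HDF5 signature follows the 512-byte header) and that its size is not far below the median. It cannot tell whether the variables `G` or `trjs` are present, and graph sizes may legitimately differ between videos.

```python
import statistics
from pathlib import Path

# HDF5 files (used by the MAT v7.3 format) carry this 8-byte signature.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def header_looks_valid(head: bytes) -> bool:
    """Check the first bytes of a file against the MAT-file header layout."""
    # Every MAT-file (v5 and later) begins with a 128-byte text header.
    if len(head) < 128 or not head.startswith(b"MATLAB "):
        return False
    # v7.3 files are HDF5 containers with the signature at offset 512.
    if head.startswith(b"MATLAB 7.3"):
        return head[512:520] == HDF5_SIGNATURE
    return True


def mat_header_ok(path) -> bool:
    """Read the start of `path` and validate it as a MAT-file header."""
    with open(path, "rb") as fh:
        return header_looks_valid(fh.read(520))


def flag_suspect_graphs(graph_dir, min_ratio=0.5):
    """List graph files with a bad header or an unusually small size.

    `min_ratio` is an arbitrary illustrative threshold: files smaller
    than min_ratio * median size are flagged for manual inspection.
    """
    sizes = {f: f.stat().st_size for f in sorted(Path(graph_dir).glob("graph_*.mat"))}
    if not sizes:
        return []
    median = statistics.median(sizes.values())
    suspects = []
    for f, size in sizes.items():
        reasons = []
        if not mat_header_ok(f):
            reasons.append("bad header")
        if size < min_ratio * median:
            reasons.append("much smaller than median")
        if reasons:
            suspects.append((f.name, size, reasons))
    return suspects
```

Files flagged here are candidates for a closer look in MATLAB; a file that passes both checks can still be missing variables, as in the first two error types.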

lizimai commented 1 year ago

Yes, I tried opening these files on my local machine. I can open the graph files for which the error was "Variable 'G' not found" or "Variable 'trjs' not found". But the file that gave "File might be corrupt." is indeed corrupt. I am running the latest version from the 'master' branch on all machines.

janamach commented 1 year ago

Do the first two files have the variables that were not found ('G' and 'trjs', respectively)? How does the solve 1 log file look for all three? Anything abnormal compared to the "good" videos? Any errors in the track logs?

asafgal commented 1 year ago

This problem on SLURM systems has persisted since I developed anTraX, and I couldn’t find a good explanation for it. As Zimai noticed, it usually disappears upon a rerun, and is probably related to highly fragmented graphs. My best guess is that something in MATLAB doesn’t work properly on these highly parallel systems, and that it relates to the .mat file format. I had a plan to completely change the way anTraX stores its data, in the hope that it would solve the issue or at least not require rerunning everything. Unfortunately I can’t say when I will have time to do this, or if it will actually help.

Sorry I can’t help further at this point :-(

lizimai commented 1 year ago

Thank you, Asaf, for the insight! Following up on this, do you think it would help if I concatenated my videos into longer ones to reduce the number of parallel jobs, and checked whether that improves the outcome? Have you tried something like this before?