Social-Evolution-and-Behavior / anTraX

anTraX: high throughput tracking of color-tagged insects
https://antrax.readthedocs.io/
GNU General Public License v3.0

anTraX on HPC and writing to /scratch issue #13

Closed: janamach closed this issue 3 years ago

janamach commented 3 years ago

Hi!

I am not sure if you will be able to help with this one, but maybe you've seen something like this before. I got access to a remote HPC cluster (first time for me, I've only used local computers and servers up until now). When running pip install . I get a permission denied error because pip tries to write into the /scratch directory. Generally, the output looks like this:

With virtualenv or anaconda:

Processing /pfs/data5/home/fr/fr_fr/fr_jm1121/src/anTraX
ERROR: Could not install packages due to an OSError: [('/pfs/data5/home/fr/fr_fr/fr_jm1121/src/anTraX/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.pack', '/scratch/pip-req-build-5g6wd_w5/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.pack', "[Errno 13] Permission denied: '/scratch/pip-req-build-5g6wd_w5/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.pack'"), ('/pfs/data5/home/fr/fr_fr/fr_jm1121/src/anTraX/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.idx', '/scratch/pip-req-build-5g6wd_w5/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.idx', "[Errno 13] Permission denied: '/scratch/pip-req-build-5g6wd_w5/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.idx'")]

With global pip and --user flag:

[fr_jm1121@uc2n997 anTraX]$ pip3.6 install . --user
Processing /pfs/data5/home/fr/fr_fr/fr_jm1121/src/anTraX
Exception:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/usr/lib/python3.6/site-packages/pip/commands/install.py", line 346, in run
    requirement_set.prepare_files(finder)
  File "/usr/lib/python3.6/site-packages/pip/req/req_set.py", line 381, in prepare_files
    ignore_dependencies=self.ignore_dependencies))
  File "/usr/lib/python3.6/site-packages/pip/req/req_set.py", line 623, in _prepare_file
    session=self.session, hashes=hashes)
  File "/usr/lib/python3.6/site-packages/pip/download.py", line 809, in unpack_url
    unpack_file_url(link, location, download_dir, hashes=hashes)
  File "/usr/lib/python3.6/site-packages/pip/download.py", line 686, in unpack_file_url
    shutil.copytree(link_path, location, symlinks=True)
  File "/usr/lib64/python3.6/shutil.py", line 365, in copytree
    raise Error(errors)
shutil.Error: [('/pfs/data5/home/fr/fr_fr/fr_jm1121/src/anTraX/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.pack', '/scratch/pip-hppf66nt-build/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.pack', "[Errno 13] Permission denied: '/scratch/pip-hppf66nt-build/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.pack'"), ('/pfs/data5/home/fr/fr_fr/fr_jm1121/src/anTraX/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.idx', '/scratch/pip-hppf66nt-build/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.idx', "[Errno 13] Permission denied: '/scratch/pip-hppf66nt-build/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.idx'")]

I was able to install other packages using pip (e.g., pip install pandas), so this issue seems to be more specific.

A Google search didn't help much and I already contacted the HPC technical support for that. I thought I should ask here too in case you've seen this before :-)

asafgal commented 3 years ago

Not sure what's going on; I've never encountered something like that. You can try forcing the target directory with the --target option, but it would be better to understand what's going on here.

Is your conda environment alive and well, and located in your home directory?

janamach commented 3 years ago

This is quite bizarre. It looks like pip tries to write temporary files to $TMPDIR (which is /scratch by default), but cannot for some reason. My user has read/write permission to /scratch, and I can create files and folders there. I can do something like this:

mkdir /scratch/tmp_jana/
chmod 777 /scratch/tmp_jana/
TMPDIR=/scratch/tmp_jana/

And then I still get the Permission denied error when pip tries to write into /scratch/tmp_jana/:

(anTraX) [fr_jm1121@uc2n996 ~]$ pip install src/anTraX/
Processing ./src/anTraX
ERROR: Could not install packages due to an OSError: [('/pfs/data5/home/fr/fr_fr/fr_jm1121/src/anTraX/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.pack', '/scratch/tmp_jana/pip-req-build-6q83vttj/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.pack', "[Errno 13] Permission denied: '/scratch/tmp_jana/pip-req-build-6q83vttj/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.pack'"), ('/pfs/data5/home/fr/fr_fr/fr_jm1121/src/anTraX/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.idx', '/scratch/tmp_jana/pip-req-build-6q83vttj/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.idx', "[Errno 13] Permission denied: '/scratch/tmp_jana/pip-req-build-6q83vttj/.git/objects/pack/pack-d30091f4eb8139eba7e89868113a6ebdca569f82.idx'")]

I tried different paths for $TMPDIR (including $HOME/tmp), but I keep getting the same error. Specifying --target also didn't help.

Thank you for trying to help. Your --target suggestion made me wonder where temporary files get written and helped me figure out the above. I'll bother the IT support some more :-)
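
A quick way to rule out plain write permissions, as described above, is to probe the temporary directory directly. The following is a minimal sketch, not part of anTraX or pip; the helper name `can_write_to` is hypothetical. It relies only on Python's standard `tempfile` module, whose `gettempdir()` consults `TMPDIR`, `TEMP`, and `TMP` before falling back to platform defaults.

```python
import os
import tempfile


def can_write_to(directory):
    """Return True if a file can be created, written, and removed in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
        return True
    except OSError:
        # Covers PermissionError, FileNotFoundError, and similar failures.
        return False


# Report which temporary directory Python would use and whether it is writable.
tmp = tempfile.gettempdir()
print(f"TMPDIR={os.environ.get('TMPDIR')!r}, temp dir in use: {tmp}")
print("writable:", can_write_to(tmp))
```

If this probe succeeds while pip still fails, as it would have here, the problem is likely specific to the files being copied rather than to the directory itself.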

janamach commented 3 years ago

IT support recommended removing the .git directory. That solved the strange problem, and pip install . ran as expected afterwards.
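
For anyone who would rather keep the .git directory in their checkout, one alternative is to install from a copy that omits it. This is only a sketch of that idea under stated assumptions: the function name and both paths in the usage comment are hypothetical, and it has not been tested on the cluster discussed here.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def stage_without_git(source, staging):
    """Copy `source` to `staging`, skipping anything named .git.

    `staging` must not exist yet; shutil.copytree creates it.
    """
    source, staging = Path(source), Path(staging)
    shutil.copytree(source, staging, ignore=shutil.ignore_patterns(".git"))
    return staging


def pip_install(path):
    """Run `pip install <path>` using the pip of the current interpreter."""
    subprocess.run([sys.executable, "-m", "pip", "install", str(path)], check=True)


# Example usage (placeholder paths, adjust to your own checkout):
#   staged = stage_without_git(Path.home() / "src" / "anTraX",
#                              Path.home() / "src" / "anTraX-nogit")
#   pip_install(staged)
```

The original checkout stays untouched, so it can still be updated with git and re-staged later.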

In the docs you mention that the MATLAB engine for Python does not need to be installed on HPC, but whatever command I try to run, it complains about no module named 'matlab':

(antrax) [fr_jm1121@uc2n997 anTraX-data]$ antrax track JS16/ --hpc
Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/bin/antrax", line 5, in <module>
    from antrax.cli import main
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 5, in <module>
    from .matlab import *
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/matlab.py", line 36, in <module>
    import matlab.engine
ModuleNotFoundError: No module named 'matlab'

What am I missing?

asafgal commented 3 years ago

Great, glad it was resolved.

Did you set the ANTRAX_USE_MCR variable?

export ANTRAX_USE_MCR=True

janamach commented 3 years ago

Ah, I should have known this one! That particular HPC has MATLAB installed. Since I am quite sure my request to install MCR would be denied, I installed the MATLAB engine for Python using these commands:

cd "matlabroot/extern/engines/python"
python setup.py build --build-base=$HOME/tmp/build install

They only have R2019b, R2020a, and R2020b. I hope 2019b will work for this. Now to figure out how to run batch jobs...

Thanks again for helping :-)

asafgal commented 3 years ago

There are some compatibility issues with 2019b, but I don't remember if it's in the GUI or in the actual tracking code. I guess you'll see soon enough. Anyhow, on all clusters I know, you can install software like MCR in your home directory (or other volumes) without requiring a system-wide installation, so you can try that if you find 2019b crashes. Also, if your cluster uses slurm as a scheduling system, antrax will handle the batch job creation and submission for you...

janamach commented 3 years ago

There are some compatibility issues with 2019b, but I don't remember if its in the GUI or in the actually tracking code

Maybe that is what I am observing. I couldn't start the GUI with X forwarding because of this error:

(antrax) [fr_jm1121@uc2n995 anTraX-data]$ antrax configure JS16/

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

19/03/21 15:07:37 -D- antrax cli entry point
Caught "std::exception" Exception message is:
Bundle#243 start failed: /pfs/data5/software_uc2/bwhpc/common/math/matlab/R2019b/bin/glnxa64/builtins/sl_services/mwlibmwsl_services_builtinimpl.so: failed to map segment from shared object
Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/bin/antrax", line 8, in <module>
    sys.exit(main())
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 651, in main
    """)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/sigtools/modifiers.py", line 158, in __call__
    return self.func(*args, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 363, in run
    ret = cli(*args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 262, in _cli
    return func('{0} {1}'.format(name, command), *args)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/clize/runner.py", line 220, in __call__
    return func(*posargs, **kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/cli.py", line 104, in configure
    launch_matlab_app('antrax', args, mcr=mcr)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/matlab.py", line 202, in launch_matlab_app
    eng = start_matlab()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/matlab.py", line 136, in start_matlab
    eng.addpath(p, nargout=0)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/matlabengine.py", line 71, in __call__
    _stderr, feval=True).result()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/futureresult.py", line 67, in result
    return self.__future.result(timeout)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/fevalfuture.py", line 82, in result
    self._result = pythonengine.getFEvalResult(self._future,self._nargout, None, out=self._out, err=self._err)
matlab.engine.MatlabExecutionError: Bundle#243 start failed: /pfs/data5/software_uc2/bwhpc/common/math/matlab/R2019b/bin/glnxa64/builtins/sl_services/mwlibmwsl_services_builtinimpl.so: failed to map segment from shared object

And track was also terminated, but I am not sure if it's because I tried running it directly in the login server:

(antrax) [fr_jm1121@uc2n995 anTraX-data]$ antrax track JS16/ --nw 6

==================================================================================

Welcome to anTraX - a software for tracking color tagged ants (and other insects)

==================================================================================

19/03/21 15:13:54 -I- Starting 6 workers
19/03/21 15:14:02 -I- Started track movie 1
19/03/21 15:14:02 -I- Started track movie 2
19/03/21 15:14:02 -I- Started track movie 3
19/03/21 15:14:03 -I- Started track movie 4
19/03/21 15:14:03 -I- Started track movie 5
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/matlab.py", line 224, in worker
    eng = start_matlab() if not self.mcr else None
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/matlab.py", line 136, in start_matlab
    eng.addpath(p, nargout=0)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/matlabengine.py", line 71, in __call__
    _stderr, feval=True).result()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/futureresult.py", line 67, in result
    return self.__future.result(timeout)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/fevalfuture.py", line 82, in result
    self._result = pythonengine.getFEvalResult(self._future,self._nargout, None, out=self._out, err=self._err)
matlab.engine.EngineError: MATLAB function cannot be evaluated

Exception ignored in: <bound method MatlabEngine.__del__ of <matlab.engine.matlabengine.MatlabEngine object at 0x14dd5c467320>>
Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/matlabengine.py", line 250, in __del__
    self.exit()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/matlabengine.py", line 232, in exit
    pythonengine.closeMATLAB(self.__dict__["_matlab"])
SystemError: MATLAB process cannot be terminated.
19/03/21 15:14:04 -I- Finished track movie 3
19/03/21 15:14:04 -I- Started track movie 6
19/03/21 15:14:04 -I- Finished track movie 2
19/03/21 15:14:04 -I- Finished track movie 1
19/03/21 15:14:04 -I- Finished track movie 5
19/03/21 15:14:04 -I- Finished track movie 6
19/03/21 15:14:04 -I- Finished track movie 4
19/03/21 15:14:04 -I- Started link scross movies
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/matlab.py", line 237, in worker
    eng.quit()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/matlabengine.py", line 240, in quit
    self.exit()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/matlabengine.py", line 232, in exit
    pythonengine.closeMATLAB(self.__dict__["_matlab"])
SystemError: MATLAB process cannot be terminated.

19/03/21 15:14:04 -I- Finished link scross movies
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/antrax/matlab.py", line 237, in worker
    eng.quit()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/matlabengine.py", line 240, in quit
    self.exit()
  File "/home/fr/fr_fr/fr_jm1121/anaconda3/envs/antrax/lib/python3.6/site-packages/matlab/engine/matlabengine.py", line 232, in exit
    pythonengine.closeMATLAB(self.__dict__["_matlab"])
SystemError: MATLAB process cannot be terminated.

19/03/21 15:14:04 -I- Workers closed

I will look into installing MCR locally, maybe that's a safer option since we know it works.

asafgal commented 3 years ago

In principle, I don't recommend running GUIs on clusters. The login nodes are usually crowded and not efficient. I think they also sometimes block interactive applications. The recommended workflow is to configure a session on a local machine, sync the data to the HPC, track, and sync the data back.

I'm not sure why the track command fails. It seems to error already at the first MATLAB command. You can try looking at the MATLAB logs in session/logs and see if there is more useful information there.
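
The configure-locally, sync, track, sync-back workflow mentioned above could be scripted roughly as follows. This is a hedged sketch, not an anTraX feature: the function name, host, and paths are hypothetical, it assumes `rsync` and `ssh` are available, and it assumes `antrax` is on the PATH of a non-interactive shell on the cluster (which may require activating the conda environment first). Only the `antrax track <session> --hpc` form is taken from this thread. The function builds the commands without running them, so they can be inspected first.

```python
def sync_track_sync(local_dir, remote_host, remote_dir):
    """Build the three commands: push session, track on the cluster, pull results."""
    local = local_dir.rstrip("/") + "/"
    remote_path = remote_dir.rstrip("/") + "/"
    remote = f"{remote_host}:{remote_path}"
    return [
        ["rsync", "-av", local, remote],  # push the configured session
        ["ssh", remote_host, "antrax", "track", remote_path, "--hpc"],  # submit tracking
        ["rsync", "-av", remote, local],  # pull results once the jobs have finished
    ]


# Placeholder host and paths, for illustration only.
for command in sync_track_sync("JS16", "user@cluster.example.org", "/work/JS16"):
    print(" ".join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the printed commands look right; the final sync should only be run after the cluster jobs are done.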

janamach commented 3 years ago

I know, I know. This was just my way of testing whether what I installed there works before I start a batch job. I am already using the recommended method on a local server, where I was able to convince IT to install MCR 2019a and ffmpeg for me. They don't seem to have a queuing system, so I wonder if they will complain at some point.

The logs in the HPC server are not happy:

Caught "std::exception" Exception message is:
Message Catalog MATLAB:class was not loaded from the file. Please check file location, format or contents

I think I will explore the MCR local installation option.

janamach commented 3 years ago

As always, your advice is very useful. I installed MCR 2019a in my $HOME directory and antrax started behaving as I expected it to. I ran the demo tracking in the HPC mode with more or less default parameters and it finished successfully in a couple of minutes.

I want to use HPC for running very long jobs. An experiment I have in mind right now is 60 hours of video divided into 5 video files, 11GB each. What settings would you recommend for tracking (number of CPUs, throttle, MATLAB workers)? And is there a way to predict how much time such a job would need to complete, other than checking the logs for how much time is typically spent per 1000 frames?

asafgal commented 3 years ago

By design, anTraX parallelizes the tracking execution by video. So if you have 5 videos, you can execute 5 parallel tasks in each step of the tracking. Your individual videos are very long, so it might take a while (see below). If you can segment the experiment into shorter videos, it will speed things up considerably. Long videos also create very large tracking data files (track data and cropped images), which might create problems in accessing the data in later steps. However, this also depends on the actual "track rate" in the experiment. If I understand your experiment correctly, your tracks are very sparse, so it might not be an issue. There is an "undocumented" option in the program to break down videos during execution, so that each video is processed by a few threads, but it might be a bit buggy, so it's better not to use it unless you actually have issues and manually segmenting the videos is not feasible for some reason.

The actual run time depends on many experiment-specific factors, like the number of ants per frame, the typical distance between ants (when two ants are close together, anTraX uses optical flow to link blobs in consecutive frames, which is computationally expensive), the video resolution, etc. You can test run a small segment (say 1000 frames) and use it to estimate the total run time, assuming the tracking complexity is stationary.
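
The extrapolation suggested above amounts to a simple proportion. Here is a minimal sketch; the function name and all numbers (a 90 second test run, a frame rate of 5 frames per second) are made-up illustrations, not measurements from this thread.

```python
def estimate_total_seconds(test_frames, test_seconds, total_frames):
    """Linearly extrapolate run time from a short test segment.

    Assumes the per-frame cost in the test segment is representative
    of the whole video (stationary tracking complexity).
    """
    if test_frames <= 0:
        raise ValueError("test_frames must be positive")
    return test_seconds / test_frames * total_frames


# Hypothetical example: a 1000-frame test took 90 s, and the full video has
# 216000 frames (12 hours at an assumed 5 frames per second).
estimate = estimate_total_seconds(1000, 90.0, 216_000)
print(f"{estimate / 3600:.1f} hours")  # 5.4 hours
```

Picking the test segment from a busy part of the video gives a more conservative estimate than picking a quiet one.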

The classify step scales with the number of tracklets created times the average tracklet length, and the solve step scales with the number of tracklets and the tracklet graph complexity (which is not very complex in your case I think).

As for resource allocation: for the initial track step, 2 CPUs per task is enough, as the program creates two threads, one for reading frames from the video and the other for processing them. 4-5GB per task is also more than enough. The throttle (how many tasks run in parallel) is up to your HPC rules; if you indeed have only 5 videos, then you don't need more than that.

The classify step is more variable in its resource consumption. Each anTraX task classifies tracklets from one video, and TensorFlow handles the parallelization within a video, which depends heavily on the structure of the data (the size of the ant blobs and the typical number of frames per tracklet). I usually use 6 CPUs and 6GB. You can run a test, look at the resource consumption of the process, and use it as an estimate.

The solve step is a single thread per video; usually 2 CPUs and 2GB are enough. If you use the anTraX slurm interface, you'll see it actually spawns more tasks than videos, as it first solves each video separately, then runs a task to stitch tracks between the videos, and finally runs a data export step for each video.
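
To make the numbers above concrete, the following sketch tabulates the quoted per-task resources and computes what a 5-video experiment would request. The dictionary layout and function names are illustrative only and are not anTraX settings; the track memory uses the upper end of the quoted 4-5GB, and the task count assumes exactly one stitching task, as the description suggests.

```python
# Per-task resources as quoted in the reply above.
RESOURCES = {
    "track": {"cpus": 2, "mem_gb": 5},
    "classify": {"cpus": 6, "mem_gb": 6},
    "solve": {"cpus": 2, "mem_gb": 2},
}


def solve_task_count(n_videos):
    """Per-video solve tasks, plus one stitching task, plus per-video export tasks."""
    return n_videos + 1 + n_videos


def peak_request(step, n_parallel):
    """Total (CPUs, GB of memory) if `n_parallel` tasks of `step` run at once."""
    per_task = RESOURCES[step]
    return per_task["cpus"] * n_parallel, per_task["mem_gb"] * n_parallel


print(solve_task_count(5))          # 11
print(peak_request("track", 5))     # (10, 25)
print(peak_request("classify", 5))  # (30, 30)
```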

janamach commented 3 years ago

I've been running the tracking step for a few days already on a compute server (non-HPC) and it should be done tomorrow. I would run the next steps on the HPC server. If I understand this correctly, once the tracking step is done, it doesn't matter if the data was stored in single very large videos or in many smaller videos, is that right?

asafgal commented 3 years ago

Well, it kinda does, just because anTraX uses the per-video data storage and parallelization for the next steps. But the later steps should be much quicker, so if you are almost done with tracking, there is no reason to change things now.

Good luck, will be happy to hear what came out of this!

janamach commented 3 years ago

No, you're right, tracking single huge videos was not a good idea. First, it took days to track them, then I got stuck in the later steps because all commands took too long. Not very efficient.

In the meantime, I segmented my videos into 5 minute clips and started tracking them on the HPC server. After some playing around with parameters, I found the settings that allow me to process up to 3-4 hours of video per (human) hour. That is indeed much more efficient. I think next time I will go for 15 min segments, 5 seems too short.
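
As a rough sanity check of these figures, the arithmetic can be written out as below. It is a sketch that assumes the 60-hour experiment mentioned earlier in the thread is the one being processed; the function names are hypothetical.

```python
import math


def segment_count(total_video_hours, segment_minutes):
    """Number of clips needed to cover the footage at a given clip length."""
    return math.ceil(total_video_hours * 60 / segment_minutes)


def wall_clock_hours(total_video_hours, video_hours_per_hour):
    """Processing time at a given throughput (hours of video per wall-clock hour)."""
    return total_video_hours / video_hours_per_hour


print(segment_count(60, 5))      # 720 clips of 5 minutes
print(segment_count(60, 15))     # 240 clips of 15 minutes
print(wall_clock_hours(60, 3))   # 20.0 hours at 3 video hours per hour
print(wall_clock_hours(60, 4))   # 15.0 hours at 4 video hours per hour
```

Moving from 5-minute to 15-minute clips cuts the number of per-video tasks to a third, which is one reason longer segments may be easier to manage.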

Thanks again!