bradbell / at_cascade

Cascading Dismod_at Analysis From Parent To Child Regions
https://at-cascade.readthedocs.io

predict_all generates multiprocessing jobs that can fail due to long directory names #11

Closed: ntemiq closed this issue 9 months ago

ntemiq commented 10 months ago

Multiprocessing is failing when long directory names are passed. Because of the shared infrastructure we are working with, plus the additional directories that at-cascade generates, it is likely that a directory name exceeding the limit (around 100 characters) will sometimes be passed.

Traceback example:

Process SyncManager-3:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.11/multiprocessing/managers.py", line 592, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 447, in __init__
    self._listener = SocketListener(address, family, backlog)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 590, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long

Traceback (most recent call last):
  File "", line 1, in
  File "/{path redacted}/at_cascade/csv/predict.py", line 867, in predict
    predict_all(fit_dir, sim_dir,
  File "/{path redacted}/at_cascade/csv/predict.py", line 531, in predict_all
    manager = multiprocessing.Manager()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/context.py", line 57, in Manager
    m.start()
  File "/usr/lib/python3.11/multiprocessing/managers.py", line 567, in start
    self._address = reader.recv()
                    ^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 249, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 413, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 382, in _recv
    raise EOFError
EOFError

The issue appears to arise in the predict_all() method: https://github.com/bradbell/at_cascade/blob/499a7b680a387469af2a659c7b747acea69ed0f3/at_cascade/csv/predict.py#L440C4-L440C11

I suspect a quick solution would be to set the working directory to the fit directory and replace {fit_dir} with "./" so that relative subdirectories of fit_dir are used, but I would need to test whether changing the working directory, or replacing all instances of {fit_dir} in that function, causes other issues.
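
For reference, here is a minimal sketch (not at_cascade code; it assumes Linux/CPython, and the 150-character directory name is arbitrary) that reproduces the same OSError without at_cascade: the SyncManager child process binds its AF_UNIX socket under the temporary directory, so a long TMPDIR alone is enough to exceed the limit.

# Minimal reproduction sketch (not at_cascade code, Linux/CPython assumed):
# multiprocessing.Manager() starts a SyncManager child that binds an AF_UNIX
# socket under the temporary directory, so a long TMPDIR alone can trigger
# OSError: AF_UNIX path too long, matching the traceback above.
import multiprocessing
import os
import tempfile

if __name__ == '__main__':
    # point TMPDIR at an artificially long directory (150 characters is arbitrary)
    long_dir = os.path.join(tempfile.gettempdir(), 'x' * 150)
    os.makedirs(long_dir, exist_ok=True)
    os.environ['TMPDIR'] = long_dir
    tempfile.tempdir = None  # force tempfile to re-read TMPDIR

    # expected: OSError "AF_UNIX path too long" in the child process,
    # and the EOFError shown above in the parent
    manager = multiprocessing.Manager()

In that scenario, the EOFError in the parent traceback is just a side effect of the child process dying while binding its socket.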

bradbell commented 10 months ago

I suspect that long paths are not the problem. Note that all of the at_cascade examples, https://at-cascade.readthedocs.io/example.html, can be run from the top directory of the git repository. To test your hypothesis above, I created the following script (temp.sh):

#! /usr/bin/env bash
set -e -u
if [ ! -e './.git' ]
then
   echo './temp.sh: must be executed from top directory of git repository'
   exit 1
fi
#
# ./
git reset --hard
if [ -e build/example/csv ]
then
   rm -r build/example/csv
fi
#
# file 
file='example/csv/predict_xam.py'
#
# node_name
node_name='n1'
for i in {1..20}
do
   node_name="${node_name}_long_name"
done
#
# file
sed -i $file \
   -e 's|max_number_cpu,1|max_number_cpu,2|' \
   -e "s|n1|$node_name|"
#
# build/example/csv
python3 $file
#
# check
echo "ls build/example/csv/fit/n0/female/$node_name"
ls "build/example/csv/fit/n0/female/$node_name"
#
exit 0

Running this script on my machine gave the following results:

at_cascade.git>./temp.sh
HEAD is now at 08b4962 master: Advance to at_cascade-2024.1.17
Reading csv files:
Creating data structures:
Simulation: total id = 180
End simulation: total seconds = 0
Write files
csv.simulate done
begin reading csv files
begin creating root node database
Begin: 03:46:20: no_ode fit both
End:   03:46:20: no_ode
create: bradbell_n0.both shared memory
Begin: 03:46:20: fit both  n0.both
End:   03:46:22: fit both  n0.both
       {'wait': 0, 'ready': 4, 'run': 0, 'done': 1, 'error': 0, 'abort': 0}
Begin: 03:46:22: fit both  n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.female
Begin: 03:46:22: fit both  n2.female
Begin: 03:46:22: fit both  n2.male
Begin: 03:46:22: fit both  n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.male
End:   03:46:24: fit both  n2.female
       {'wait': 0, 'ready': 0, 'run': 3, 'done': 2, 'error': 0, 'abort': 0}
End:   03:46:24: fit both  n2.male
       {'wait': 0, 'ready': 0, 'run': 2, 'done': 3, 'error': 0, 'abort': 0}
End:   03:46:24: fit both  n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.male
End:   03:46:24: fit both  n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.female
remove: bradbell_n0.both shared memory
       {'wait': 0, 'ready': 0, 'run': 0, 'done': 5, 'error': 0, 'abort': 0}
       {'wait': 0, 'ready': 0, 'run': 0, 'done': 5, 'error': 0, 'abort': 0}
Predict: n_job = 5, n_spawn = 1
Begin: 03:46:24: predict n0.both
Begin: 03:46:25: predict n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.female
End:   03:46:27: predict n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.female 1/5
Begin: 03:46:27: predict n2.female
End:   03:46:27: predict n0.both 2/5
Begin: 03:46:27: predict n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.male
End:   03:46:29: predict n2.female 3/5
Begin: 03:46:29: predict n2.male
End:   03:46:29: predict n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.male 4/5
End:   03:46:30: predict n2.male 5/5
csv_predict_xam: OK
ls build/example/csv/fit/n0/female/n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name
age_avg.csv  hes_fixed.csv       option.csv      simulate.csv
covariate.csv    hes_random.csv      option_sim.csv  trace_fixed.csv
data.csv     log.csv         option_sim_out.csv  trace.out
data_plot.pdf    mixed_info.csv      predict.csv     tru_predict.csv
data_sim.csv     multiplier_sim.csv  random_effect.csv   variable.csv
dismod.db    node.csv        rate_plot.pdf
fit_predict.csv  no_effect_rate.csv  sam_predict.csv
at_cascade.git>
bradbell commented 10 months ago

I suspect a quick solution would be to set the working directory to the fit directory and replace {fit_dir} with "./" so that relative subdirectories of fit_dir are used, but I would need to test whether changing the working directory, or replacing all instances of {fit_dir} in that function, causes other issues.

You could test whether this helps by changing the current directory to the fit_dir and then using '.' for fit_dir in the call to at_cascade.
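
A hypothetical sketch of that test is below; the predict(fit_dir, sim_dir) call form mirrors the traceback in the first comment, and the paths are placeholders, so adjust the arguments to match your actual invocation.

# Hypothetical sketch: run the prediction from inside the fit directory so
# that only short, relative paths are passed to at_cascade.  The call form
# follows the traceback above; adjust it to your real invocation.
import os
import at_cascade.csv

fit_dir = '/very/long/shared/infrastructure/path/to/fit'   # placeholder
sim_dir = None                                             # placeholder
os.chdir(fit_dir)
at_cascade.csv.predict('.', sim_dir)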

ntemiq commented 9 months ago

You're right, of course: this doesn't throw the error we've seen, so it may be an issue with our implementation or local build. I'm going to close this issue and will open a new one if further investigation shows the problem is in at_cascade rather than in our implementation.

ntemiq commented 9 months ago

Can't reproduce in a fresh Docker container. Investigating our implementation.
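
One environment check that may help with that investigation (a sketch, not from this thread; it assumes Linux/CPython): multiprocessing typically creates its AF_UNIX listener sockets under a per-process temporary directory derived from TMPDIR, and the usable socket path is limited to roughly 108 bytes on Linux, so printing that directory shows whether the environment itself is close to the limit.

# Hypothetical diagnostic: print the directory where multiprocessing will
# place its AF_UNIX listener sockets.  If this path is already near the
# ~108 byte sun_path limit, multiprocessing.Manager() can fail with
# "AF_UNIX path too long" no matter what fit_dir is passed to at_cascade.
import multiprocessing.util

tmp = multiprocessing.util.get_temp_dir()
print(f'multiprocessing temp dir: {tmp} ({len(tmp)} characters)')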