Closed: ntemiq closed this issue 9 months ago.
I suspect that long paths are not the problem. Note that all the at_cascade examples (https://at-cascade.readthedocs.io/example.html) can be run from the top directory of the git source tree. To test the hypothesis above, I created the following script (temp.sh):
#! /usr/bin/env bash
set -e -u
if [ ! -e './.git' ]
then
echo './temp.sh: must be executed from top directory of git repository'
exit 1
fi
#
# ./
# restore the working copy to its committed state
git reset --hard
if [ -e build/example/csv ]
then
rm -r build/example/csv
fi
#
# file
file='example/csv/predict_xam.py'
#
# node_name
node_name='n1'
for i in {1..20}
do
node_name="${node_name}_long_name"
done
#
# file
# edit the example: use two cpus and substitute the long node name for n1
sed -i \
-e 's|max_number_cpu,1|max_number_cpu,2|' \
-e "s|n1|$node_name|" \
"$file"
#
# build/example/csv
# run the example; its output is written below build/example/csv
python3 "$file"
#
# check
echo "ls build/example/csv/fit/n0/female/$node_name"
ls "build/example/csv/fit/n0/female/$node_name"
#
exit 0
Running this script on my machine gave the following results:
at_cascade.git>./temp.sh
HEAD is now at 08b4962 master: Advance to at_cascade-2024.1.17
Reading csv files:
Creating data structures:
Simulation: total id = 180
End simulation: total seconds = 0
Write files
csv.simulate done
begin reading csv files
begin creating root node database
Begin: 03:46:20: no_ode fit both
End: 03:46:20: no_ode
create: bradbell_n0.both shared memory
Begin: 03:46:20: fit both n0.both
End: 03:46:22: fit both n0.both
{'wait': 0, 'ready': 4, 'run': 0, 'done': 1, 'error': 0, 'abort': 0}
Begin: 03:46:22: fit both n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.female
Begin: 03:46:22: fit both n2.female
Begin: 03:46:22: fit both n2.male
Begin: 03:46:22: fit both n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.male
End: 03:46:24: fit both n2.female
{'wait': 0, 'ready': 0, 'run': 3, 'done': 2, 'error': 0, 'abort': 0}
End: 03:46:24: fit both n2.male
{'wait': 0, 'ready': 0, 'run': 2, 'done': 3, 'error': 0, 'abort': 0}
End: 03:46:24: fit both n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.male
End: 03:46:24: fit both n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.female
remove: bradbell_n0.both shared memory
{'wait': 0, 'ready': 0, 'run': 0, 'done': 5, 'error': 0, 'abort': 0}
{'wait': 0, 'ready': 0, 'run': 0, 'done': 5, 'error': 0, 'abort': 0}
Predict: n_job = 5, n_spawn = 1
Begin: 03:46:24: predict n0.both
Begin: 03:46:25: predict n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.female
End: 03:46:27: predict n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.female 1/5
Begin: 03:46:27: predict n2.female
End: 03:46:27: predict n0.both 2/5
Begin: 03:46:27: predict n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.male
End: 03:46:29: predict n2.female 3/5
Begin: 03:46:29: predict n2.male
End: 03:46:29: predict n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name.male 4/5
End: 03:46:30: predict n2.male 5/5
csv_predict_xam: OK
ls build/example/csv/fit/n0/female/n1_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name_long_name
age_avg.csv hes_fixed.csv option.csv simulate.csv
covariate.csv hes_random.csv option_sim.csv trace_fixed.csv
data.csv log.csv option_sim_out.csv trace.out
data_plot.pdf mixed_info.csv predict.csv tru_predict.csv
data_sim.csv multiplier_sim.csv random_effect.csv variable.csv
dismod.db node.csv rate_plot.pdf
fit_predict.csv no_effect_rate.csv sam_predict.csv
at_cascade.git>
Regarding the suggestion to set the working directory to the fit directory and replace {fit_dir} with "./": you could test whether this helps by changing the current directory to fit_dir and then using "." for fit_dir in the call to at_cascade.
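For concreteness, a minimal sketch of that test. The csv.fit and csv.predict calls below are assumptions based on the csv examples; check the at_cascade documentation for the exact entry points and arguments:

import os
import at_cascade

# fit_dir as created by the example script above
abs_fit_dir = os.path.abspath('build/example/csv/fit')

# change into fit_dir so everything below it is named with short relative
# paths, then pass '.' instead of the (possibly very long) absolute path
os.chdir(abs_fit_dir)
at_cascade.csv.fit('.')       # assumed call: csv.fit(fit_dir)
at_cascade.csv.predict('.')   # assumed call: csv.predict(fit_dir)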
You're right, of course; this doesn't throw the error we've seen. It might be an issue with our implementation or local build. I'm going to close this issue and will open a new one if investigation determines it is an issue with at_cascade rather than our implementation.
I can't reproduce this in a fresh Docker container. Investigating our implementation.
Multiprocessing is failing when long directory names are passed. Given the shared infrastructure we are working with, as well as the additional directories generated by at_cascade, it is likely that a directory name exceeding the limit (around 100 characters) will sometimes be passed.
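One limit in this neighborhood (an assumption on my part; the actual failure in our setup has not been confirmed to be this) is the AF_UNIX socket path limit of about 107 characters on Linux, and Python's multiprocessing connection machinery does use such sockets. A self-contained demonstration of that limit, independent of at_cascade:

import os
import tempfile
from multiprocessing.connection import Listener

# build a directory whose absolute path is well over 107 characters,
# similar to the nested node directories that the cascade creates
base = tempfile.mkdtemp()
long_dir = os.path.join(base, 'n1' + '_long_name' * 20)
os.makedirs(long_dir, exist_ok=True)

try:
    # binding an AF_UNIX socket to a path inside the long directory
    # fails with an OSError such as 'AF_UNIX path too long'
    listener = Listener(os.path.join(long_dir, 'sock'), family='AF_UNIX')
except OSError as error:
    print('bind failed:', error)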
Traceback example:
The issue appears to arise in the predict_all() method: https://github.com/bradbell/at_cascade/blob/499a7b680a387469af2a659c7b747acea69ed0f3/at_cascade/csv/predict.py#L440C4-L440C11
I suspect a quick solution would be to set the working directory to the fit directory and replace {fit_dir} with "./" so that subdirectories of fit_dir are referenced with relative paths, but I would need to test whether changing the working directory, or replacing all instances of {fit_dir} in that function, would cause issues.
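For illustration only, a sketch of the general pattern being proposed; this is not the actual predict_all() code, and all names below are hypothetical:

import os
import tempfile
import multiprocessing

def run_job(fit_dir, job_name):
    # hypothetical worker: every path it builds is relative to fit_dir
    job_dir = os.path.join(fit_dir, job_name)
    os.makedirs(job_dir, exist_ok=True)
    print('created', os.path.abspath(job_dir))

if __name__ == '__main__':
    # stand-in for a fit directory with a very long absolute path
    abs_fit_dir = os.path.join(tempfile.mkdtemp(), 'n1' + '_long_name' * 20)
    os.makedirs(abs_fit_dir)
    # change directory once in the parent and hand the short relative
    # form '.' to the spawned process instead of the long absolute path
    os.chdir(abs_fit_dir)
    worker = multiprocessing.Process(target=run_job, args=('.', 'n0/female'))
    worker.start()
    worker.join()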