A likely reason for weird behaviors is the one described below (namely having the wrong server version installed). I'll now test whether fixing it already solves the current issue, or whether there is something else.
I noticed that `pip install` from git sometimes installs an older version than the latest commit (even though it reports cloning the repo up to the most recent one).
```
$ python -c "import fractal_server; print(fractal_server.__file__)"
/SOME/PATH/lib/python3.8/site-packages/fractal_server/__init__.py
```
By looking in that folder, one would see an old version of the code (e.g. a copy of `/SOME/PATH/lib/python3.8/site-packages/fractal_server/app/runner/runner_utils.py` which still has `cpu-2` in the `local` config).
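To double-check which version actually got installed, one can also print the package version directly (the `__VERSION__` attribute is the same one that the server logs at startup, see below):

```bash
python -c "import fractal_server; print(fractal_server.__VERSION__)"
```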
Add `--force-reinstall` to the `pip install git ...` command.
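For instance, with the fractal-server repo (a sketch; the exact URL and git ref may differ in your setup):

```bash
pip install --force-reinstall "git+https://github.com/fractal-analytics-platform/fractal-server.git"
```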
For the record, after updating the install (on my local machine, and with `PARSL_CONFIG=local`), I can run an example where tasks execute on `cpu-low`, `gpu`, `cpu-low` and `gpu` (see `fractal.log` below).
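For context, `PARSL_CONFIG` is a server setting (see the `settings.PARSL_CONFIG` line in the log below), so it can be selected before launching the server; a minimal sketch, assuming the setting is read from the environment:

```bash
# Pick the Parsl configuration the server should use
# ('local' here; the SLURM example further below uses 'pelkmanslab')
export PARSL_CONFIG=local
```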
I'll now test this on SLURM.

Here is `fractal.log`:
```
$ tailf fractal.log
2022-09-26 09:33:12,415; INFO; ********************************************************************************
2022-09-26 09:33:12,415; INFO; fractal_server.__VERSION__: 0.1.3
2022-09-26 09:33:12,415; INFO; Start workflow My workflow 1
2022-09-26 09:33:12,415; INFO; input_paths=[PosixPath('../images/10.5281_zenodo.7059515/*.png')]
2022-09-26 09:33:12,415; INFO; output_path=PosixPath('/home/tommaso/Fractal/fractal/examples/01_cardio_tiny_dataset_with_fake_executors/myproj-1/output/*.zarr')
2022-09-26 09:33:12,415; INFO; settings.PARSL_CONFIG='local'
2022-09-26 09:33:12,415; INFO; settings.PARSL_DEFAULT_EXECUTOR='cpu-low'
2022-09-26 09:33:12,416; INFO; DFK probably missing, proceed with parsl.clear and parsl.config.Config
2022-09-26 09:33:12,567; INFO; DFK <parsl.dataflow.dflow.DataFlowKernel object at 0x7f31fab64eb0> now has 5 executors: ['10___cpu-low', '10___cpu-mid', '10___cpu-high', '10___gpu', '_parsl_internal']
2022-09-26 09:33:12,573; INFO; Starting "Create OME-ZARR structure" task on "10___cpu-low" executor.
2022-09-26 09:33:14,079; INFO; Starting "Yokogawa to Zarr" task on "10___gpu" executor.
2022-09-26 09:33:16,532; INFO; Starting "Replicate Zarr structure" task on "10___cpu-low" executor.
2022-09-26 09:33:17,726; INFO; Starting "Maximum Intensity Projection" task on "10___gpu" executor.
```
This is the example script:
```bash
# Register user (this step will change in the future)
curl -d '{"email":"test@me.com", "password":"test"}' -H "Content-Type: application/json" -X POST localhost:8000/auth/register
# Set useful variables
PRJ_NAME="myproj-1"
DS_IN_NAME="input-ds-1"
DS_OUT_NAME="output-ds-1"
WF_NAME="My workflow 1"
# Define/initialize empty folder for temporary files
TMPDIR=`pwd`/$PRJ_NAME
rm -r $TMPDIR
mkdir $TMPDIR
TMPJSON=${TMPDIR}/tmp.json
TMPTASKS=${TMPDIR}/core_tasks.json
INPUT_PATH=../images/10.5281_zenodo.7059515
OUTPUT_PATH=${TMPDIR}/output
CMD="fractal"
CMD_JSON="poetry run python aux_extract_from_simple_json.py $TMPJSON"
CMD_CORE_TASKS="poetry run python aux_extract_id_for_core_task.py $TMPTASKS"
$CMD task list > $TMPTASKS
# Create project
$CMD -j project new $PRJ_NAME $TMPDIR > $TMPJSON
PRJ_ID=`$CMD_JSON id`
DS_IN_ID=`$CMD_JSON id`
echo "PRJ_ID: $PRJ_ID"
echo "DS_IN_ID: $DS_IN_ID"
# Update dataset name/type, and add a resource
$CMD dataset edit --name "$DS_IN_NAME" -t image --read-only $PRJ_ID $DS_IN_ID
$CMD dataset add-resource -g "*.png" $PRJ_ID $DS_IN_ID $INPUT_PATH
# Add output dataset, and add a resource to it
DS_OUT_ID=`$CMD --batch project add-dataset $PRJ_ID "$DS_OUT_NAME"`
$CMD dataset edit -t zarr --read-write $PRJ_ID $DS_OUT_ID
$CMD dataset add-resource -g "*.zarr" $PRJ_ID $DS_OUT_ID $OUTPUT_PATH
# Create workflow
WF_ID=`$CMD --batch task new "$WF_NAME" workflow image zarr`
echo "WF_ID: $WF_ID"
# Add subtasks
SUBTASK_ID=`$CMD_CORE_TASKS "Create OME-ZARR structure"`
echo "{\"num_levels\": 5, \"coarsening_xy\": 2, \"channel_parameters\": {\"A01_C01\": {\"label\": \"DAPI\",\"colormap\": \"00FFFF\",\"start\": 110,\"end\": 800 }, \"A01_C02\": {\"label\": \"nanog\",\"colormap\": \"FF00FF\",\"start\": 110,\"end\": 290 }, \"A02_C03\": {\"label\": \"Lamin B1\",\"colormap\": \"FFFF00\",\"start\": 110,\"end\": 1600 }}}" > ${TMPDIR}/args_create.json
$CMD task add-subtask $WF_ID $SUBTASK_ID --args-file ${TMPDIR}/args_create.json
SUBTASK_ID=`$CMD_CORE_TASKS "Yokogawa to Zarr"`
echo "{\"executor\": \"gpu\"}" > ${TMPDIR}/args_yoko.json
$CMD task add-subtask $WF_ID $SUBTASK_ID --args-file ${TMPDIR}/args_yoko.json
SUBTASK_ID=`$CMD_CORE_TASKS "Replicate Zarr structure"`
$CMD task add-subtask $WF_ID $SUBTASK_ID
SUBTASK_ID=`$CMD_CORE_TASKS "Maximum Intensity Projection"`
echo "{\"executor\": \"gpu\"}" > ${TMPDIR}/args_mip.json
$CMD task add-subtask $WF_ID $SUBTASK_ID --args-file ${TMPDIR}/args_mip.json
# Apply workflow
$CMD task apply $PRJ_ID $DS_IN_ID $DS_OUT_ID $WF_ID
```
Ah, that sounds promising! It would be good to write updated installation instructions then. Such installation issues are very non-obvious & hard to debug. Great that you figured this one out, let's aim for a setup where we avoid this! :)
Another error we spotted is that the `gpu` partition does not allow a request for ~62G of memory; asking for 61G seems to fix the issue with this partition. This will be fixed in `main` with https://github.com/fractal-analytics-platform/fractal-server/pull/81.
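As a quick sanity check of the partition limit, one could probe SLURM directly (hypothetical commands, assuming a standard SLURM CLI and a partition named `gpu`):

```bash
# Show hostname and configured memory (in MB) for nodes in the gpu partition
sinfo -p gpu -o "%n %m"

# A 61G request should be accepted, while ~62G is rejected
srun -p gpu --mem=61G --pty hostname
```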
I could now run the example in `examples/02_cardio_tiny_dataset_with_gpu`, and the server's `fractal.log` reads:
```
2022-09-26 10:33:27,421; INFO; ********************************************************************************
2022-09-26 10:33:27,421; INFO; fractal_server.__VERSION__: 0.1.3
2022-09-26 10:33:27,421; INFO; Start workflow My workflow gpu
2022-09-26 10:33:27,421; INFO; input_paths=[PosixPath('../images/10.5281_zenodo.7059515/*.png')]
2022-09-26 10:33:27,421; INFO; output_path=PosixPath('/data/homes/fractal/fractal/examples/02_cardio_tiny_dataset_with_gpu/myproj-gpu/output/*.zarr')
2022-09-26 10:33:27,421; INFO; settings.PARSL_CONFIG='pelkmanslab'
2022-09-26 10:33:27,421; INFO; settings.PARSL_DEFAULT_EXECUTOR='cpu-low'
2022-09-26 10:33:27,424; INFO; DFK probably missing, proceed with parsl.clear and parsl.config.Config
2022-09-26 10:33:27,710; INFO; DFK <parsl.dataflow.dflow.DataFlowKernel object at 0x7fc2a096b310> now has 5 executors: ['10___cpu-low', '10___cpu-mid', '10___cpu-high', '10___gpu', '_parsl_internal']
2022-09-26 10:33:27,720; INFO; Starting "Create OME-ZARR structure" task on "10___cpu-low" executor.
2022-09-26 10:33:39,215; INFO; Starting "Yokogawa to Zarr" task on "10___cpu-low" executor.
2022-09-26 10:33:47,616; INFO; Starting "Per-FOV image labeling" task on "10___gpu" executor.
2022-09-26 10:35:23,075; INFO; Starting "Replicate Zarr structure" task on "10___cpu-low" executor.
2022-09-26 10:35:30,008; INFO; Starting "Maximum Intensity Projection" task on "10___cpu-low" executor.
```
where we can see the correct list of executors: `cpu-low`, `cpu-low`, `gpu`, `cpu-low`, `cpu-low`.
In principle this can be closed.
This is now achieved with

```bash
pip install fractal-tasks-core==0.1.5 fractal-client==0.2.1 fractal-server==0.1.4
```

(where the server package was just released).
@jluethi I think this issue is closed, right?
Yes, it was mostly about whether we could run jobs on the GPU, which seems to work nicely now :)
I think the issue is not directly related to the labeling tasks. Even other tasks that run on a GPU node never seem to finish, and they potentially trigger some server errors. See https://github.com/fractal-analytics-platform/fractal/issues/262 for examples of how "Replicate Zarr structure" also never finishes when running on a GPU node.