fractal-analytics-platform / fractal-server

Fractal backend
https://fractal-analytics-platform.github.io/fractal-server/
BSD 3-Clause "New" or "Revised" License

Review labeling task with new server config #76

Closed. tcompa closed this issue 2 years ago.

jluethi commented 2 years ago

I think the issue is not directly related to the labeling tasks. Even other tasks that run on a GPU node never seem to finish, and they may also trigger server errors. See https://github.com/fractal-analytics-platform/fractal/issues/262 for examples of how "Replicate Zarr structure" also never finishes when running on a GPU node.

tcompa commented 2 years ago

A likely reason for the weird behavior is the one described below (namely, having the wrong server version installed). I'll now test whether fixing it already solves the current issue, or whether something else is going on.

Error

I noticed that pip install from git sometimes installs an older version than the latest commit (even though it reports cloning the repo up to the most recent one).

How to check for this error:

$ python -c "import fractal_server; print(fractal_server.__file__)"
/SOME/PATH/lib/python3.8/site-packages/fractal_server/__init__.py

Looking in that folder then reveals an old version (e.g. a copy of /SOME/PATH/lib/python3.8/site-packages/fractal_server/app/runner/runner_utils.py which still has cpu-2 in the local config).
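As a quicker check, one can also print the installed version directly (this relies on the fractal_server.__VERSION__ attribute, the same one that shows up in the server logs below):

$ python -c "import fractal_server; print(fractal_server.__VERSION__)"

If this does not match the version expected from the latest commit, the install is stale.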

Quick fix

Add --force-reinstall to the pip install git ... command.
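For instance (a sketch, since the exact URL/branch of the original install command may differ):

$ pip install --force-reinstall "git+https://github.com/fractal-analytics-platform/fractal-server.git"

This forces pip to reinstall the package even when it believes the requirement is already satisfied.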

tcompa commented 2 years ago

For the record, after updating the install (on my local machine, and with PARSL_CONFIG=local), I can run an example where tasks execute on cpu-low, gpu, cpu-low and gpu (see fractal.log below). I'll now test this on SLURM.

Here is fractal.log:

$ tailf fractal.log 
2022-09-26 09:33:12,415; INFO; ********************************************************************************
2022-09-26 09:33:12,415; INFO; fractal_server.__VERSION__: 0.1.3
2022-09-26 09:33:12,415; INFO; Start workflow My workflow 1
2022-09-26 09:33:12,415; INFO; input_paths=[PosixPath('../images/10.5281_zenodo.7059515/*.png')]
2022-09-26 09:33:12,415; INFO; output_path=PosixPath('/home/tommaso/Fractal/fractal/examples/01_cardio_tiny_dataset_with_fake_executors/myproj-1/output/*.zarr')
2022-09-26 09:33:12,415; INFO; settings.PARSL_CONFIG='local'
2022-09-26 09:33:12,415; INFO; settings.PARSL_DEFAULT_EXECUTOR='cpu-low'
2022-09-26 09:33:12,416; INFO; DFK probably missing, proceed with parsl.clear and parsl.config.Config
2022-09-26 09:33:12,567; INFO; DFK <parsl.dataflow.dflow.DataFlowKernel object at 0x7f31fab64eb0> now has 5 executors: ['10___cpu-low', '10___cpu-mid', '10___cpu-high', '10___gpu', '_parsl_internal']
2022-09-26 09:33:12,573; INFO; Starting "Create OME-ZARR structure" task on "10___cpu-low" executor.
2022-09-26 09:33:14,079; INFO; Starting "Yokogawa to Zarr" task on "10___gpu" executor.
2022-09-26 09:33:16,532; INFO; Starting "Replicate Zarr structure" task on "10___cpu-low" executor.
2022-09-26 09:33:17,726; INFO; Starting "Maximum Intensity Projection" task on "10___gpu" executor.

This is the example script:

# Register user (this step will change in the future)
curl -d '{"email":"test@me.com", "password":"test"}' -H "Content-Type: application/json" -X POST localhost:8000/auth/register

# Set useful variables
PRJ_NAME="myproj-1"
DS_IN_NAME="input-ds-1"
DS_OUT_NAME="output-ds-1"
WF_NAME="My workflow 1"

# Define/initialize empty folder for temporary files
TMPDIR=`pwd`/$PRJ_NAME
rm -rf $TMPDIR  # -f: do not fail on the first run, when the folder does not exist yet
mkdir $TMPDIR
TMPJSON=${TMPDIR}/tmp.json
TMPTASKS=${TMPDIR}/core_tasks.json

INPUT_PATH=../images/10.5281_zenodo.7059515
OUTPUT_PATH=${TMPDIR}/output

CMD="fractal"
CMD_JSON="poetry run python aux_extract_from_simple_json.py $TMPJSON"
CMD_CORE_TASKS="poetry run python aux_extract_id_for_core_task.py $TMPTASKS"
$CMD task list > $TMPTASKS

# Create project
$CMD -j project new $PRJ_NAME $TMPDIR > $TMPJSON
PRJ_ID=`$CMD_JSON id`
DS_IN_ID=`$CMD_JSON id`
echo "PRJ_ID: $PRJ_ID"
echo "DS_IN_ID: $DS_IN_ID"

# Update dataset name/type, and add a resource
$CMD dataset edit --name "$DS_IN_NAME" -t image --read-only $PRJ_ID $DS_IN_ID
$CMD dataset add-resource -g "*.png" $PRJ_ID $DS_IN_ID $INPUT_PATH

# Add output dataset, and add a resource to it
DS_OUT_ID=`$CMD --batch project add-dataset $PRJ_ID "$DS_OUT_NAME"`
$CMD dataset edit -t zarr --read-write $PRJ_ID $DS_OUT_ID
$CMD dataset add-resource -g "*.zarr" $PRJ_ID $DS_OUT_ID $OUTPUT_PATH

# Create workflow
WF_ID=`$CMD --batch task new "$WF_NAME" workflow image zarr`
echo "WF_ID: $WF_ID"

# Add subtasks

SUBTASK_ID=`$CMD_CORE_TASKS "Create OME-ZARR structure"`
echo "{\"num_levels\": 5, \"coarsening_xy\": 2, \"channel_parameters\": {\"A01_C01\": {\"label\": \"DAPI\",\"colormap\": \"00FFFF\",\"start\": 110,\"end\": 800 }, \"A01_C02\": {\"label\": \"nanog\",\"colormap\": \"FF00FF\",\"start\": 110,\"end\": 290 }, \"A02_C03\": {\"label\": \"Lamin B1\",\"colormap\": \"FFFF00\",\"start\": 110,\"end\": 1600 }}}" > ${TMPDIR}/args_create.json
$CMD task add-subtask $WF_ID $SUBTASK_ID --args-file ${TMPDIR}/args_create.json

SUBTASK_ID=`$CMD_CORE_TASKS "Yokogawa to Zarr"`
echo "{\"executor\": \"gpu\"}" > ${TMPDIR}/args_yoko.json
$CMD task add-subtask $WF_ID $SUBTASK_ID --args-file ${TMPDIR}/args_yoko.json

SUBTASK_ID=`$CMD_CORE_TASKS "Replicate Zarr structure"`
$CMD task add-subtask $WF_ID $SUBTASK_ID

SUBTASK_ID=`$CMD_CORE_TASKS "Maximum Intensity Projection"`
echo "{\"executor\": \"gpu\"}" > ${TMPDIR}/args_mip.json
$CMD task add-subtask $WF_ID $SUBTASK_ID --args-file ${TMPDIR}/args_mip.json

# Apply workflow
$CMD task apply $PRJ_ID $DS_IN_ID $DS_OUT_ID $WF_ID
jluethi commented 2 years ago

Ah, that sounds promising! It would be good to write updated installation instructions then. Such installation issues are very non-obvious and hard to debug. Great that you figured this one out; let's aim for a setup where we avoid this! :)

tcompa commented 2 years ago

Another error we spotted is that the gpu partition does not allow a request for ~62G of memory. Asking for 61G seems to fix the issue with this partition. This will be fixed in main with https://github.com/fractal-analytics-platform/fractal-server/pull/81.
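For reference, the failing vs. working requests correspond to SLURM directives along these lines (illustrative only; in practice the memory value is set through the server's Parsl/SLURM configuration rather than a hand-written batch script):

#SBATCH --partition=gpu
#SBATCH --mem=61G   # a request of ~62G is rejected on this partition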

tcompa commented 2 years ago

I could now run the example in examples/02_cardio_tiny_dataset_with_gpu, and the server's fractal.log reads

2022-09-26 10:33:27,421; INFO; ********************************************************************************
2022-09-26 10:33:27,421; INFO; fractal_server.__VERSION__: 0.1.3
2022-09-26 10:33:27,421; INFO; Start workflow My workflow gpu
2022-09-26 10:33:27,421; INFO; input_paths=[PosixPath('../images/10.5281_zenodo.7059515/*.png')]
2022-09-26 10:33:27,421; INFO; output_path=PosixPath('/data/homes/fractal/fractal/examples/02_cardio_tiny_dataset_with_gpu/myproj-gpu/output/*.zarr')
2022-09-26 10:33:27,421; INFO; settings.PARSL_CONFIG='pelkmanslab'
2022-09-26 10:33:27,421; INFO; settings.PARSL_DEFAULT_EXECUTOR='cpu-low'
2022-09-26 10:33:27,424; INFO; DFK probably missing, proceed with parsl.clear and parsl.config.Config
2022-09-26 10:33:27,710; INFO; DFK <parsl.dataflow.dflow.DataFlowKernel object at 0x7fc2a096b310> now has 5 executors: ['10___cpu-low', '10___cpu-mid', '10___cpu-high', '10___gpu', '_parsl_internal']
2022-09-26 10:33:27,720; INFO; Starting "Create OME-ZARR structure" task on "10___cpu-low" executor.
2022-09-26 10:33:39,215; INFO; Starting "Yokogawa to Zarr" task on "10___cpu-low" executor.
2022-09-26 10:33:47,616; INFO; Starting "Per-FOV image labeling" task on "10___gpu" executor.
2022-09-26 10:35:23,075; INFO; Starting "Replicate Zarr structure" task on "10___cpu-low" executor.
2022-09-26 10:35:30,008; INFO; Starting "Maximum Intensity Projection" task on "10___cpu-low" executor.

where we can see the correct list of executors: cpu-low, cpu-low, gpu, cpu-low, cpu-low.

In principle this can be closed.

tcompa commented 2 years ago

This is now achieved with

pip install fractal-tasks-core==0.1.5 fractal-client==0.2.1 fractal-server==0.1.4

(where fractal-server was just released)
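A quick way to verify the pinned versions after installing (assuming standard pip metadata is available):

$ pip show fractal-server fractal-client fractal-tasks-core | grep -E "^(Name|Version)"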

tcompa commented 2 years ago

@jluethi I think this issue is closed, right?

jluethi commented 2 years ago

Yes, it was mostly about whether we could run jobs on the GPU, which seems to work nicely now :)