JaneliaSciComp / multifish

EASI-FISH analysis pipeline for spatial transcriptomics
BSD 3-Clause "New" or "Revised" License

Error with warp_spots on SGE #2

Closed bradleycolquitt closed 2 years ago

bradleycolquitt commented 2 years ago

I'm running into an error while running the multifish pipeline -- the demo_small.sh example script -- on a cluster using the SGE job handler. Everything runs well up to the warp_spots process, at which point I get this error:

Traceback (most recent call last):
  File "/app/bigstream/apply_transform_n5.py", line 83, in <module>
    warped_points[:, 0] = interpolate_image(grid[..., 0], points[:, :3]/vox)
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed, size: 506 (max: 255)

I've run the pipeline successfully on a local workstation, so I'm not quite sure what the difference is here. Any thoughts?

krokicki commented 2 years ago

That's definitely strange, and nothing obvious jumps out. Could you attach your .nextflow.log file?

bradleycolquitt commented 2 years ago

nextflow.log

Sure, thanks. Should be attached.

krokicki commented 2 years ago

What does your spots file look like (just the first few lines)?

/wynton/scratch/brainard/olfaction/multifish/demo_small/outputs/LHA3_R5_small/spots/LHA3_R5_small-to-LHA3_R3_small/merged_points_c0_warped.txt

I'm guessing that something is failing silently and we're just not catching it. I see that you're running with Tower, so you might want to inspect the output for each process and see if there are any obvious errors upstream.
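If it helps, every Nextflow task leaves its logs behind in its work directory, so you can inspect each process directly. A sketch of what to look at (the hash-based directory name here is made up; take the real one from Tower or nextflow log):

    # hypothetical task directory -- substitute the real hash from your run
    cd work/aa/0123456789abcdef0123456789abcdef
    cat .exitcode        # exit status Nextflow recorded for the task
    less .command.sh     # the exact command the task ran
    less .command.log    # combined stdout/stderr captured by Nextflow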

bradleycolquitt commented 2 years ago

File doesn't exist -- what process outputs this?

The first few lines of /wynton/scratch/brainard/olfaction/multifish/demo_small/outputs/LHA3_R5_small/spots/merged_points_c0.txt are

-1.000000000000000000e+00 -1.000000000000000000e+00 -1.000000000000000000e+00 0.000000000000000000e+00

This is different from the same file produced on my workstation:

9.431411582447934450e+01,2.823798825846286320e+01,7.225134647716262748e+01,1.310606338846846775e+05
1.070275300383600268e+02,8.793380873013641974e+01,1.873568609545958452e+01,1.226970181352744403e+05
8.585554046218531710e+01,8.504635616378210727e+01,5.648002104745076224e+01,1.143576641009118903e+05
1.093293295056259922e+02,8.672272746256190601e+01,4.371599674777299072e+01,1.139350045368172869e+05

No obvious errors in upstream processing. Everything exited OK. But perhaps the merge_points process is failing?
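For what it's worth, a points file reduced to a single row like the one above would reproduce the warp traceback, assuming the script loads the points with numpy.loadtxt (an assumption on my part): loadtxt squeezes a one-row file into a 1-D array, and 2-D indexing then fails. A minimal sketch:

    # sketch only: write a one-row points file and index it the way the warp script does
    printf -- '-1.0 -1.0 -1.0 0.0\n' > /tmp/one_row.txt
    python -c 'import numpy as np; p = np.loadtxt("/tmp/one_row.txt"); print(p.shape); p[:, :3]'
    # (4,)
    # IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed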

bradleycolquitt commented 2 years ago

Slight update -- the issue seems to be with airlocalize. When run on our cluster's development node, it works well and takes the expected amount of time. However, when submitted through SGE, the airlocalize process finishes almost immediately -- far too quickly -- suggesting some error...

I submitted a previously generated airlocalize .command.run and got this error:

SystemError: Could not access the MATLAB Runtime component cache. Details: fl:filesystem:SystemError; component cache root:; componentname: AIRLOCALIZE_N5
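For reference, Nextflow leaves each task's runner script in its work directory, so a single task can be resubmitted by hand; this is roughly what I did (paths illustrative):

    cd work/aa/0123456789abcdef0123456789abcdef   # the airlocalize task's work dir
    qsub .command.run                             # resubmit through SGE
    # or run it directly on an interactive node to watch it fail live:
    bash .command.run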

krokicki commented 2 years ago

This might have to do with how the temporary files are being created. Does your cluster have a /scratch directory? In the .command.log, do you see a message starting with "Use MCR_CACHE_ROOT", and is that directory valid on your cluster node?
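A quick way to pull that message out of every task log, assuming the default work/ layout:

    grep -H "Use MCR_CACHE_ROOT" work/*/*/.command.log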

I think we may need to generalize this to allow the user to set the shared scratch directory. @cgoina can you please look into this?

bradleycolquitt commented 2 years ago

Each node has its own locally accessible /scratch. E.g., for one job, the MCR_CACHE_ROOT reported is /scratch/118917.1.long.q/tmp.0AgH04g6iS (where 1189.. is the job ID). Does MCR_CACHE_ROOT need to be globally accessible, or would such node-specific scratch directories work?

There is a global scratch at /wynton/scratch.
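One way to check what a node's /scratch looks like outside the container, from an interactive SGE session (queue name taken from the job directory above):

    qrsh -q long.q
    df -h /scratch                                                # is it mounted, and how big?
    touch /scratch/writetest.$USER && rm /scratch/writetest.$USER # is it writable?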

cgoina commented 2 years ago

I know this is a bit late -- but do you still have the logs from the failed airlocalize jobs? If not, my guess is that the '/scratch' directory is not mounted automatically when the job is invoked. That is on purpose, because not everybody has the same settings we have at Janelia. I don't completely understand what happened there, but it looks like '/scratch' was not detected: 'tmp.0AgH04g6iS' looks like a directory created by mktemp, which is used when '/scratch' is not found, yet somehow TMPDIR was set to the job's directory '/scratch/118917.1.long.q'. That should have been OK, so maybe something else was going on. I will make changes to allow configured scratch dirs, but if you get a chance to re-run this, please post all the logs from one airlocalize job from the work directory.

bradleycolquitt commented 2 years ago

Thanks for your help.

Here is the .command.log from one of the airlocalize jobs.

WARNING: While bind mounting '/wynton/home/brainard/olfaction/multifish/work/76/43e40d17c7ac89a141c63d35f8df04:/wynton/home/brainard/olfaction/multifish/work/76/43e40d17c7ac89a141c63d35f8df04': destination is already in the mount point list
/app/airlocalize/airlocalize.sh /wynton/scratch/brainard/olfaction/multifish/demo_small/outputs/LHA3_R5_small/stitching/export.n5 /c0/s0 /wynton/scratch/brainard/olfaction/multifish/demo_small/outputs/LHA3_R5_small/spots/tiles/0/coords.txt /app/airlocalize/params/air_localize_default_params.txt /wynton/scratch/brainard/olfaction/multifish/demo_small/outputs/LHA3_R5_small/spots/tiles/0 _c0.txt
mkdir: cannot create directory '/scratch': Read-only file system
Use MCR_CACHE_ROOT /scratch/118917.1.long.q/tmp.0AgH04g6iS
Traceback (most recent call last):
  File "/app/airlocalize/scripts/air_localize_mcr.py", line 104, in <module>
    AIRLOCALIZE=AIRLOCALIZE_N5.initialize()
  File "/miniconda/lib/python3.6/site-packages/AIRLOCALIZE_N5/__init__.py", line 305, in initialize
    return _pir.initialize_package()
  File "/miniconda/lib/python3.6/site-packages/AIRLOCALIZE_N5/__init__.py", line 253, in initialize_package
    package_handle.initialize()
  File "/usr/local/MATLAB/MATLAB_Runtime/v95/toolbox/compiler_sdk/pysdk_py/matlab_pysdk/runtime/deployablepackage.py", line 33, in initialize
    mcr_handle = self.__cppext_handle.startMatlabRuntimeInstance(self.__ctf_path)
SystemError: Could not access the MATLAB Runtime component cache. Details: fl:filesystem:SystemError; component cache root:; componentname: AIRLOCALIZE_N5
TERM environment variable not set.

And here is the .command.out:

/app/airlocalize/airlocalize.sh /wynton/scratch/brainard/olfaction/multifish/demo_small/outputs/LHA3_R5_small/stitching/export.n5 /c0/s0 /wynton/scratch/brainard/olfaction/multifish/demo_small/outputs/LHA3_R5_small/spots/tiles/0/coords.txt /app/airlocalize/params/air_localize_default_params.txt /wynton/scratch/brainard/olfaction/multifish/demo_small/outputs/LHA3_R5_small/spots/tiles/0 _c0.txt
Use MCR_CACHE_ROOT /scratch/118917.1.long.q/tmp.0AgH04g6iS

cgoina commented 2 years ago

The problem is that the '/scratch' directory is READ-ONLY. Do you know if your cluster setup uses /scratch/$USER as the scratch area? Until the fix is available, you can try running the demo script like this:

TMPDIR=/wynton/scratch/brainard demo_small.sh

or you can probably even try:

TMPDIR=/scratch/brainard demo_small.sh

bradleycolquitt commented 2 years ago

That's strange. My understanding is that each compute node has a /scratch that is used as local storage for each job (the admins even specifically recommend using this space for its better I/O). My read on "mkdir: cannot create directory '/scratch': Read-only file system" was that '/' was read-only.

For these runs, I had set TMPDIR in demo_small.sh to /wynton/scratch/brainard/olfaction, so it should already have been using the system-wide scratch.

cgoina commented 2 years ago

The job was trying to create the directory '/scratch/118917.1.long.q/tmp.0AgH04g6iS'. If you already set TMPDIR in the script, I would have thought that would work. Do you mind pasting here or uploading your demo_small.sh script?

Also, if you have already changed your demo_small.sh script, you can try this change (just add the -B /scratch option). This should work if '/scratch' exists on every node, but it will fail if the directory is missing, because Singularity fails to bind directories that do not exist:

./main.nf \
        --runtime_opts "--nv -B $BASEDIR -B $datadir -B $TMPDIR -B /scratch --env USER=$USER" \
        --workers "1" \
        --worker_cores "16" \
        --gb_per_core "3" \
        --driver_memory "2g" \
        --channels "c0,c1" \
        --stitching_ref "1" \
        --dapi_channel "c1" \
        --spot_extraction_xy_stride "512" \
        --spot_extraction_z_stride "256" \
        --spot_extraction_cpus "1" \
        --spot_extraction_memory "8" \
        --spark_work_dir "$datadir/spark" \
        --data_dir "$inputdir" \
        --output_dir "$outputdir" \
        --ref_acq "$fixed_round" \
        --acq_names "$fixed_round,$moving_rounds" "$@"
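If you want to verify the bind behavior on a node first, something like this should show the difference (the image path is hypothetical; use whatever container your run pulls):

    # without the bind, the container's root filesystem is read-only and /scratch is absent:
    singularity exec multifish.sif mkdir /scratch
    # mkdir: cannot create directory '/scratch': Read-only file system

    # with the bind, the host directory shows through and is writable:
    singularity exec -B /scratch multifish.sif ls -ld /scratch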

bradleycolquitt commented 2 years ago

Adding -B /scratch worked!

demo_small ran without errors, and LHA3_R3_small/spots/merged_points_c0.txt looks essentially identical to the output on my workstation (apart from small random numerical differences). Thanks for your help on this.
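For anyone else wanting to check this kind of "identical up to small numerical differences" claim programmatically, a rough sketch (paths, delimiters, and tolerances are illustrative; rows are sorted first in case spot order differs between runs):

    # workstation file was comma-delimited, cluster file space-delimited in this thread
    python -c 'import numpy as np; a = np.loadtxt("ws/merged_points_c0.txt", delimiter=","); b = np.loadtxt("cluster/merged_points_c0.txt"); a = a[np.lexsort(a.T[::-1])]; b = b[np.lexsort(b.T[::-1])]; print(np.allclose(a, b, rtol=1e-3, atol=1e-3))'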