Eddymorphling closed this issue 1 month ago.
Could you attach the log file?
Yes, here is the log. Just want to add that I am running this on a LSF cluster that uses a single GPU and 4 CPU cores for my job. Here is the log output:
************** BRAINMAPPER LOG **************
Ran at : 2024-07-16_10-01-06
Output directory: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output
Current directory: /home/mvn/HPC_Jobs
Version: 1.3.0
************** GIT INFO **************
Gitpython is not installed. Cannot check if software is in a git repository
************** COMMAND LINE ARGUMENTS **************
Command: /home/mvn/conda/envs/brainglobe-env/bin/brainmapper
Input arguments: ['-s', '/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/tiffs', '-b', '/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/tiffs', '-o', '/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output', '-v', '3.9', '3.9', '3.9', '--orientation', 'sal', '--soma-spread-factor', '1.4', '--threshold', '9', '--soma-diameter', '10', '--atlas', 'allen_mouse_10um', '--ball-xy-size', '2', '--ball-z-size', '2', '--trained-model', '/home/mvn/scratch/BigStitcher_output/MSR26395/models/R5/model.h5', '--ball-overlap-fraction', '0.4', '--soma-spread-factor', '0', '--batch-size', '256', '--n-free-cpus', '1']
************** VARIABLES **************
Namespace:
signal_planes_paths: ['/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/tiffs']
background_planes_path: ['/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/tiffs']
output_dir: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output
signal_ch_ids: None
background_ch_id: None
registration_config: /home/mvn/.brainglobe/cellfinder/cellfinder.conf.custom
voxel_sizes: ['3.9', '3.9', '3.9']
network_voxel_sizes: [5, 1, 1]
no_detection: False
no_classification: False
no_register: False
no_analyse: False
no_figures: False
start_plane: 0
end_plane: -1
save_planes: False
outlier_keep: False
artifact_keep: False
max_cluster_size: 100000
soma_diameter: 10.0
ball_xy_size: 2
ball_z_size: 2
ball_overlap_fraction: 0.4
log_sigma_size: 0.2
n_sds_above_mean_thresh: 9.0
soma_spread_factor: 0.0
trained_model: /home/mvn/scratch/BigStitcher_output/MSR26395/models/R5/model.h5
model_weights: None
network_depth: 50
batch_size: 256
cube_width: 50
cube_height: 50
cube_depth: 20
save_empty_cubes: False
n_free_cpus: 1
max_ram: None
save_csv: False
debug: False
sort_input_file: False
heatmap_smooth: 100
mask_figures: True
install_path: /home/mvn/.brainglobe/cellfinder/models
no_amend_config: False
model: resnet50_tv
atlas: allen_mouse_10um
orientation: sal
backend: niftyreg
affine_n_steps: 6
affine_use_n_steps: 5
freeform_n_steps: 6
freeform_use_n_steps: 4
bending_energy_weight: 0.95
grid_spacing: -10
smoothing_sigma_reference: -1.0
smoothing_sigma_floating: -1.0
histogram_n_bins_floating: 128
histogram_n_bins_reference: 128
Paths:
output_dir: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output
registration_output_folder: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output/registration
metadata_path: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output/brainmapper.json
registration_metadata_path: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output/registration/brainreg.json
************** LOGGING **************
2024-07-16 10:01:06 AM - INFO - MainProcess fancylog.py:314 - Starting logging
2024-07-16 10:01:06 AM - INFO - MainProcess fancylog.py:315 - Multiprocessing-logging module found. Logging from all processes
2024-07-16 10:01:06 AM - WARNING - MainProcess prep.py:262 - Registered atlas exists, assuming already run. Skipping.
2024-07-16 10:01:08 AM - INFO - MainProcess main.py:70 - Skipping registration
2024-07-16 10:01:34 AM - DEBUG - MainProcess __init__.py:47 - Creating converter from 7 to 5
2024-07-16 10:01:34 AM - DEBUG - MainProcess __init__.py:47 - Creating converter from 5 to 7
2024-07-16 10:01:34 AM - DEBUG - MainProcess __init__.py:47 - Creating converter from 7 to 5
2024-07-16 10:01:34 AM - DEBUG - MainProcess __init__.py:47 - Creating converter from 5 to 7
2024-07-16 10:01:42 AM - WARNING - MainProcess prep.py:262 - Registered atlas exists, assuming already run. Skipping.
2024-07-16 10:01:42 AM - INFO - MainProcess main.py:116 - Detecting cell candidates
2024-07-16 10:01:42 AM - DEBUG - MainProcess system.py:203 - Determining the maximum number of CPU cores to use
2024-07-16 10:01:42 AM - DEBUG - MainProcess system.py:208 - Number of CPU cores available is: 63
2024-07-16 10:01:42 AM - DEBUG - MainProcess system.py:235 - Setting number of processes to: 63
Ah, I think the problem may be that brainmapper isn't aware of the LSF environment; we only have support for SLURM. brainmapper "asks" the system how many CPU cores are available, and this is the physical number on the machine, not what is allocated to your job. We would love to support other job schedulers, but we would likely need a considerable amount of time from someone with access to one of these systems to develop and test things. In the meantime, could you try setting `--n-free-cpus` to (number of physical CPU cores minus number of cores you requested)? In this case I think it should be 60.
N.B. Using more cores should speed up detection considerably.
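The arithmetic above can be sketched in a few lines of Python. `LSB_DJOB_NUMPROC` is the environment variable LSF sets inside a job to the allocated core count; the fallback default of 1 is an assumption for illustration:

```python
import multiprocessing
import os


def n_free_cpus(total_cores: int, allocated: int) -> int:
    """Value to pass to --n-free-cpus so brainmapper only uses the
    job's allocation: cores on the node minus cores LSF gave us."""
    return max(total_cores - allocated, 0)


if __name__ == "__main__":
    total = multiprocessing.cpu_count()
    # LSB_DJOB_NUMPROC is set by LSF inside a job; default to 1 outside one.
    allocated = int(os.environ.get("LSB_DJOB_NUMPROC", "1"))
    print(f"--n-free-cpus {n_free_cpus(total, allocated)}")
```

With a 64-core node and a 4-core allocation this gives 60, matching the suggestion above.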
@adamltyson Thanks! The `brainmapper` CLI used to work well on my LSF environment in the past, so I am not sure what has changed with the recent updates. Yesterday, I found a workaround. I noticed that in `detect.py` (line 166) the number of threads is defined by `set_num_threads(max(n_ball_procs - int(ball_z_size), 1))`. I replaced this to manually set the minimum and maximum depending on the number of cores I had requested for my LSF job. Here is what I did:
```python
num_threads = max(n_ball_procs - int(ball_z_size), 1)
num_threads = min(num_threads, 4)  # clamp to the 4 cores requested for this job
set_num_threads(num_threads)
```
This seems to work for now. Is this a simple workaround? I have not tried the suggestions you had mentioned in the post yet.
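As a sketch, the same clamp could avoid hardcoding the limit by reading the LSF allocation from the environment. `clamp_num_threads` is a hypothetical helper name; `LSB_DJOB_NUMPROC` is the variable LSF sets inside a job to the allocated core count:

```python
import os


def clamp_num_threads(n_ball_procs: int, ball_z_size: int) -> int:
    """Clamp the detection thread count to the job's LSF allocation
    instead of a hardcoded maximum; outside an LSF job (variable not
    set), fall back to the unclamped value."""
    num_threads = max(n_ball_procs - int(ball_z_size), 1)
    allocated = os.environ.get("LSB_DJOB_NUMPROC")  # set by LSF inside a job
    if allocated is not None:
        num_threads = min(num_threads, int(allocated))
    return num_threads
```

`set_num_threads(clamp_num_threads(n_ball_procs, ball_z_size))` would then replace the patched lines, using the full node only when run outside an LSF job.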
Also, my self-trained TF2 models that I created in the past do not seem to work with PyTorch, unfortunately :(
> This seems to work for now. Is this a simple workaround? I have not tried the suggestions you had mentioned in the post yet.
Possibly, with the caveats that it may stop working in the future, and I'm not sure if there will be other issues down the line.
> Also, my self-trained TF2 models that I created in the past do not seem to work with PyTorch, unfortunately :(
Weird, this did work in all of our testing. @IgorTatarnikov do you have any ideas?
No ideas off the top of my head. How old are the models? Perhaps we're hitting some sort of double legacy format? Would it be possible to share a model that's not working for testing purposes?
There shouldn't be a "double legacy issue", the models we were testing this on were trained before we released cellfinder!
@Eddymorphling if re-training these models would be a problem for you, share one of the older TF ones and we can take a look and see if we can get it to work.
Hi @adamltyson and @IgorTatarnikov. Sorry for the delay. I think I got the new setup that uses PyTorch instead of TF working in the end on my LSF cluster. I also realised that I would need to make a new model in any case, so I decided to let go of the old TF model for now.
Also, I had to downgrade `numpy` from `2.0.0` to `1.23.4` to get everything running smoothly.
I don't suppose you know what the error was with numpy 2.0.0? I wasn't aware of any incompatibilities, but we should fix this!
Ah unfortunately no :(
No worries. I'll close this issue now, but reach out if you have any other issues.
Hi folks, I decided to set up a new brainglobe-workflow env today. I set up BrainGlobe using Python 3.10, followed all the instructions found here, and also set up PyTorch to use the GPU. Brain registration runs well, but cell candidate detection throws an error (below).
Any clue what might be wrong?