Eddymorphling closed this issue 1 month ago.
Could you attach the log file?
Yes, here is the log. Just want to add that I am running this on a LSF cluster that uses a single GPU and 4 CPU cores for my job. Here is the log output:
************** BRAINMAPPER LOG **************
Ran at : 2024-07-16_10-01-06
Output directory: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output
Current directory: /home/mvn/HPC_Jobs
Version: 1.3.0
************** GIT INFO **************
Gitpython is not installed. Cannot check if software is in a git repository
************** COMMAND LINE ARGUMENTS **************
Command: /home/mvn/conda/envs/brainglobe-env/bin/brainmapper
Input arguments: ['-s', '/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/tiffs', '-b', '/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/tiffs', '-o', '/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output', '-v', '3.9', '3.9', '3.9', '--orientation', 'sal', '--soma-spread-factor', '1.4', '--threshold', '9', '--soma-diameter', '10', '--atlas', 'allen_mouse_10um', '--ball-xy-size', '2', '--ball-z-size', '2', '--trained-model', '/home/mvn/scratch/BigStitcher_output/MSR26395/models/R5/model.h5', '--ball-overlap-fraction', '0.4', '--soma-spread-factor', '0', '--batch-size', '256', '--n-free-cpus', '1']
************** VARIABLES **************
Namespace:
signal_planes_paths: ['/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/tiffs']
background_planes_path: ['/home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/tiffs']
output_dir: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output
signal_ch_ids: None
background_ch_id: None
registration_config: /home/mvn/.brainglobe/cellfinder/cellfinder.conf.custom
voxel_sizes: ['3.9', '3.9', '3.9']
network_voxel_sizes: [5, 1, 1]
no_detection: False
no_classification: False
no_register: False
no_analyse: False
no_figures: False
start_plane: 0
end_plane: -1
save_planes: False
outlier_keep: False
artifact_keep: False
max_cluster_size: 100000
soma_diameter: 10.0
ball_xy_size: 2
ball_z_size: 2
ball_overlap_fraction: 0.4
log_sigma_size: 0.2
n_sds_above_mean_thresh: 9.0
soma_spread_factor: 0.0
trained_model: /home/mvn/scratch/BigStitcher_output/MSR26395/models/R5/model.h5
model_weights: None
network_depth: 50
batch_size: 256
cube_width: 50
cube_height: 50
cube_depth: 20
save_empty_cubes: False
n_free_cpus: 1
max_ram: None
save_csv: False
debug: False
sort_input_file: False
heatmap_smooth: 100
mask_figures: True
install_path: /home/mvn/.brainglobe/cellfinder/models
no_amend_config: False
model: resnet50_tv
atlas: allen_mouse_10um
orientation: sal
backend: niftyreg
affine_n_steps: 6
affine_use_n_steps: 5
freeform_n_steps: 6
freeform_use_n_steps: 4
bending_energy_weight: 0.95
grid_spacing: -10
smoothing_sigma_reference: -1.0
smoothing_sigma_floating: -1.0
histogram_n_bins_floating: 128
histogram_n_bins_reference: 128
Paths:
output_dir: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output
registration_output_folder: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output/registration
metadata_path: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output/brainmapper.json
registration_metadata_path: /home/mvn/scratch/BigStitcher_output/MSR26395/Brain_627/dataset/output/registration/brainreg.json
************** LOGGING **************
2024-07-16 10:01:06 AM - INFO - MainProcess fancylog.py:314 - Starting logging
2024-07-16 10:01:06 AM - INFO - MainProcess fancylog.py:315 - Multiprocessing-logging module found. Logging from all processes
2024-07-16 10:01:06 AM - WARNING - MainProcess prep.py:262 - Registered atlas exists, assuming already run. Skipping.
2024-07-16 10:01:08 AM - INFO - MainProcess main.py:70 - Skipping registration
2024-07-16 10:01:34 AM - DEBUG - MainProcess __init__.py:47 - Creating converter from 7 to 5
2024-07-16 10:01:34 AM - DEBUG - MainProcess __init__.py:47 - Creating converter from 5 to 7
2024-07-16 10:01:34 AM - DEBUG - MainProcess __init__.py:47 - Creating converter from 7 to 5
2024-07-16 10:01:34 AM - DEBUG - MainProcess __init__.py:47 - Creating converter from 5 to 7
2024-07-16 10:01:42 AM - WARNING - MainProcess prep.py:262 - Registered atlas exists, assuming already run. Skipping.
2024-07-16 10:01:42 AM - INFO - MainProcess main.py:116 - Detecting cell candidates
2024-07-16 10:01:42 AM - DEBUG - MainProcess system.py:203 - Determining the maximum number of CPU cores to use
2024-07-16 10:01:42 AM - DEBUG - MainProcess system.py:208 - Number of CPU cores available is: 63
2024-07-16 10:01:42 AM - DEBUG - MainProcess system.py:235 - Setting number of processes to: 63
Ah, I think the problem may be that brainmapper isn't aware of the LSF environment; we only have support for SLURM. brainmapper "asks" the system how many CPU cores are available, and this is the physical number on the machine, not what is allocated to your job. We would love to support other job schedulers, but we would likely need a considerable amount of time from someone with access to one of these systems to develop and test things. In the meantime, could you try setting `--n-free-cpus` to (number of physical CPU cores minus number of cores you requested)? In this case I think it should be 60.
N.B. Using more cores should speed up detection considerably.
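The arithmetic above can be sketched in a few lines of Python. `LSB_DJOB_NUMPROC` is the environment variable LSF sets inside a job to the allocated core count; the fallback default of 1 is an assumption for illustration:

```python
import multiprocessing
import os


def n_free_cpus(total_cores: int, allocated: int) -> int:
    """Value to pass to --n-free-cpus so brainmapper only uses the
    job's allocation: cores on the node minus cores LSF gave us."""
    return max(total_cores - allocated, 0)


if __name__ == "__main__":
    total = multiprocessing.cpu_count()
    # LSB_DJOB_NUMPROC is set by LSF inside a job; default to 1 outside one.
    allocated = int(os.environ.get("LSB_DJOB_NUMPROC", "1"))
    print(f"--n-free-cpus {n_free_cpus(total, allocated)}")
```

With a 64-core node and a 4-core allocation this gives 60, matching the suggestion above.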
@adamltyson Thanks! The `brainmapper` CLI used to work well on my LSF environment in the past, so I am not sure what has changed with the recent updates. Yesterday, I found a workaround. I noticed that in `detect.py` (line 166) the number of threads is defined by `set_num_threads(max(n_ball_procs - int(ball_z_size), 1))`. I replaced this to manually set the minimum and maximum depending on the number of cores I had requested for my LSF job. Here is what I did:
```python
num_threads = max(n_ball_procs - int(ball_z_size), 1)
num_threads = min(num_threads, 4)  # clamp to the 4 cores requested for this job
set_num_threads(num_threads)
```
This seems to work for now. Is this a simple workaround? I have not tried the suggestions you had mentioned in the post yet.
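As a sketch, the same clamp could avoid hardcoding the limit by reading the LSF allocation from the environment. `clamp_num_threads` is a hypothetical helper name; `LSB_DJOB_NUMPROC` is the variable LSF sets inside a job to the allocated core count:

```python
import os


def clamp_num_threads(n_ball_procs: int, ball_z_size: int) -> int:
    """Clamp the detection thread count to the job's LSF allocation
    instead of a hardcoded maximum; outside an LSF job (variable not
    set), fall back to the unclamped value."""
    num_threads = max(n_ball_procs - int(ball_z_size), 1)
    allocated = os.environ.get("LSB_DJOB_NUMPROC")  # set by LSF inside a job
    if allocated is not None:
        num_threads = min(num_threads, int(allocated))
    return num_threads
```

`set_num_threads(clamp_num_threads(n_ball_procs, ball_z_size))` would then replace the patched lines, using the full node only when run outside an LSF job.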
Also, my self-trained TF2 models that I created in the past do not seem to work with PyTorch, unfortunately :(
> This seems to work for now. Is this a simple workaround? I have not tried the suggestions you had mentioned in the post yet.
Possibly, with the caveats that it may stop working in the future, and I'm not sure if there will be other issues down the line.
> Also, my self-trained TF2 models that I created in the past do not seem to work with PyTorch, unfortunately :(
Weird, this did work in all of our testing. @IgorTatarnikov do you have any ideas?
No ideas off the top of my head. How old are the models? Perhaps we're hitting some sort of double legacy format? Would it be possible to share a model that's not working for testing purposes?
There shouldn't be a "double legacy issue", the models we were testing this on were trained before we released cellfinder!
@Eddymorphling if re-training these models would be a problem for you, share one of the older TF ones and we can take a look and see if we can get it to work.
Hi @adamltyson and @IgorTatarnikov. Sorry for the delay. I think I got the new setup that uses PyTorch instead of TF working in the end on my LSF cluster. I also realised that I would need to make a new model in any case, so I decided to let go of the old TF model for now.
Also, I had to downgrade `numpy` from `2.0.0` to `1.23.4` to get everything running smoothly.
I don't suppose you know what the error was with numpy 2.0.0? I wasn't aware of any incompatibilities, but we should fix this!
Ah unfortunately no :(
No worries. I'll close this issue now, but reach out if you have any other issues.
Hi folks, I decided to set up a new brainglobe-workflow env today. I set up BrainGlobe using Python 3.10, followed all the instructions found here, and also set up PyTorch to use the GPU. Brain registration runs well, but cell candidate detection throws an error (below).
Any clue what might be wrong?