angelolab / Nimbus

nimbus.predict_fovs() seems to be hanging indefinitely #78

Closed whitneyt1 closed 9 months ago

whitneyt1 commented 10 months ago

Please refer to our FAQ and look at our known issues before opening a bug report.

Describe the bug After running all of the above cells with no issues, nimbus.predict_fovs() outputs 0-2 FOVs within a reasonable amount of time, then hangs with no additional output, even when left running overnight.

[Screenshot of notebook output, 2023-11-18 at 8:50 AM]

This was run on the UCSF GPU cluster with high compute power. When run on my local machine, Nimbus output one full predicted FOV and then continued to hang. After interrupting the kernel and reducing the batch size, the normalization_dict from the cell above is overwritten, but 0 FOVs are predicted.

Expected behavior Consistent prediction across all FOVs.

To Reproduce Please either copy/paste or screenshot all of the code you ran which produced the error, and include the full error message. This has occurred both when 400 µm and 800 µm FOVs were mixed in the same tiff_dir and when the tiff_dir contained only 400 µm FOVs.

normalization_dict.json Here is the output normalization_dict; otherwise I have been running the notebook as is!

Thank you!!

JLrumberger commented 9 months ago

Hi Whitney,

How large are the FOVs in terms of pixels? One easy way to speed up computation could be to set test_time_aug=False in the cell:

nimbus = Nimbus(
    fov_paths=fov_paths,
    segmentation_naming_convention=segmentation_naming_convention,
    output_dir=nimbus_output_dir,
    exclude_channels=exclude_channels,
    save_predictions=True,
    batch_size=4,
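    # disabling test-time augmentation speeds up computation, as suggested above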
    test_time_aug=False,
    input_shape=[1024,1024]
)

# check if all inputs are valid
nimbus.check_inputs()

Apart from that, I'll re-run the pipeline a few times with my own datasets and see if I can reproduce the "hanging indefinitely" error. If I can't reproduce it, I can offer to run Nimbus on your data on my device to reproduce the error and send you the Nimbus outputs if you like. Of course, I would handle your data with care regarding security and privacy.

JLrumberger commented 9 months ago

Hi Whitney,

I had a look at your screenshot again and the cell output says Available GPUs: [], so no GPU was detected, and running inference on the CPU indeed takes a while. It could be that your system did not set CUDA_VISIBLE_DEVICES as an environment variable, and thus TensorFlow did not detect the GPU. This variable tells TensorFlow which GPUs to use; if it is not set, TensorFlow may end up using no GPU at all, even though GPUs are available on your HPC node.

You can check whether the variable was set before starting the notebook by typing echo $CUDA_VISIBLE_DEVICES in the shell. If it's not set, set it either via export CUDA_VISIBLE_DEVICES=<gpu_id> in the shell or within Python via:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"

You should set it to the ID of the GPU you get assigned by the HPC scheduler.
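
To confirm TensorFlow actually sees the device after setting the variable, a quick sanity check (a minimal sketch using the standard TensorFlow 2.x API; the Available GPUs printout in the notebook presumably comes from the same call) is:

# set CUDA_VISIBLE_DEVICES before TensorFlow initializes its GPU context,
# then confirm the device is visible:
import tensorflow as tf
print("Available GPUs:", tf.config.list_physical_devices("GPU"))

If this still prints an empty list, the variable was probably set after TensorFlow had already initialized, or the node's CUDA setup needs a look (nvidia-smi in the shell shows what the driver sees). I hope this helps.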

whitneyt1 commented 9 months ago

Thank you so much! I didn't even notice that. I will try that! Thank you!!