HMS-IDAC / UnMicst

UNet script, model, sample data
MIT License

Failed to call cuInit #20

Open · josenimo opened 1 year ago

josenimo commented 1 year ago

Dear Unmicst developers,

I am trying to run MCMICRO on our HPC, which runs Altair Grid Engine. While trying to run exemplar-001 with the default Nextflow config file, I find the following error message in the .command.log of the Nextflow process:

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/compat/v2_compat.py:111: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
sh: 1: nvidia-smi: not found
2023-03-15 13:22:16.469590: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)

I am not really sure why unmicst is looking for a GPU, or calling nvidia-smi, when I have not passed any options. Is this the default? Do I have to pass something to use the CPU? Or do I always have to request a GPU resource from the HPC?

Thank you for your help! Jose

clarenceyapp commented 1 year ago

@josenimo unmicst uses tensorflow-gpu and attempts to put the job on a GPU by default. If no GPU is present, it falls back to the CPU automatically; this should not stop the job from proceeding, but it will take longer to finish. Is the job not finishing?

josenimo commented 1 year ago

@clarenceyapp it does not finish. I am not very familiar with the HPC, but I am guessing that something is making Nextflow/UnMicst think there will be a GPU when there is none. Any hints on what I could look for?

I also tried with "segmentation: mesmer" and it ran into the following:

Error executing process > 'segmentation:worker (mesmer-1)'

Caused by:
  Process `segmentation:worker (mesmer-1)` terminated with an error exit status (2)

Command executed:

  python /usr/src/app/run_app.py mesmer --squeeze --output-directory . --output-name cell.tif --nuclear-image mcmicro5.ome.tif  --mpp 0.23 --compartment whole-cell

Command exit status:
  2

Command output:
  (empty)

Command error:
  INFO:    Environment variable SINGULARITYENV_TMP is set, but APPTAINERENV_TMP is preferred
  INFO:    Environment variable SINGULARITYENV_TMPDIR is set, but APPTAINERENV_TMPDIR is preferred
  INFO:    Environment variable SINGULARITYENV_NXF_DEBUG is set, but APPTAINERENV_NXF_DEBUG is preferred
  2023-03-15 14:54:56.504731: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/.singularity.d/libs
  2023-03-15 14:54:56.504794: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
  usage: run_app.py [-h] {mesmer} ...
  run_app.py: error: unrecognized arguments: --mpp 0.23

Work dir:
  /fast/AG_Coscia/Jose/Nextflow/Work/2a/2e8ea1575892c3c4d5dd7c9d80bca4

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

which also leads me to believe something is wrong with the HPC setup.

clarenceyapp commented 1 year ago

@josenimo can you send me any logs for unmicst? We routinely run unmicst without GPU on our HPC (mostly by accident) and it still finishes.

ArtemSokolov commented 1 year ago

The `sh: 1: nvidia-smi: not found` message appears because that's how @clarenceyapp checks for the presence of GPUs:

https://github.com/HMS-IDAC/UnMicst/blob/203c612550d4b1a2ef652908d99e207cd203ab21/UnMicst1-5.py#L750-L756

and falls back to the CPU when that call fails. So, those messages can generally be ignored.
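
For context, a minimal sketch of that pattern (illustrative only; the linked lines above are the actual implementation):

  import os

  # Probe for a GPU by shelling out to nvidia-smi. When the binary is
  # missing, sh prints "nvidia-smi: not found" (the message in the log
  # above) and os.system returns a non-zero exit status.
  if os.system('nvidia-smi') == 0:
      print('Using GPU')
  else:
      print('Using CPU')  # fall back to the CPU, as UnMicst does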

@clarenceyapp I would suggest doing something like `os.system('nvidia-smi &> /dev/null')` to prevent the output of your check from appearing in the logs, since that's a common source of confusion for UnMicst users.
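
A sketch of such a quieter check (using subprocess here, since os.system invokes sh, where `&>` is not a portable redirect; this is an illustrative alternative, not UnMicst's current code):

  import subprocess

  def gpu_available():
      # Return True if nvidia-smi runs successfully, discarding its
      # output so neither it nor the shell's "not found" error reaches
      # the logs.
      try:
          subprocess.run(['nvidia-smi'],
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL,
                         check=True)
          return True
      except (OSError, subprocess.CalledProcessError):
          return False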

For Mesmer, it looks like there is no such parameter as `--mpp`. Based on their README (https://github.com/vanvalenlab/deepcell-applications), the correct parameter name is `--image-mpp`.
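
With that substitution, the failing command from above would become (all other arguments unchanged):

  python /usr/src/app/run_app.py mesmer --squeeze --output-directory . --output-name cell.tif --nuclear-image mcmicro5.ome.tif --image-mpp 0.23 --compartment whole-cell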

josenimo commented 1 year ago

@clarenceyapp I am not really sure what you mean by logs for unmicst. I looked into the work directory, and the only files with information are .command.run and .command.log.

@ArtemSokolov thank you for your help too. Would there be anything that could prevent the job from falling back to the CPU? I will try to rerun with `--image-mpp`; thank you for catching that :)

ArtemSokolov commented 1 year ago

There should be a .command.log file in the work directory that corresponds to the failed UnMicst process. The work directory is listed near the bottom of the output, e.g.,

Work dir:
  /fast/AG_Coscia/Jose/Nextflow/Work/...full path here...

The .command.log will contain the full log of the UnMicst run and will help @clarenceyapp debug the issue.

josenimo commented 1 year ago

The .command.log contents were pasted in the first post. They don't seem that informative. Here I upload the entire file:

command - Copy.log

Edit 1: changing `--mpp` to `--image-mpp` fixed the problem in the Mesmer run and resulted in a successful MCMICRO run.

clarenceyapp commented 1 year ago

Hi @josenimo, I wanted to check whether I get the same error when I request CPU-only for exemplar-001 on the HMS HPC, and I do. It eventually finishes after 7 minutes and writes the rest of the log messages. Can you confirm that you are letting it run for at least this amount of time before cancelling it? Is there any other indication that the job is failing, as opposed to just taking a long time on your HPC? Thanks.


WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/compat/v2_compat.py:111: disable_resource_variables (from$
Instructions for updating:
non-resource variables are not supported in the long term
sh: 1: nvidia-smi: not found
2023-03-16 00:00:18.171400: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: UNKNOWN ERROR (34)
/app/UnMicst1-5.py:114: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf$
  bn = tf.nn.leaky_relu(tf.layers.batch_normalization(c00+shortcut, training=UNet2D.tfTraining))
/usr/local/lib/python3.8/dist-packages/keras/legacy_tf_layers/normalization.py:455: UserWarning: `layer.apply` is deprecated and will be r$
  return layer.apply(inputs, training=training)
/app/UnMicst1-5.py:136: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf$
  lbn = tf.nn.leaky_relu(tf.layers.batch_normalization(
/app/UnMicst1-5.py:139: UserWarning: `tf.layers.dropout` is deprecated and will be removed in a future version. Please use `tf.keras.layer$
  return tf.layers.dropout(lbn, 0.35, training=UNet2D.tfTraining)
/usr/local/lib/python3.8/dist-packages/keras/legacy_tf_layers/core.py:401: UserWarning: `layer.apply` is deprecated and will be removed in$
  return layer.apply(inputs, training=training)
/app/UnMicst1-5.py:199: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf$
  tf.layers.batch_normalization(tf.nn.conv2d(cc, luXWeights2, strides=[1, 1, 1, 1], padding='SAME'),
/app/UnMicst1-5.py:220: UserWarning: `tf.layers.batch_normalization` is deprecated and will be removed in a future version. Please use `tf$
  return tf.layers.batch_normalization(
Using CPU
loading data
loading data
loading data
0.34
0.25
Model restored.
Using channel 1
Inference...
Inference...
Inference...

WARNING! USING unmicst-solo AS DEFAULT. THIS MODEL HAS BEEN TRAINED ON MORE TISSUE TYPES. IF YOU WANT THE LEGACY MODEL, USE --tool unmicst$

python /app/UnMicst1-5.py  exemplar-001.ome.tif --channel 0 --outputPath . --mean -1 --std -1 --scalingFactor 1 --GPU -1 --outlier -1 --st$