Noble-Lab / casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model
https://casanovo.readthedocs.io
Apache License 2.0

TPU warnings when running on Colab #259

Closed · bittremieux closed this issue 4 months ago

bittremieux commented 1 year ago

As reported via email, running Casanovo on a TPU-enabled Colab instance prints several warnings:

2023-10-26 10:04:27.276313: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-10-26 10:04:27.326604: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-26 10:04:27.326651: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-26 10:04:27.326685: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-26 10:04:27.335749: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

These seem harmless, but we could try to prevent them from being printed.
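
One possible workaround (a sketch, not tested on Colab) would be to silence TensorFlow's C++ logging on the notebook side, since the messages above come from the preinstalled TensorFlow rather than from Casanovo itself. The environment variables only take effect if they are set before TensorFlow is first imported:

# Sketch of a Colab-side workaround; TF_CPP_MIN_LOG_LEVEL and TF_ENABLE_ONEDNN_OPTS
# are TensorFlow environment variables, not Casanovo settings.
import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"   # hide INFO/WARNING/ERROR from TF's C++ layer
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "0"  # silence the oneDNN round-off notice

# Environment variables set in the notebook process are inherited by shell commands
# started from it, e.g. a subsequent `!casanovo sequence ...` cell.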

Lilferrit commented 4 months ago

Unfortunately, when I tried to replicate this in a TPU runtime environment, I got this error instead:

2024-06-27 00:27:15.702704: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:95] Opening library: /usr/local/lib/python3.10/dist-packages/tensorflow/python/platform/../../libtensorflow_cc.so.2
2024-06-27 00:27:15.702909: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:119] Libtpu path is: libtpu.so
2024-06-27 00:27:15.759569: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Seed set to 454
INFO: Casanovo version 4.2.1.dev1+gc6a455b.d20240627
INFO: Sequencing peptides from:
INFO:   sample_data/sample_preprocessed_spectra.mgf
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/message.cc:258] File is already registered: xla/service/cpu/backend_config.proto
terminate called after throwing an instance of 'google::protobuf::FatalException'
  what():  File is already registered: xla/service/cpu/backend_config.proto
https://symbolize.stripped_domain/r/?trace=7f16bb6f59fc,7f16bb6a151f&map= 
*** SIGABRT received by PID 1782 (TID 1782) on cpu 24 from PID 1782; stack trace: ***
PC: @     0x7f16bb6f59fc  (unknown)  pthread_kill
    @     0x7f15c82214f9        928  (unknown)
    @     0x7f16bb6a1520  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f16bb6f59fc,7f15c82214f8,7f16bb6a151f&map=5edeb7d86db111100e979a74159a3982:7f15b8600000-7f15c8440ba0 
E0627 00:27:21.381225    1782 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0627 00:27:21.381246    1782 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0627 00:27:21.381253    1782 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0627 00:27:21.381276    1782 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0627 00:27:21.381284    1782 coredump_hook.cc:603] RAW: Dumping core locally.
E0627 00:27:23.920542    1782 process_state.cc:808] RAW: Raising signal 6 with default behavior

This crash also occurred after explicitly installing the PyTorch release that supports TPUs. Here is the notebook I used to try to reproduce the log entries: https://colab.research.google.com/drive/1zFZ248QPRT5ddXEOC2LBwUronJOWbMAE?usp=sharing

bittremieux commented 4 months ago

I couldn't get a TPU instance, but the warnings also appear when running on a CPU or GPU Colab instance, so this is probably related to Colab rather than to TPUs. You could briefly look into it, but I don't think it's worth spending a lot of time on this.

Lilferrit commented 4 months ago

Another issue: through some experimenting, it looks like the TensorFlow warnings are logged before the Casanovo module is even loaded, so filtering them out would be more trouble than it's worth.
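
For reference, a sketch of what the in-package option would look like, assuming the messages come from TensorFlow's C++ logging: set the environment variables at the very top of Casanovo's entry point, before any import that could transitively load TensorFlow. Since the messages show up before the Casanovo module is even loaded on Colab, this would not actually help there, which is why filtering doesn't seem worthwhile.

# Hypothetical placement at the very top of Casanovo's CLI entry module, before any
# imports that might transitively load TensorFlow. If TensorFlow has already been
# imported by the hosting environment (as observed on Colab), this has no effect.
import os

os.environ.setdefault("TF_CPP_MIN_LOG_LEVEL", "2")   # hide TF INFO messages
os.environ.setdefault("TF_ENABLE_ONEDNN_OPTS", "0")  # silence the oneDNN notice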