Noble-Lab / casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model
https://casanovo.readthedocs.io
Apache License 2.0

Example Job Unsuccessful #90

Closed: kostrouc closed this issue 2 years ago

kostrouc commented 2 years ago

Hello, I am unable to get the example "casanovo --mode=denovo --peak_path=[PATH_TO]/sample_preprocessed_spectra.mgf" to run successfully. I am not sure what needs to be changed. Any suggestions would be appreciated.

"casanovo --help" returns the correct information. The sample_preprocessed_spectra.mgf and config.yaml files were saved. I'm not sure what else needs to be set up for denovo to function properly.

Thank you

error.txt

bittremieux commented 2 years ago

Hmm, somehow the default config file packaged with Casanovo might be corrupted. Can you share this file here: /Users/myname/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/casanovo/config.yaml?

kostrouc commented 2 years ago

It won't let me upload .yaml here. Here are the file contents:

(casanovo_env) myname@myip Downloads % cat config.yaml /###

Casanovo configuration.

Blank entries are interpreted as "None"

Random seed to ensure reproducible results.

random_seed: 454

Spectrum processing options.

n_peaks: 150 min_mz: 50.0 max_mz: 2500.0 min_intensity: 0.01 remove_precursor_tol: 2.0 # Da max_charge: 10 precursor_mass_tol: 50 # ppm isotope_error_range: [0, 1]

Model architecture options.

dim_model: 512 n_head: 8 dim_feedforward: 1024 n_layers: 9 dropout: 0.0 dim_intensity: custom_encoder: max_length: 100 residues: "G": 57.021464 "A": 71.037114 "S": 87.032028 "P": 97.052764 "V": 99.068414 "T": 101.047670 "C+57.021": 160.030649 # 103.009185 + 57.021464 "L": 113.084064 "I": 113.084064 "N": 114.042927 "D": 115.026943 "Q": 128.058578 "K": 128.094963 "E": 129.042593 "M": 131.040485 "H": 137.058912 "F": 147.068414 "R": 156.101111 "Y": 163.063329 "W": 186.079313

Amino acid modifications.

"M+15.995": 147.035400 # Met oxidation: 131.040485 + 15.994915 "N+0.984": 115.026943 # Asn deamidation: 114.042927 + 0.984016 "Q+0.984": 129.042594 # Gln deamidation: 128.058578 + 0.984016

N-terminal modifications.

"+42.011": 42.010565 # Acetylation "+43.006": 43.005814 # Carbamylation "-17.027": -17.026549 # NH3 loss "+43.006-17.027": 25.980265 n_log: 1 tb_summarywriter: warmup_iters: 100_000 max_iters: 600_000 learning_rate: 5e-4 weight_decay: 1e-5

Training/inference options.

train_batch_size: 32 predict_batch_size: 1024

logger: max_epochs: 30 num_sanity_val_steps: 0

train_from_scratch: True

save_model: True model_save_folder_path: "" save_weights_only: True every_n_train_steps: 50_000

bittremieux commented 2 years ago

Can you put the YAML content in a code block (three backticks) so I can see the exact formatting?

kostrouc commented 2 years ago

/###
# Casanovo configuration.
# Blank entries are interpreted as "None"
###

# Random seed to ensure reproducible results.
random_seed: 454

# Spectrum processing options.
n_peaks: 150
min_mz: 50.0
max_mz: 2500.0
min_intensity: 0.01
remove_precursor_tol: 2.0  # Da
max_charge: 10
precursor_mass_tol: 50  # ppm
isotope_error_range: [0, 1]

# Model architecture options.
dim_model: 512
n_head: 8
dim_feedforward: 1024
n_layers: 9
dropout: 0.0
dim_intensity:
custom_encoder:
max_length: 100
residues:
  "G": 57.021464
  "A": 71.037114
  "S": 87.032028
  "P": 97.052764
  "V": 99.068414
  "T": 101.047670
  "C+57.021": 160.030649 # 103.009185 + 57.021464
  "L": 113.084064
  "I": 113.084064
  "N": 114.042927
  "D": 115.026943
  "Q": 128.058578
  "K": 128.094963
  "E": 129.042593
  "M": 131.040485
  "H": 137.058912
  "F": 147.068414
  "R": 156.101111
  "Y": 163.063329
  "W": 186.079313
  # Amino acid modifications.
  "M+15.995": 147.035400    # Met oxidation:   131.040485 + 15.994915
  "N+0.984": 115.026943     # Asn deamidation: 114.042927 +  0.984016
  "Q+0.984": 129.042594     # Gln deamidation: 128.058578 +  0.984016
  # N-terminal modifications.
  "+42.011": 42.010565      # Acetylation
  "+43.006": 43.005814      # Carbamylation
  "-17.027": -17.026549     # NH3 loss
  "+43.006-17.027": 25.980265
n_log: 1
tb_summarywriter:
warmup_iters: 100_000
max_iters: 600_000
learning_rate: 5e-4
weight_decay: 1e-5

# Training/inference options.
train_batch_size: 32
predict_batch_size: 1024

logger:
max_epochs: 30
num_sanity_val_steps: 0

train_from_scratch: True

save_model: True
model_save_folder_path: ""
save_weights_only: True
every_n_train_steps: 50_000

bittremieux commented 2 years ago

I think the problem is the leading forward slash / on the first line, which makes the file invalid YAML. Did you modify this file yourself? The default config.yaml that's packaged with Casanovo should not include the /.
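
A stray character like this can be caught before running Casanovo. The snippet below is a minimal sketch, not part of Casanovo: `find_suspect_lines` is a hypothetical helper that uses only the standard library and assumes a flat config in which every top-level line is blank, a comment, or a `key: value` entry, as in the file shown above. It is a rough heuristic rather than a YAML parser, so a full YAML library remains the reliable check.

```python
import re

# Hypothetical helper: a rough line-level sanity check, NOT a YAML parser.
# Assumption: every top-level line is blank, a comment, or "key: value".
TOP_LEVEL_KEY = re.compile(r'^("[^"]*"|[A-Za-z_][\w.+-]*)\s*:(\s|$)')


def find_suspect_lines(text):
    """Return (line_number, line) pairs for top-level lines that look wrong."""
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are fine
        if line[0] in " \t":
            continue  # indented lines belong to a nested block; not checked here
        if not TOP_LEVEL_KEY.match(line):
            suspects.append((number, line))
    return suspects


sample = """/###
# Casanovo configuration.
###

random_seed: 454
residues:
  "G": 57.021464
"""

print(find_suspect_lines(sample))  # prints [(1, '/###')]
```

Only the first line is flagged here, which matches the diagnosis above: deleting the leading / turns that line back into an ordinary comment.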

kostrouc commented 2 years ago

I copied the config.yaml text from the folder here on GitHub. I'm not sure why the / was added. After I removed it, I now get an error about a missing cpu_affinity attribute.

(casanovo_env) katherineostrouchov@myip casanovo % casanovo --mode=denovo --peak_path=sample_preprocessed_spectra.mgf
Traceback (most recent call last):
  File "/Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/casanovo/casanovo.py", line 166, in main
    config["n_workers"] = len(psutil.Process().cpu_affinity())
AttributeError: 'Process' object has no attribute 'cpu_affinity'

bittremieux commented 2 years ago

This was indeed an issue on macOS that was recently fixed, but the fix is not in the PyPI release yet. To get the latest functionality, I recommend installing from GitHub for the time being:

pip uninstall casanovo && pip install git+https://github.com/Noble-Lab/casanovo.git
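
For context on the traceback above: the CPU-affinity call fails because that API is not available on macOS. The sketch below shows one general way to guard such a call. It is an illustration under assumptions, not the actual fix in Casanovo: `available_cpus` is a hypothetical helper, and it uses the standard library's `os.sched_getaffinity` as an analogue of the psutil call in the traceback.

```python
import os


def available_cpus():
    """Best-effort count of CPUs this process may use (hypothetical helper)."""
    # os.sched_getaffinity exists on Linux but not on macOS, so check first.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fall back to the total CPU count; os.cpu_count() may return None.
    return os.cpu_count() or 1


print(available_cpus())
```

Checking for the attribute before calling it avoids the AttributeError on platforms that lack an affinity API.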

kostrouc commented 2 years ago

The script ran successfully this time. However, a warning was raised about num_workers in the PyTorch DataLoader init. I'm not sure where this script is located or how to update it.

(casanovo_env) katherineostrouchov@myip casanovo % export PYTORCH_ENABLE_MPS_FALLBACK=1
(casanovo_env) katherineostrouchov@myip casanovo % casanovo --mode=denovo --peak_path=sample_preprocessed_spectra.mgf
2022-11-03 13:39:43,864 WARNING [py.warnings/MainProcess] warnings._showwarnmsg : /Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/pytorch_lightning/utilities/seed.py:48: LightningDeprecationWarning: `pytorch_lightning.utilities.seed.seed_everything` has been deprecated in v1.8.0 and will be removed in v1.10.0. Please use `lightning_lite.utilities.seed.seed_everything` instead.
  rank_zero_deprecation(

Global seed set to 454
2022-11-03 13:39:43,869 INFO [casanovo/MainProcess] casanovo._get_model_weights : Model weights file /Users/katherineostrouchov/Library/Caches/casanovo/casanovo_massivekb_v3_0_0.ckpt retrieved from local cache
2022-11-03 13:39:43,870 INFO [casanovo/MainProcess] casanovo.main : Casanovo version 3.0.1.dev4+gf3696ca
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : mode = denovo
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : model = /Users/katherineostrouchov/Library/Caches/casanovo/casanovo_massivekb_v3_0_0.ckpt
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : peak_path = sample_preprocessed_spectra.mgf
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : peak_path_val = None
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : config = /Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/casanovo/config.yaml
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : output = /Users/katherineostrouchov/Library/CloudStorage/OneDrive-UniversityofTennessee/IDEXX/casanovo/casanovo_20221103133943
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : random_seed = 454
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : n_peaks = 150
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : min_mz = 50.0
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : max_mz = 2500.0
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : min_intensity = 0.01
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : remove_precursor_tol = 2.0
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : max_charge = 10
2022-11-03 13:39:43,870 DEBUG [casanovo/MainProcess] casanovo.main : precursor_mass_tol = 50.0
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : isotope_error_range = (0, 1)
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : dim_model = 512
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : n_head = 8
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : dim_feedforward = 1024
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : n_layers = 9
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : dropout = 0.0
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : dim_intensity = None
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : custom_encoder = None
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : max_length = 100
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : residues = {'G': 57.021464, 'A': 71.037114, 'S': 87.032028, 'P': 97.052764, 'V': 99.068414, 'T': 101.04767, 'C+57.021': 160.030649, 'L': 113.084064, 'I': 113.084064, 'N': 114.042927, 'D': 115.026943, 'Q': 128.058578, 'K': 128.094963, 'E': 129.042593, 'M': 131.040485, 'H': 137.058912, 'F': 147.068414, 'R': 156.101111, 'Y': 163.063329, 'W': 186.079313, 'M+15.995': 147.0354, 'N+0.984': 115.026943, 'Q+0.984': 129.042594, '+42.011': 42.010565, '+43.006': 43.005814, '-17.027': -17.026549, '+43.006-17.027': 25.980265}
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : n_log = 1
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : tb_summarywriter = None
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : warmup_iters = 100000
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : max_iters = 600000
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : learning_rate = 0.0005
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : weight_decay = 1e-05
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : train_batch_size = 32
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : predict_batch_size = 1024
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : logger = None
2022-11-03 13:39:43,871 DEBUG [casanovo/MainProcess] casanovo.main : max_epochs = 30
2022-11-03 13:39:43,872 DEBUG [casanovo/MainProcess] casanovo.main : num_sanity_val_steps = 0
2022-11-03 13:39:43,872 DEBUG [casanovo/MainProcess] casanovo.main : train_from_scratch = True
2022-11-03 13:39:43,872 DEBUG [casanovo/MainProcess] casanovo.main : save_model = True
2022-11-03 13:39:43,872 DEBUG [casanovo/MainProcess] casanovo.main : model_save_folder_path = 
2022-11-03 13:39:43,872 DEBUG [casanovo/MainProcess] casanovo.main : save_weights_only = True
2022-11-03 13:39:43,872 DEBUG [casanovo/MainProcess] casanovo.main : every_n_train_steps = 50000
2022-11-03 13:39:43,872 DEBUG [casanovo/MainProcess] casanovo.main : n_workers = 0
2022-11-03 13:39:43,872 INFO [casanovo/MainProcess] casanovo.main : Predict peptide sequences with Casanovo.
2022-11-03 13:39:43,996 DEBUG [fsspec.local/MainProcess] local.__init__ : open file: /Users/katherineostrouchov/Library/Caches/casanovo/casanovo_massivekb_v3_0_0.ckpt
2022-11-03 13:39:44,324 INFO [depthcharge.data.hdf5/MainProcess] hdf5.__init__ : Reading 1 files...
sample_preprocessed_spectra.mgf: 128spectra [00:00, 3760.92spectra/s]
2022-11-03 13:39:44,473 WARNING [py.warnings/MainProcess] warnings._showwarnmsg : /Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, predict_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(

Predicting DataLoader 0:   0%|                       | 0/1 [00:00<?, ?it/s]2022-11-03 13:39:52,672 WARNING [py.warnings/MainProcess] warnings._showwarnmsg : /Users/katherineostrouchov/opt/anaconda3/envs/casanovo_env/lib/python3.8/site-packages/torch/nn/modules/transformer.py:276: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/NestedTensorImpl.cpp:177.)
  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)

Predicting DataLoader 0: 100%|██████████████| 1/1 [06:00<00:00, 360.20s/it]

bittremieux commented 2 years ago

I'm glad you got it to work!

There are indeed a few warnings, but these can be ignored because they don't affect the correctness of Casanovo's results. In particular, on macOS we are restricted to a single thread for the data loader due to incompatibilities between Apple's M1 chip and multiprocessing. Consequently, Casanovo might run a bit slower, but it will still work correctly.
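
The platform-dependent worker count can be sketched as follows. This is an illustration only, not Casanovo's code: `choose_num_workers` is a hypothetical helper, and the zero-worker choice on macOS mirrors the `n_workers = 0` line in the log above. In PyTorch, a `DataLoader` with `num_workers=0` loads data in the main process, which is what triggers the bottleneck warning.

```python
import platform


def choose_num_workers(requested, system=None):
    """Pick a DataLoader worker count (hypothetical helper, not Casanovo's code)."""
    system = system or platform.system()
    # platform.system() reports "Darwin" on macOS; load data in the main process there.
    if system == "Darwin":
        return 0
    return max(0, requested)


print(choose_num_workers(8, system="Darwin"))  # prints 0
print(choose_num_workers(8, system="Linux"))   # prints 8
```

The returned value would then be passed as the `num_workers` argument when constructing the data loader.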

In general, for the best performance, we recommend running on Linux and using a GPU (or multiple GPUs).