Noble-Lab / casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model
https://casanovo.readthedocs.io
Apache License 2.0

KeyError: '$' #284

Open patrick-willems opened 5 months ago

patrick-willems commented 5 months ago

Hey,

I am very eager to try out the newest Casanovo version on some of our data. While the test MGF file works just fine, I keep running into the same error on our Orbitrap data (after conversion to MGF and mzML). It seems like it's looking up an amino acid mass for a '$' character, which doesn't make sense to me.

Predicting DataLoader 0:   2%|▏         | 1/44 [25:28<18:15:27,  0.00it/s]Traceback (most recent call last):
  File "/home/prc/anaconda3/envs/casanovo_env/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/casanovo/casanovo.py", line 142, in sequence
    runner.predict(peak_path, output)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/casanovo/denovo/model_runner.py", line 163, in predict
    self.trainer.predict(self.model, self.loaders.test_dataloader())
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 864, in predict
    return call._call_and_handle_interrupt(
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 903, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
    return self.predict_loop.run()
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/loops/prediction_loop.py", line 122, in run
    self._predict_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/loops/prediction_loop.py", line 264, in _predict_step
    call._call_lightning_module_hook(trainer, "on_predict_batch_end", predictions, *hook_kwargs.values())
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/casanovo/denovo/model.py", line 902, in on_predict_batch_end
    self.peptide_mass_calculator.mass(peptide, charge),
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/depthcharge/masses.py", line 95, in mass
    calc_mass = sum([self.masses[aa] for aa in seq]) + self.h2o
  File "/home/prc/anaconda3/envs/casanovo_env/lib/python3.10/site-packages/depthcharge/masses.py", line 95, in <listcomp>
    calc_mass = sum([self.masses[aa] for aa in seq]) + self.h2o
KeyError: '$'

Thanks for the feedback! Patrick

wsnoble commented 5 months ago

Hi Patrick,

Can you send a sample input file and command line that led to this error? Also, the output of casanovo version as well as the operating system you're using.

Thanks. Bill

patrick-willems commented 5 months ago

Hey,

Thanks for the fast reply.

The version info is:

Casanovo: 4.0.1
Depthcharge: 0.2.3
Lightning: 2.1.3
PyTorch: 2.1.2+cu121

The operating system is Ubuntu 20.04.1 (running on CPU).

The 50 MB compressed MGF can be downloaded via this link

Command line (using the recently uploaded non-tryptic model):

casanovo sequence -v info --model models/casanovo_nontryptic.ckpt <MGF file>

Thanks!

melihyilmaz commented 5 months ago

Hi Patrick,

We were able to reproduce the bug on a CPU-only Linux machine, but not when running on the GPU. While we work on debugging the issue, you could try running it on this Google Colab notebook - I successfully ran your command there, and you can see the outputs. Just make sure your runtime type is GPU or TPU if you want to rerun it.

wsnoble commented 5 months ago

I am also able to reproduce this on an M2 Mac. It occurs 7% of the way through, on batch 3 out of 44.

It would be nice if we could isolate which specific spectrum is causing the problem.
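
One rough way to narrow it down - a sketch only, assuming pyteomics is installed; the input filename and chunk size are placeholders - would be to split the MGF into smaller files and rerun Casanovo on each chunk:

from pyteomics import mgf

# Split an MGF file into fixed-size chunks so the failing spectrum can be
# narrowed down by rerunning casanovo on each chunk separately.
chunk_size = 100  # arbitrary; shrink it to bisect further
spectra = list(mgf.read("input.mgf"))  # placeholder filename

for start in range(0, len(spectra), chunk_size):
    chunk = spectra[start : start + chunk_size]
    mgf.write(chunk, output=f"chunk_{start // chunk_size:04d}.mgf")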

wsnoble commented 5 months ago

FWIW, here is a smaller example that reproduces the bug on my machine.

casanovo.yaml.txt small.mgf.txt

Predicting DataLoader 0:  50%|██████████          | 1/2 [00:31<00:31,  0.03it/s]Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/casanovo/casanovo.py", line 142, in sequence
    runner.predict(peak_path, output)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/casanovo/denovo/model_runner.py", line 163, in predict
    self.trainer.predict(self.model, self.loaders.test_dataloader())
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 864, in predict
    return call._call_and_handle_interrupt(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 903, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
    return self.predict_loop.run()
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/loops/prediction_loop.py", line 122, in run
    self._predict_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/loops/prediction_loop.py", line 264, in _predict_step
    call._call_lightning_module_hook(trainer, "on_predict_batch_end", predictions, *hook_kwargs.values())
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/casanovo/denovo/model.py", line 902, in on_predict_batch_end
    self.peptide_mass_calculator.mass(peptide, charge),
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/depthcharge/masses.py", line 95, in mass
    calc_mass = sum([self.masses[aa] for aa in seq]) + self.h2o
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/depthcharge/masses.py", line 95, in <listcomp>
    calc_mass = sum([self.masses[aa] for aa in seq]) + self.h2o
KeyError: '$'
Predicting DataLoader 0:  50%|█████     | 1/2 [00:31<00:31,  0.03it/s]

melihyilmaz commented 4 months ago

I have identified the root cause of the issue: the torch.topk() call here behaves differently on GPU and CPU tensors. When you provide a tensor of zeros to torch.topk(), it returns index 0, i.e. the first index of the maximum value, for a GPU tensor, but an arbitrary index for a CPU tensor (see the snippet below). The latter causes decoding to continue after the stop token is predicted.

import torch

# A vector of 29 zeros, mimicking the all-zero score vector of a finished beam.
t11 = torch.tensor(
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
         0.0000, 0.0000])

# On CPU, topk returns an arbitrary index among the tied maxima.
print(torch.topk(t11, 1))
> torch.return_types.topk(
values=tensor([0.]),
indices=tensor([20]))

# On GPU, topk returns index 0.
print(torch.topk(t11.cuda(), 1))
> torch.return_types.topk(
values=tensor([0.], device='cuda:0'),
indices=tensor([0], device='cuda:0'))

I couldn't find a simple way to fix the behavior on CPU tensors (it seems that topk isn't using a stable sorting algorithm), so we can either add a small value like 1e-8 to the zeroth index of the score tensor to make sure it's the maximum, or get rid of torch.topk() and figure out the top-k decoding in another way. @bittremieux what do you think?
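
To illustrate the first option, here is a minimal sketch (not the actual Casanovo code) of how nudging index zero makes the tie-break deterministic on both CPU and GPU:

import torch

# All-zero score vector, as produced for a beam that has already finished.
scores = torch.zeros(29)

# Adding a tiny epsilon at index 0 makes it the unique maximum, so
# torch.topk() returns index 0 consistently instead of an arbitrary
# index among the tied maxima.
scores[0] += 1e-8
values, indices = torch.topk(scores, k=1)
print(indices)  # tensor([0])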

wsnoble commented 4 months ago

Glad we got this sorted out, and I hope we can figure out an appropriate fix. But shouldn't we also submit an issue to the torch team?

melihyilmaz commented 4 months ago

I think a Torch PR to fix this (by introducing a stable sorting algorithm to get consistent indices) went stale a year or so ago, and generally I'm under the impression that the arbitrariness of the returned indices when there are ties is the intended default. Maybe this means we should also move away from torch.topk().

bittremieux commented 4 months ago

Thanks Melih. It does seem that topk might also be unstable on the GPU and we just haven't encountered it yet (tldr from that issue: topk gives different results between GPU and CPU, but also within architectures when using different values for k). And this is not considered a priority issue. So the best solution might be to try to move away from torch.topk rather than try to work around it. How to do that without incurring a performance hit is less obvious.

wsnoble commented 4 months ago

So topk here is being used to select the top k beams? What is the scenario in which we provide it with a tensor of zeros? It seems like, conceptually, if they really are all equally good then randomly selecting one should be OK. Furthermore, Melih's solution of adding a small random number would presumably not fix the bug, since the behavior that works properly is to always select index zero. It sounds to me like we have a problem in how we're populating the input to the topk function.

melihyilmaz commented 4 months ago

Yes, and a tensor of zeros corresponds to a finished beam: when we're picking the next token for our beams, we mask out the already finished ones. For those finished beams, topk needs to consistently pick index zero, which corresponds to the dummy token rather than an amino acid or stop token.

In that sense, I think modifying the finished_mask tensor to add a small constant like 1e-8 at index zero could be an easy way of ensuring we consistently pick the dummy token for finished beams, regardless of GPU/CPU usage, without incurring a performance hit.

An alternative is to use topk as is and modify its output for any finished beams based on finished_beams.
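
For reference, here is a rough sketch of that alternative; the tensor names and shapes are made up for illustration and are not taken from the Casanovo source:

import torch

# Hypothetical example: per-beam token scores plus a mask marking beams
# that have already emitted the stop token.
n_beams, vocab_size = 4, 29
scores = torch.rand(n_beams, vocab_size)
finished_beams = torch.tensor([False, True, False, True])

# Take the top-1 token per beam as usual...
_, top_idx = torch.topk(scores, k=1, dim=1)

# ...then override the result for finished beams so they always select
# index 0 (the dummy token), regardless of how topk broke the ties.
top_idx[finished_beams] = 0
print(top_idx.squeeze(1))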

wsnoble commented 4 months ago

Well, without diving into the code myself, I don't totally follow your explanation, but Melih, please go ahead and implement one of your two proposed fixes. We'll need to do some regression testing to be sure that the results haven't changed, as well as some timing tests to make sure it doesn't slow us down too much.

melihyilmaz commented 4 months ago

I opened a draft PR implementing the first solution. @bittremieux, let me know what you think, and we can take a different approach if needed.

bittremieux commented 4 months ago

Fixed in #290.

wsnoble commented 4 months ago

Alas, I am still seeing this bug, even though I'm using Casanovo v4.1.0. The command line is

casanovo sequence --config casanovo.yaml delPBP1_DB_NaN3_3.mgf

I'm attaching the config file. The MGF is from the 9-species benchmark (yeast).

Here is the output:

Seed set to 454
INFO: Model weights file /Users/wnoble/Library/Caches/casanovo/casanovo_massivekb_v4_0_0.ckpt retrieved from local cache
INFO: Casanovo version 4.1.0
INFO: Sequencing peptides from:
INFO:   ../delPBP1_DB_NaN3_3.mgf
GPU available: True (mps), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
INFO: Reading 1 files...
../delPBP1_DB_NaN3_3.mgf: 55124spectra [00:14, 3717.39spectra/s]
Predicting DataLoader 0:   0%|                          | 0/431 [00:00<?, ?it/s]WARNING: UserWarning: The operator 'aten::_nested_tensor_from_mask_left_aligned' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)
WARNING: UserWarning: The operator 'aten::_nested_tensor_from_mask_left_aligned' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)
Predicting DataLoader 0:   0%|                | 1/431 [01:01<7:20:32,  0.02it/s]Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/bin/casanovo", line 8, in <module>
    sys.exit(main())
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/rich_click/rich_command.py", line 126, in main
    rv = self.invoke(ctx)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/casanovo/casanovo.py", line 143, in sequence
    runner.predict(peak_path, output)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/casanovo/denovo/model_runner.py", line 164, in predict
    self.trainer.predict(self.model, self.loaders.test_dataloader())
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 864, in predict
    return call._call_and_handle_interrupt(
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 903, in _predict_impl
    results = self._run(model, ckpt_path=ckpt_path)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
    return self.predict_loop.run()
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/loops/prediction_loop.py", line 122, in run
    self._predict_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/loops/prediction_loop.py", line 264, in _predict_step
    call._call_lightning_module_hook(trainer, "on_predict_batch_end", predictions, *hook_kwargs.values())
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/casanovo/denovo/model.py", line 899, in on_predict_batch_end
    self.peptide_mass_calculator.mass(peptide, charge),
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/depthcharge/masses.py", line 95, in mass
    calc_mass = sum([self.masses[aa] for aa in seq]) + self.h2o
  File "/opt/homebrew/Caskroom/miniconda/base/envs/casanovo/lib/python3.10/site-packages/depthcharge/masses.py", line 95, in <listcomp>
    calc_mass = sum([self.masses[aa] for aa in seq]) + self.h2o
KeyError: '$'
Predicting DataLoader 0:   0%|          | 1/431 [01:01<7:23:08,  0.02it/s]   

casanovo.yaml.txt

I'm running this on a Mac with an M2 chip.

wsnoble commented 4 months ago

FYI, I tried this same command on a Linux box and did not run into any problems. So it's possible the problem only arises on a Mac.

melihyilmaz commented 4 months ago

I was able to run Casanovo without any issues on a CPU-only Linux lab computer and a Colab instance, so it's likely a Mac-specific problem.