fteufel / signalp-6.0

Multi-class signal peptide prediction and structure decoding model.
https://services.healthtech.dtu.dk/service.php?SignalP-6.0

Still having issues with `resolve_viterbi_marginal_conflicts` #3

Closed darcyabjones closed 2 years ago

darcyabjones commented 2 years ago

Hi there!

Sorry to bother you again. I'm still running into issues with the decoding step.

Running this sequence with SignalP 6.0e raises an error:

>P000004B9
MAFRLFAGITGRQLLAGGAALGGTGLAGSLIQTESERLQATEAQVQFHTSSIHPTPVGFS
PWQIRNDYPTSDILKARLKAQKDDSLPNAPSPLIPAPGLPGDFEGENAPWFKYDYEKEPE
KFAEAIREYCFDGNVDKGFRLNENKIRDWYHAPWMHYRDPNSMCTEREPINGFTFERATP
AGEFAKTQNVTLQNWAIGFYNATGATVFGDMWKDPDNPDFSQNKEFPVGTCVFKILLNNS
TPEQMPIQDGAPTMHAVISKSTSNGKERNDFASPLRLIQVDFAVVDKRSPIGWVFGTFMY
NKDQPGKGPWDRLTLVGLQWGNDHWLTNQVYDETKAEGRVAKPRECYIHKKAEDIRKREG
GTRPSWGWNGRMNGPADNFISACASCHSTSTSHPMYNGKVKDGVKQTYGMVPPLNMKPLP
PQPKEGNTFSDVMIYFRNVMGGVPFDEGVNPNNPDEYDPTYKSKVKSADYSLQLQVGWAN
YKKWKEDHETVLQSIFRKTRYVIGSELAGASDLSQRDQGRQEPTDDGPVE
$signalp6  --fastafile "${1}" --output_dir "${TMPDIR}" --format none --organism eukarya --mode fast  --bsize "32" --write_procs 1
Predicting: 100%|██████████| 1/1 [00:04<00:00,  4.87s/sequences]
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/predector/bin/signalp6", line 8, in <module>
    sys.exit(predict())
  File "/home/ubuntu/miniconda3/envs/predector/lib/python3.6/site-packages/signalp/__init__.py", line 6, in predict
    main()
  File "/home/ubuntu/miniconda3/envs/predector/lib/python3.6/site-packages/signalp/predict.py", line 239, in main
    resolve_viterbi_marginal_conflicts(global_probs, marginal_probs, cleavage_sites, viterbi_paths)
  File "/home/ubuntu/miniconda3/envs/predector/lib/python3.6/site-packages/signalp/utils.py", line 311, in resolve_viterbi_marginal_conflicts
    cleavage_sites[i] = sp_idx.max() +1
  File "/home/ubuntu/miniconda3/envs/predector/lib/python3.6/site-packages/numpy/core/_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity

I haven't had a huge amount of time to debug it (or decipher how it all works), but it seems that the marginal probabilities in `type_marginal_probs` all assign it to the PAD token, so you end up with a zero-length array at `np.where(np.isin(marginal_region_preds, [5, 10, 19, 25, 31]))[0]`.
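For anyone following along, the failure mode described here can be reproduced in isolation. This is a minimal sketch rather than the SignalP code: the label ids are copied from the expression quoted above, while the helper name `sp_positions` and the use of `0` as a stand-in for the PAD id are assumptions made for illustration.

```python
import numpy as np

# Label ids copied from the expression quoted in the comment above.
SP_LABELS = [5, 10, 19, 25, 31]

def sp_positions(marginal_region_preds):
    """Hypothetical helper: indices whose label is in SP_LABELS (may be empty)."""
    return np.where(np.isin(marginal_region_preds, SP_LABELS))[0]

# Toy prediction in which no position carries one of the listed labels.
# Using 0 as the PAD id is an assumption for this sketch.
all_pad = np.zeros(20, dtype=int)
sp_idx = sp_positions(all_pad)
print(sp_idx.size)  # 0

try:
    sp_idx.max() + 1  # same reduction as in the traceback
except ValueError as err:
    print(err)
```

The `ValueError` comes from NumPy itself: `max()` has no identity value to return for an empty array, so any caller has to check `sp_idx.size` first or handle the exception.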

I wonder if a property-based testing framework (like https://hypothesis.readthedocs.io/en/latest/) would be helpful for finding all of these edge cases and handling them appropriately? It seems to have become a troublesome issue.
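To make the suggestion concrete, here is a rough sketch of the kind of property such a test would assert: for any label sequence, the decoding helper either returns a valid position or signals "no site", and never raises. To keep it dependency-free it uses the standard library's `random` as a stand-in for Hypothesis, which would additionally generate and shrink the failing inputs automatically. The helper `cleavage_site`, the label range 0 to 36, and the maximum length of 70 are all assumptions, not taken from SignalP.

```python
import random

SP_LABELS = {5, 10, 19, 25, 31}  # ids quoted earlier in this thread

def cleavage_site(labels):
    """Hypothetical helper: 1-based position after the last SP label, or None."""
    hits = [i for i, lab in enumerate(labels) if lab in SP_LABELS]
    return hits[-1] + 1 if hits else None

def check_never_crashes(trials=1000, seed=0):
    """Randomised property check over arbitrary label sequences."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(0, 70)  # includes the empty sequence
        labels = [rng.randint(0, 36) for _ in range(n)]
        site = cleavage_site(labels)
        # Property: the result is either "no site" or a position inside the sequence.
        assert site is None or 1 <= site <= n, labels
    return trials

print(check_never_crashes())  # 1000
```

Unlike running reference proteomes, a check like this also exercises inputs that never occur in real data, such as empty or all-PAD sequences.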

fteufel commented 2 years ago

Hi, please do keep bothering me! This helps a lot. It turns out this is a new bug I introduced myself with the 6.0e update (which was supposed to catch all edge cases). From now on I'll update my CI to also do the test runs with the eukarya mode enabled...

Anyway, I should have an updated version online tomorrow. If it's urgent, you could add the following quick fix at line 237 of predict.py:

  if args.organism == 'eukarya':
      global_probs[:,1] = global_probs[:,1:].sum(axis=1)
      global_probs[:,2:] = 0
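The effect of this quick fix can be seen on a toy array. The sketch below is an illustration only: the numbers are made up, and it assumes column 0 is the "no signal peptide" class and columns 1 onwards are signal peptide types; `organism` stands in for `args.organism`.

```python
import numpy as np

# Toy global probabilities: 2 sequences x 4 classes (values invented for illustration).
global_probs = np.array([
    [0.10, 0.20, 0.60, 0.10],
    [0.70, 0.10, 0.10, 0.10],
])

organism = "eukarya"  # stands in for args.organism
if organism == "eukarya":
    # Fold the probability mass of columns 1.. into column 1 ...
    global_probs[:, 1] = global_probs[:, 1:].sum(axis=1)
    # ... then zero the remaining columns so they can no longer be selected.
    global_probs[:, 2:] = 0

print(global_probs)
print(global_probs.argmax(axis=1))  # [1 0]
```

Before the fix the first row would have picked column 2; afterwards the only possible outcomes are columns 0 and 1, and each row still sums to 1.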

I'll look into this testing framework! So far we have relied on running large reference proteomes to identify the edge cases.

Will close the issue once the updated downloads go live.

darcyabjones commented 2 years ago

Hey again,

Sorry, just in case you haven't found other issues yet, I've got another one that still fails with your patch to 6.0e. Same error.

>P00000D45
MYSRLFYLKSSYIIYFEPLFSNAIINILSFINSLASPLTIFCFALSAQALSTIFYFRIFI
FIFHSWILLFHFYFTCSFKTYEHQHSKMVPAYRMQSPRALPRTYLYVWPYK

B10inform commented 2 years ago

Hi,

I am getting a similar issue with signalp6g. Any help would be great.

signalp6 -fasta ${!sample}.fasta -org euk -format txt -m slow-sequential --output_dir ${!sample}_signalP6

Predicting 6/6: 100%|██████████| 69500/69500 [3:50:56<00:00, 5.02sequences/s]
Traceback (most recent call last):
  File "/home/.local/bin/signalp6", line 8, in <module>
    sys.exit(predict())
  File "/home/.local/lib/python3.6/site-packages/signalp/__init__.py", line 6, in predict
    main()
  File "/home/.local/lib/python3.6/site-packages/signalp/predict.py", line 239, in main
    resolve_viterbi_marginal_conflicts(global_probs, marginal_probs, cleavage_sites, viterbi_paths)
  File "/home/.local/lib/python3.6/site-packages/signalp/utils.py", line 311, in resolve_viterbi_marginal_conflicts
    cleavage_sites[i] = sp_idx.max() +1
  File "/share/apps/python3-system/lib/python3.6/site-packages/numpy/core/_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity

Thanks

fteufel commented 2 years ago

Hi @B10inform, can you provide me with the FASTA data for which this occurs? I'll look into it then.

B10inform commented 2 years ago

Hi fteufel,

Here is the link to the FASTA file: https://solgenomics.net/ftp/genomes/Nicotiana_benthamiana/annotation/Niben101/Niben101_annotation.proteins.fasta.gz

B10inform commented 2 years ago

Hi fteufel,

Were you able to look into this issue?

Thanks

fteufel commented 2 years ago

Hi, I could not reproduce your error. It must be related to your installation: I reinstalled from the download server and the prediction finished without an error. I suspect it is

>ben101Scf02573g00010.1
MKAAAMSTPANAAPPMTALLAAFGGGVLSAVGCSAGEAPGPPAGVGAGGEPARPPAGAGD
GEVVEADGDGVGEVVGDGDGVAVGGDTAGAGTGVDGDGVGEVVGDGDGVAVGGDTAGAGT
GVGVAAGEILGAGAGD

that is causing the problem. It yields a malformed region prediction, but in the current version this only raises a warning and does not crash.
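A guard of the kind described, one that warns instead of crashing, could look roughly like the sketch below. This is not the code shipped with SignalP: the function name `cleavage_site_or_warn`, the `-1` sentinel, and the warning text are assumptions; only the label ids come from earlier in this thread.

```python
import warnings
import numpy as np

SP_LABELS = [5, 10, 19, 25, 31]  # ids quoted earlier in the thread

def cleavage_site_or_warn(region_preds, seq_id="?"):
    """Hypothetical guard: return the cleavage site, or -1 with a warning."""
    sp_idx = np.where(np.isin(region_preds, SP_LABELS))[0]
    if sp_idx.size == 0:
        # Malformed region prediction: report it and carry on with the batch.
        warnings.warn(f"{seq_id}: malformed region prediction, no cleavage site set")
        return -1
    return int(sp_idx.max()) + 1

print(cleavage_site_or_warn(np.array([0, 5, 5, 0])))  # 3
print(cleavage_site_or_warn(np.zeros(10, dtype=int), seq_id="toy"))  # -1, plus a UserWarning
```

With a check like this, one problematic sequence in a 69,500-sequence run produces a warning rather than discarding hours of prediction.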