Error: U in protein sequence

dayuer2010 commented 6 years ago

Dear dansondergaard, first, congratulations on developing such good software. I have installed it. With test.fa it works successfully, but if I use a protein with a "U" in its sequence, it doesn't work and produces an error. If I submit the same protein with "U" to the TMHMM website, it works and produces a prediction result. Can you help me solve this problem?

dansondergaard commented 6 years ago

Can you please copy-paste the error message you get? Thanks.

DavidVillalta commented 6 years ago

Similar to dayuer2010's comment, it does not respond well to 'Z', 'X', 'B', or '-' (gaps), though the website seems to handle all of these well enough. I'm attempting to use the code as an imported library in Python 3.6 (see the lines below for the error type), though errors for these characters appear when it is run from the command line as well. Without these characters, it runs and produces the outputs one would expect, but the "RuntimeWarning: divide by zero encountered in log" appears all the same. I hope this helps shed some light on things. Thanks very much for making it available.

In [18]: annotation, posterior = tmhmm.predict('MREXNNQSSTLEFILLGVTGQQEQEDFFYILFLFIYPITLIGNLLIVLAICSDVRLHNPMYFLLANLSLVDIFFSSVTIPKMLANHLLGSKSISFGGCLTQMYFMIALGNTDSYILAAMAYDRAVAISRPLHYTTIMSPRSCIWLIAGSWVIGNANALPHTLLTASLSFCGNQEVANFYCDITPLLKLSCSDIHFHVKMMYLGVGIFSVPLLCIIVSYIRVFSTVFQVPSTKGVLKAFSTCGSHLTVVSLYYGTVMGTYFRPLTNYSLKDAVITVMYTAVTPMLNPFIYSLRNRDMKAALRKLFNKRISS', '/foo/foo/foo/bar/TMHMM2.0.model')
/foo/foo/anaconda3/lib/python3.6/site-packages/tmhmm/__init__.py:21: RuntimeWarning: divide by zero encountered in log
  _, path = viterbi(sequence, *model)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-18-e2aa6964c763> in <module>()
----> 1 annotation, posterior = tmhmm.predict('MREXNNQSSTLEFILLGVTGQQEQEDFFYILFLFIYPITLIGNLLIVLAICSDVRLHNPMYFLLANLSLVDIFFSSVTIPKMLANHLLGSKSISFGGCLTQMYFMIALGNTDSYILAAMAYDRAVAISRPLHYTTIMSPRSCIWLIAGSWVIGNANALPHTLLTASLSFCGNQEVANFYCDITPLLKLSCSDIHFHVKMMYLGVGIFSVPLLCIIVSYIRVFSTVFQVPSTKGVLKAFSTCGSHLTVVSLYYGTVMGTYFRPLTNYSLKDAVITVMYTAVTPMLNPFIYSLRNRDMKAALRKLFNKRISS', '/foo/foo/foo/bar/TMHMM2.0.model')

~/anaconda3/lib/python3.6/site-packages/tmhmm/__init__.py in predict(sequence, model_or_filelike, compute_posterior)
     19         _, model = parse(open(model_or_filelike))
     20 
---> 21     _, path = viterbi(sequence, *model)
     22     if compute_posterior:
     23         forward_table, constants = forward(sequence, *model)

tmhmm/hmm.pyx in tmhmm.hmm.viterbi()

KeyError: 'X'
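
A quick way to see which residues would trigger this KeyError before calling tmhmm.predict is to check the sequence against the characters the model knows. The sketch below assumes the model alphabet is the 20 standard amino acids, which is not confirmed anywhere in this thread; find_unsupported is a hypothetical helper, not part of tmhmm.py.

```python
# Assumption: the model only recognises the 20 standard amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def find_unsupported(sequence):
    """Return (0-based position, character) pairs outside the assumed alphabet."""
    return [
        (index, residue)
        for index, residue in enumerate(sequence)
        if residue not in STANDARD_AMINO_ACIDS
    ]


# First ten residues of the sequence from the traceback above.
print(find_unsupported("MREXNNQSST"))  # [(3, 'X')]
```

An empty list would mean the sequence contains only characters from the assumed alphabet.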

dansondergaard commented 6 years ago

Thanks, @DavidVillalta! I'll have a look at it this week.

dansondergaard commented 6 years ago

@DavidVillalta, it seems that TMHMM handles this in a pretty weird way (at least I can't figure out how they get their results). I tested with the sequence XXXBBBUUU---ZZZ.

TMHMM web server output:

# WEBSEQUENCE
#   AA  inside   membr    outside
1   X   0.52190  0.00000  0.4781
2   X   0.52190  0.00000  0.4781
3   X   0.52190  0.00000  0.4781
4   B   0.52190  0.00000  0.4781
5   B   0.52190  0.00000  0.4781
6   B   0.52190  0.00000  0.4781
7   X   0.52190  0.00000  0.4781
8   X   0.52190  0.00000  0.4781
9   X   0.52190  0.00000  0.4781
10  Z   0.52190  0.00000  0.4781
11  Z   0.52190  0.00000  0.4781
12  Z   0.52190  0.00000  0.4781

So it seems that they stripped the gaps (-), but kept everything else. However, it's very difficult to figure out how they handle this as it's not documented anywhere. Would you be happy with a solution where the ambiguous characters are just stripped?
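
For reference, the stripping approach could look roughly like the sketch below. It assumes the supported alphabet is the 20 standard amino acids, and strip_unsupported is a hypothetical name used only for illustration, not an existing function in tmhmm.py.

```python
# Assumption: only the 20 standard amino acids are supported by the model.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def strip_unsupported(sequence):
    """Drop every character that is not in the assumed alphabet."""
    return "".join(
        residue for residue in sequence if residue in STANDARD_AMINO_ACIDS
    )


print(strip_unsupported("MREXNNQ"))  # MRENNQ
```

Note that the test sequence XXXBBBUUU---ZZZ would be reduced to an empty string this way, so the caller would need to handle that case before running a prediction. Stripping also shifts the coordinates of every residue after a removed character.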

DavidVillalta commented 6 years ago

I had imagined the ambiguous characters get assigned a score based on their neighbors' scores, plus the proximity to the end/beginning of a predicted TMH (perhaps by taking an average TMH length, either specific to the protein or a generalized one), but indeed, they are all ambiguous and yet they get assigned a score. Curious. After putting in a request at http://www.cbs.dtu.dk/cgi-bin/nph-sw_request?tmhmm it appears that the standalone version handles them too (except gaps), though I do not speak Perl and cannot say how. Unfortunately, I will need to preserve at least two ambiguous variables, and I work primarily with Python, so it was nice to find your adaptation to the language.
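
If the ambiguous positions have to survive in the output, one possible workaround is to remove them only for the prediction and then map the per-residue annotation back onto the original coordinates. This is a sketch under several assumptions: the model alphabet is taken to be the 20 standard amino acids, the annotation is taken to be a string with one label per residue, and split_supported, expand_annotation, and the "?" placeholder are all hypothetical. It does not reproduce whatever the TMHMM web server does with these characters.

```python
# Assumption: only the 20 standard amino acids are supported by the model.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def split_supported(sequence):
    """Return the cleaned sequence and the original index of each kept residue."""
    kept = [
        (index, residue)
        for index, residue in enumerate(sequence)
        if residue in STANDARD_AMINO_ACIDS
    ]
    cleaned = "".join(residue for _, residue in kept)
    positions = [index for index, _ in kept]
    return cleaned, positions


def expand_annotation(annotation, positions, original_length, placeholder="?"):
    """Spread a per-residue annotation back over the original coordinates."""
    if len(annotation) != len(positions):
        raise ValueError("annotation and positions must have the same length")
    expanded = [placeholder] * original_length
    for label, index in zip(annotation, positions):
        expanded[index] = label
    return "".join(expanded)


sequence = "MREXNNQ"
cleaned, positions = split_supported(sequence)
# Made-up stand-in for the annotation a prediction on `cleaned` would give.
example_annotation = "iiiMMM"
print(expand_annotation(example_annotation, positions, len(sequence)))  # iii?MMM
```

The positions that were removed show up as the placeholder, so they remain visible in the final annotation instead of silently disappearing.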

dansondergaard commented 6 years ago

Could you e-mail me the Perl implementation? Maybe I can figure out how they handled it from the code, even though I'm not very familiar with Perl either.

DavidVillalta commented 6 years ago

Sorry for providing a broken link, but I think I have put it in correctly now. To download the script, just fill out the form with an academic e-mail address. The reply with a download link is automated and nearly instantaneous. I would e-mail it to you, but the license agreement is pretty explicit about not sharing it outside of my "research site".

DavidVillalta commented 6 years ago

@dansondergaard I think I am going to be able to use this as is, after all. Thanks for the help. I do have one more question, though: where is "compute_posterior=False" supposed to be inserted if I want to turn off these outputs? EDIT: Figured it out, thanks anyway.

dansondergaard commented 6 years ago

Hi Dave, good to hear that it can be used anyway! It’s supposed to go in the call to predict:

annotation, posterior = tmhmm.predict(sequence, 'mymodel.model', compute_posterior=False)

0yliu commented 6 years ago

Hi, dansondergaard,

Thank you for putting in the effort to create this Python package. Have you figured out the bug with U? I have also encountered a similar issue.

Thanks a lot!

dansondergaard commented 6 years ago

@0yliu I don't have any plans to fix this at the moment since it's hard to figure out how TMHMM handles these cases. None of it is documented anywhere.

dansondergaard / tmhmm.py
