ebi-pf-team / interproscan

Genome-scale protein function classification
Apache License 2.0

Lookup service gives false negative - gives empty hits when there should be (v. 5.66-98.0) #357

Closed cmunk closed 3 months ago

cmunk commented 7 months ago

Main issue

We got no hits when using our local lookup service, but did get hits when we bypassed it. We are now worried that we cannot trust the results we have generated using the lookup service. Below is an example sequence where we saw this; across a range of genomes, more than 50% of proteins were falsely left unannotated, and we had to disable the lookup service to get our annotations.

Version

We have relied on InterProScan and the lookup service at version 5.66-98.0. (We are aware there is a newer version and will test that too once we have downloaded it.)

Sequence

We saw that we were missing annotations for the following sequence:

MKTKLTGVALLLAAVSLGSTQPVEACTRAVYIGPEQMVITGRTMDWKEDLHSNLYVFPRGIQRTGHNKEKTLNWTSKYGSIVATGYDIGTCDGMNEKGLVASLLFLPETIYSLPGDTRPVMGISIWTQYVLDNFATVREAVNELKKETFRIDAPRLPNGSESTLHLAITDETGNTAILEYLDGKLSIHEGKQYQVMTNSPRYEYQLAINDYWKEVGGLQMLPGTNRSSDRFVRASFYIHAIPQTSDAKIAVPSVLSVMRNVSVPFGITTPDKPYISSTRWRTVSDQKNKVYYFESTLTPNLFWLDLKKIDFSPKASIKKLSLANGEIYAGDAVKDLKDSKSFTFLFQTPVM

It has the following MD5: 350CBC1A5F0A7E36297C0497D69C197F
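For anyone wanting to query the service for their own sequences, here is a minimal sketch of deriving such a key. It assumes the lookup key is the MD5 of the uppercase residue string with all whitespace removed, written as uppercase hex; that normalisation is an assumption, not something stated in this thread. The function name `sequence_md5` is purely illustrative.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """Return an uppercase hex MD5 digest for a protein sequence.

    Assumption: whitespace and line breaks are dropped and residues are
    uppercased before hashing. Adjust if your pipeline normalises differently.
    """
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest().upper()
```

Passing the sequence above to `sequence_md5` and comparing the result with the MD5 quoted in this report is a quick way to check whether the assumed normalisation matches what the service expects.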

Main question

cmunk commented 6 months ago

We've now downloaded the newest version and spun it up and it does indeed give results:

$ curl http://localhost:8085/version
SERVER:5.67-99.0
$ curl http://localhost:8085/matches?md5=350CBC1A5F0A7E36297C0497D69C197F
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<kvSequenceEntryXML>
    <matches>
        <match>
            <matchId>557607008</matchId>
            <proteinMD5>350CBC1A5F0A7E36297C0497D69C197F</proteinMD5>
            <hit>SUPERFAMILY,1.75,SSF56235,0045430,26,328,26-328-S,0.0,1.39E-95,,0,0,342,0,0,0.0,1.39E-95,</hit>
            <hit>PHOBIUS,1.01,NON_CYTOPLASMIC_DOMAIN,NON_CYTOPLASMIC_DOMAIN,26,351,26-351-S,0.0,0.0,,0,0,0,0,0,0.0,0.0,</hit>
            <hit>PANTHER,18.0,PTHR35527,PTHR35527:SF2,9,344,9-344-S,395.9,6.0E-115,..,6,346,0,1,348,395.9,6.0E-115,AN18</hit>
            <hit>PHOBIUS,1.01,SIGNAL_PEPTIDE_N_REGION,SIGNAL_PEPTIDE_N_REGION,1,4,1-4-S,0.0,0.0,,0,0,0,0,0,0.0,0.0,</hit>
            <hit>SIGNALP_EUK,4.1,SignalP-noTM,SignalP-noTM,1,25,1-25-S,0.765,0.0,,0,0,0,0,0,0.765,0.0,</hit>
            <hit>CDD,3.20,cd01902,cd01902,25,312,25-312-S,481.451,0.0,,0,0,0,0,0,481.451,0.0,</hit>
            <hit>SIGNALP_GRAM_POSITIVE,4.1,SignalP-TM,SignalP-TM,1,25,1-25-S,0.718,0.0,,0,0,0,0,0,0.718,0.0,</hit>
            <hit>PFAM,36.0,PF02275,PF02275,26,310,26-310-S,167.8,1.5E-45,[.,1,307,316,26,319,164.8,1.3E-44,</hit>
            <hit>PHOBIUS,1.01,SIGNAL_PEPTIDE_C_REGION,SIGNAL_PEPTIDE_C_REGION,18,25,18-25-S,0.0,0.0,,0,0,0,0,0,0.0,0.0,</hit>
            <hit>PHOBIUS,1.01,SIGNAL_PEPTIDE,SIGNAL_PEPTIDE,1,25,1-25-S,0.0,0.0,,0,0,0,0,0,0.0,0.0,</hit>
            <hit>SIGNALP_GRAM_NEGATIVE,4.1,SignalP-noTM,SignalP-noTM,1,25,1-25-S,0.719,0.0,,0,0,0,0,0,0.719,0.0,</hit>
            <hit>PHOBIUS,1.01,SIGNAL_PEPTIDE_H_REGION,SIGNAL_PEPTIDE_H_REGION,5,17,5-17-S,0.0,0.0,,0,0,0,0,0,0.0,0.0,</hit>
            <hit>GENE3D,4.3.0,G3DSA:3.60.60.10,3hbcA00,23,341,23-341-S,408.9,6.5E-122,[],2,318,320,23,341,408.7,7.4E-122,</hit>
        </match>
    </matches>
</kvSequenceEntryXML>

My main question then is whether there was an issue with the previous lookup service or if it somehow got "corrupted" at our end?
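To compare responses between service versions at a glance, the `/matches` XML can be reduced to a count of hits per analysis. This is a rough sketch only: it assumes the first comma-separated field of each `<hit>` is the member database name, which is what the output above suggests but is not documented in this thread. `hits_per_database` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def hits_per_database(xml_text: str) -> Counter:
    """Count <hit> elements per member database in a /matches response.

    Assumption: the first comma-separated field of each hit is the
    database name (e.g. PFAM, PHOBIUS), as seen in the output above.
    """
    counts = Counter()
    for hit in ET.fromstring(xml_text).iter("hit"):
        fields = (hit.text or "").split(",")
        if fields[0]:
            counts[fields[0]] += 1
    return counts
```

An empty `Counter` corresponds to a response with no hits at all, such as a bare `<matches/>` element.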

matthiasblum commented 6 months ago

Hi @cmunk,

Sorry for the late reply. I can reproduce the issue. The sequence is pre-calculated (and is tagged as such in the lookup service), yet no matches are returned.

Starting the service:

$ java -Xmx36000m -jar server-5.66-98.0-jetty-console.war --headless --port 8080

Checking the version (in another terminal):

$ curl 'http://localhost:8080/version'
SERVER:5.66-98.0

Checking whether the sequence has been pre-calculated:

$ curl 'http://localhost:8080/isPrecalculated?md5=350CBC1A5F0A7E36297C0497D69C197F'
350CBC1A5F0A7E36297C0497D69C197F

Getting pre-calculated matches:

$ curl 'http://localhost:8080/matches?md5=350CBC1A5F0A7E36297C0497D69C197F'
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<kvSequenceEntryXML>
    <matches/>
</kvSequenceEntryXML>

It was definitely not a corrupted file on your end, but something that went wrong on ours. I am very sorry about that.
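The manual check above (flagged as pre-calculated, yet `<matches/>` is empty) can be scripted. The sketch below uses only the two endpoints shown in this thread; the base URL, the function names and the 30-second timeout are assumptions. Note that a pre-calculated sequence may legitimately have no matches, so a positive result only marks the MD5 as suspect and worth re-running locally.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "http://localhost:8080"  # assumption: local lookup service


def is_flagged_precalculated(body: str, md5: str) -> bool:
    """True if the /isPrecalculated response body lists the given MD5."""
    listed = {line.strip().upper() for line in body.splitlines()}
    return md5.upper() in listed


def count_hits(matches_xml: str) -> int:
    """Number of <hit> elements in a /matches response."""
    return sum(1 for _ in ET.fromstring(matches_xml).iter("hit"))


def is_suspect(precalc_body: str, matches_xml: str, md5: str) -> bool:
    """Flagged as pre-calculated but returned without any hits."""
    return is_flagged_precalculated(precalc_body, md5) and count_hits(matches_xml) == 0


def fetch(endpoint: str, md5: str, base_url: str = BASE_URL) -> str:
    """GET <base_url>/<endpoint>?md5=<md5> and return the body as text."""
    query = urllib.parse.urlencode({"md5": md5})
    with urllib.request.urlopen(f"{base_url}/{endpoint}?{query}", timeout=30) as resp:
        return resp.read().decode("utf-8")


def check_md5(md5: str, base_url: str = BASE_URL) -> bool:
    """Query both endpoints and report whether the MD5 looks suspect."""
    return is_suspect(
        fetch("isPrecalculated", md5, base_url),
        fetch("matches", md5, base_url),
        md5,
    )
```

The parsing is kept separate from the network calls so that saved responses can be checked without a running service. Against a running instance, `check_md5("350CBC1A5F0A7E36297C0497D69C197F")` would perform the same two requests as the curl commands above.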

cmunk commented 6 months ago

Thanks for your reply, @matthiasblum. Is it possible to get some insight into the cause, and into whether we can be confident that this will not happen again? We very much enjoy relying on the lookup service, but we also need to be able to trust it.

matthiasblum commented 6 months ago

I asked @tgrego to look into it. He will update this issue.

tgrego commented 6 months ago

Hello @cmunk, thank you for reporting this issue. We can confirm that not only version 5.66-98.0 but also a few of our other recent versions were affected by it. The latest version, 5.67-99.0, is not affected as far as I can tell. The issue seems to be related to the newest UniParc sequences and affects a small number of proteins. The example sequence you provided does indeed have a high UniParc ID. To help us get deeper insight into the issue, could you provide us with a list of the sequences/MD5s that were affected in your previous run? That would really help us, if possible.

Thank you and best regards

cmunk commented 6 months ago

Hi @tgrego, thank you for looking into this. Sadly, I do not have the IDs of the false negatives, as the data has since been overwritten with results generated without the lookup service. But yes, it did seem to be a smaller subset of the data, though I could not deduce a pattern.

tgrego commented 3 months ago

The issue was missing data in the lookup service for a small but increasing number of sequences with high UniParc IDs, affecting versions 5.63-95.0 to 5.66-98.0. It was fixed from version 5.67-99.0 onwards.
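For anyone running a local lookup service, a minimal sketch for testing a `/version` response against the affected range stated above (5.63-95.0 to 5.66-98.0, taken here as inclusive). The helper names are hypothetical, and the comparison assumes version strings always follow the `major.minor-data.patch` pattern seen in this thread.

```python
import re

# Affected range as stated in this issue, treated as inclusive.
AFFECTED_FIRST = (5, 63, 95, 0)
AFFECTED_LAST = (5, 66, 98, 0)


def parse_version(text: str) -> tuple:
    """Extract (5, 66, 98, 0) from strings like 'SERVER:5.66-98.0'."""
    match = re.search(r"(\d+)\.(\d+)-(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no InterProScan-style version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_affected(version_text: str) -> bool:
    """True if the version falls inside the affected range."""
    return AFFECTED_FIRST <= parse_version(version_text) <= AFFECTED_LAST
```

Results produced through the lookup service of a release in that range may be worth re-running with the lookup disabled or with a later release.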