Error Running makeRefFile Function from UniProt.py

juliannahitchcock00 commented 4 months ago

Description

When running the makeRefFile function from UniProt.py, I keep receiving the error shown below. I have tried tracing the error back through the InterPro.py and UniProt.py files without success. I am running this function while using CoDIAC for the PH and SH3 domains. A couple of months ago, it ran without errors and produced the domain reference file as output.

Screenshots

Files

UniProt.py
InterPro.py

To Reproduce

Steps to reproduce the behavior:

1. Generate the UniProt ID list with the function fetch_uniprotids from InterPro.py. The parameters I used are (Interpro_ID, REVIEWED=True, species='Homo sapiens'), where Interpro_ID is IPR001452 for the SH3 domain and IPR001849 for the PH domain. The other two parameters remain unchanged.
2. Run the makeRefFile function. The parameters I used are (uniprot_IDs, uniprot_reference_file), where uniprot_IDs is the UniProt ID list returned in step 1 and uniprot_reference_file is the name of the output file.
3. Observe whether the error occurs.

For reference, my versions are Python 3.8.8, pandas 1.2.4, and Biopython 1.79.

Expected behavior

I expected the function to return a domain reference file.

Tasks

[ ] Test the individual steps involved to trace the root of the error. All functions/lines of code are in UniProt.py and InterPro.py. @juliannahitchcock00

knaegle commented 4 months ago

@juliannahitchcock00 can you please add more detail so that someone else can reproduce it?

For example, what is the Interpro ID? A copy of the specific code you are running would be very helpful.

juliannahitchcock00 commented 4 months ago

Here is a copy of the specific code I am running! Since I cannot attach it as a file, I just pasted the few lines below. This is for SH3 domain. For PH domain, all that changes are some parameters (outlined in the ticket).

import CoDIAC
from CoDIAC import featureTools, InterPro, UniProt, PDB, IntegrateStructure_Reference
import pandas as pd

Interpro_ID = 'IPR001452'  # IPR036028 is the SH3 domain superfamily, IPR001452 is the SH3 domain
data_root = 'SH3 Data/CURRENT/'
name_root = 'SH3' + Interpro_ID
uniprot_reference_file = data_root + name_root + '_uniprot_reference.csv'  # The uniprot reference file name

uniprot_IDs, species_dict = CoDIAC.InterPro.fetch_uniprotids(Interpro_ID, REVIEWED=True, species='Homo sapiens')

uniprot_df = CoDIAC.UniProt.makeRefFile(uniprot_IDs, uniprot_reference_file)

adshimpi commented 4 months ago

I had a similar issue when trying to regenerate a couple of reference files, which I was doing because a handful of unique domain architectures had misannotated domain names. I traced the bug back to the reference file generation and, with some more digging, to the generateDomainMetadata and collect_data functions within InterPro.py.

This did not occur when I last ran the code in December. However, the reference files that I was replacing were generated in September, so it is likely associated with changes in the InterPro database, which had a new release recently on Jan 24th. It may also have been introduced by the updates done on November 8th.

The source of the problem appears to be domains that are annotated but have no position within the protein itself. Without the entry positions, the collect_data function emits the message 'boundaries dictionary creates error', and no boundary key is created in the InterPro domain information dictionary used in downstream processes (Line 193 in InterPro.py).

An example of said problem occurs with the UniProt accession number Q96S82.

Code that has helped pinpoint the problem:

import requests
import CoDIAC

# Query the InterPro API for all InterPro entries matching one protein
interpro_url = "https://www.ebi.ac.uk/interpro/api"
protein_accession = 'Q96S82'
url = interpro_url + "/entry/interpro/protein/uniprot/" + protein_accession
resp = requests.get(url).json()

entry_list = []
domain_database = ["smart", "pfam", "profile", "prosite", "prints", "cdd", "tigrfams", "sfld", "panther"]

for entry in resp['results']:
    # Skip anything that is not annotated as a domain
    if entry['metadata']['type'] != 'domain':
        continue

    # Collect the per-entry information the same way the reference file generation does
    info_dict = {}
    interpro_dict = CoDIAC.InterPro.collect_data(entry, protein_accession, domain_database)
    info_dict['interpro'] = interpro_dict
    entry_list.append(info_dict)

What entry_list looks like:

The first two entries in the response from InterPro include one successful and one unsuccessful domain entry:

Successful:

Unsuccessful:

For the unsuccessful entry, entry_protein_locations is None, which is likely what creates the issue.

Potential solutions:

Have the catch statement in collect_data still create a boundary key (e.g. dictionary['boundary'] = []) and have downstream processes handle empty entries.
Add in code to remove domains that lack position information after the response.
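
Both options can be sketched outside of CoDIAC. The snippet below is only an illustration, not the actual fix: it assumes each result has a `proteins` list whose items carry `entry_protein_locations` (either a list of locations with `fragments`, or None, as observed above). The helper names `has_positions`, `drop_unpositioned`, and `safe_boundaries`, the shape of the returned boundaries, and the sample data are all hypothetical.

```python
def has_positions(entry):
    """Return True if any protein record in the entry has location data."""
    for protein in entry.get("proteins") or []:
        if protein.get("entry_protein_locations"):
            return True
    return False


def drop_unpositioned(results):
    """Option 2: keep only domain entries that carry position information."""
    return [e for e in results
            if e["metadata"]["type"] == "domain" and has_positions(e)]


def safe_boundaries(entry):
    """Option 1: always return a list of boundaries, empty when positions are missing."""
    boundaries = []
    for protein in entry.get("proteins") or []:
        # `or []` turns a None location field into an empty loop
        for location in protein.get("entry_protein_locations") or []:
            for fragment in location.get("fragments", []):
                boundaries.append({"start": fragment["start"], "end": fragment["end"]})
    return boundaries


# Hypothetical sample data: one positioned and one unpositioned domain entry.
sample_results = [
    {"metadata": {"accession": "IPR_A", "type": "domain"},
     "proteins": [{"entry_protein_locations": [
         {"fragments": [{"start": 10, "end": 60}]}]}]},
    {"metadata": {"accession": "IPR_B", "type": "domain"},
     "proteins": [{"entry_protein_locations": None}]},
]

kept = drop_unpositioned(sample_results)
print([e["metadata"]["accession"] for e in kept])
print(safe_boundaries(sample_results[1]))
```

With the first option, downstream code would still need to tolerate an empty boundary list; with the second, unpositioned domains never reach it.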

juliannahitchcock00 commented 3 months ago

To add to what @adshimpi posted, I am finding similar examples of successful and unsuccessful entries.

Here is an example of successful:

Here is an example of unsuccessful:

Here is the list of UniProt accession numbers that are unsuccessful: ['P18206', 'P27986', 'Q12959', 'Q15811', 'Q969F8', 'Q9UQB8', 'Q9UQF2']
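
A list like this could be rebuilt with a small scan over the accessions. The following is a rough sketch under assumptions: it relies on the same response shape discussed earlier in the thread (`results`, `metadata`, `proteins`, `entry_protein_locations`), it ignores any pagination of the API response, and the names `scan_accessions`, `fetch_from_interpro`, and `fake_fetch` are hypothetical. The fetch function is passed in so the logic can be tried on fake data without network access.

```python
def scan_accessions(accessions, fetch):
    """Map each accession to its domain entries that lack position information."""
    flagged = {}
    for accession in accessions:
        missing = []
        for entry in fetch(accession).get("results", []):
            if entry["metadata"]["type"] != "domain":
                continue
            proteins = entry.get("proteins") or []
            # An entry is unpositioned if no protein record has any location
            if not any(p.get("entry_protein_locations") for p in proteins):
                missing.append(entry["metadata"]["accession"])
        if missing:
            flagged[accession] = missing
    return flagged


def fetch_from_interpro(accession):
    """Fetch entries for one protein, using the URL pattern from this thread."""
    import requests  # imported here so the offline demo below needs no network library
    url = "https://www.ebi.ac.uk/interpro/api/entry/interpro/protein/uniprot/" + accession
    return requests.get(url).json()


# Offline demonstration with made-up responses (not real InterPro data).
fake_responses = {
    "ACC_OK": {"results": [
        {"metadata": {"accession": "IPR_A", "type": "domain"},
         "proteins": [{"entry_protein_locations": [
             {"fragments": [{"start": 5, "end": 50}]}]}]}]},
    "ACC_BAD": {"results": [
        {"metadata": {"accession": "IPR_B", "type": "domain"},
         "proteins": [{"entry_protein_locations": None}]}]},
}


def fake_fetch(accession):
    return fake_responses[accession]


flagged = scan_accessions(["ACC_OK", "ACC_BAD"], fake_fetch)
print(flagged)
```

To run against the live API, `fetch_from_interpro` would be passed in place of `fake_fetch`.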

knaegle commented 1 month ago

I believe this issue was closed when we handled protein domain ordering and hierarchy.

NaegleLab / CoDIAC