Closed juliannahitchcock00 closed 1 month ago
@juliannahitchcock00 can you please add more detail so that someone else can reproduce it?
For example, what is the InterPro ID? A copy of the specific code you are running would be very helpful.
Here is a copy of the specific code I am running. Since I cannot attach it as a file, I have pasted the few lines below. This is for the SH3 domain. For the PH domain, only some parameters change (outlined in the ticket).
import CoDIAC
from CoDIAC import featureTools, InterPro, UniProt, PDB, IntegrateStructure_Reference
import pandas as pd

Interpro_ID = 'IPR001452'  # IPR036028 is the SH3 domain superfamily, IPR001452 is the SH3 domain
data_root = 'SH3 Data/CURRENT/'
name_root = 'SH3' + Interpro_ID
uniprot_reference_file = data_root + name_root + '_uniprot_reference.csv'  # The UniProt reference file name
uniprot_IDs, species_dict = CoDIAC.InterPro.fetch_uniprotids(Interpro_ID, REVIEWED=True, species='Homo sapiens')
uniprot_df = CoDIAC.UniProt.makeRefFile(uniprot_IDs, uniprot_reference_file)
I had a similar issue when trying to regenerate a couple of reference files because a handful of unique domain architectures had misannotated domain names. I tracked the bug back to reference file generation and, with some more digging, to the generateDomainMetadata and collect_data functions within InterPro.py.
This did not occur when I last ran the code in December. However, the reference files that I was replacing were generated in September, so it is likely associated with changes in the InterPro database, which had a new release recently on Jan 24th. It may also have been introduced by the updates made on November 8th.
The source of the problem appears to be domains that are annotated but have no position within the protein itself. Without the entry positions, the collect_data function prints the message 'boundaries dictionary creates error', and no boundary key is created in the InterPro domain information dictionary used in downstream processes (line 193 in InterPro.py).
An example of said problem occurs with the UniProt accession number Q96S82.
Code that has helped pinpoint the problem:
import requests
import CoDIAC
from CoDIAC import InterPro

interpro_url = "https://www.ebi.ac.uk/interpro/api"
protein_accession = 'Q96S82'
url = interpro_url + "/entry/interpro/protein/uniprot/" + protein_accession
resp = requests.get(url).json()

entry_list = []
domain_database = ["smart", "pfam", "profile", "prosite", "prints", "cdd", "tigrfams", "sfld", "panther"]
for entry in resp['results']:
    # Only domain-type entries are of interest here
    if entry['metadata']['type'] != 'domain':
        continue
    info_dict = {}
    interpro_dict = CoDIAC.InterPro.collect_data(entry, protein_accession, domain_database)
    info_dict['interpro'] = interpro_dict
    entry_list.append(info_dict)
What entry_list looks like:
Looking at the first two entries in the InterPro response, which include one successful and one unsuccessful domain entry:
Successful:
Unsuccessful:
For the unsuccessful entry, entry_protein_locations is None, which is likely what is causing the issue.
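One way to confirm this is to scan the response for domain entries whose protein match carries no locations. The following is a minimal sketch, not CoDIAC code: `find_entries_without_locations` is a hypothetical helper, the field layout (`metadata`, `proteins`, `entry_protein_locations`) is assumed from the response described above, and the sample dictionary is made up for illustration rather than taken from InterPro.

```python
def find_entries_without_locations(results):
    """Return accessions of domain entries whose protein match has no locations."""
    missing = []
    for entry in results:
        metadata = entry.get("metadata", {})
        # Mirror the filter used above: only domain-type entries are checked
        if metadata.get("type") != "domain":
            continue
        proteins = entry.get("proteins") or []
        # Flag the entry if there is no protein match at all, or if any
        # match has entry_protein_locations set to None or an empty list
        if not proteins or any(not p.get("entry_protein_locations") for p in proteins):
            missing.append(metadata.get("accession"))
    return missing


# Illustrative placeholder data (not a real InterPro response)
sample_results = [
    {"metadata": {"accession": "IPR_EXAMPLE_1", "type": "domain"},
     "proteins": [{"entry_protein_locations": [{"fragments": [{"start": 10, "end": 60}]}]}]},
    {"metadata": {"accession": "IPR_EXAMPLE_2", "type": "domain"},
     "proteins": [{"entry_protein_locations": None}]},
    {"metadata": {"accession": "IPR_EXAMPLE_3", "type": "family"},
     "proteins": [{"entry_protein_locations": None}]},
]

print(find_entries_without_locations(sample_results))  # ['IPR_EXAMPLE_2']
```

With a live response, this would be called as `find_entries_without_locations(resp['results'])`.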
Potential solutions:
To add to what @adshimpi posted, I am finding similar examples of successful and unsuccessful entries.
Here is an example of successful:
Here is an example of unsuccessful:
Here is the list of UniProt accession numbers that are unsuccessful: ['P18206', 'P27986', 'Q12959', 'Q15811', 'Q969F8', 'Q9UQB8', 'Q9UQF2']
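A list like this could be rebuilt by checking each accession for domain entries that lack locations. The sketch below is only an illustration under assumptions: the helper names are hypothetical, the response layout (`results`, `metadata`, `proteins`, `entry_protein_locations`) is assumed from the responses discussed in this thread, and it uses the standard library rather than `requests` so that it has no extra dependencies. The `fetch` argument is injectable so the logic can be exercised without network access.

```python
import json
import urllib.request

INTERPRO_API = "https://www.ebi.ac.uk/interpro/api"


def fetch_results(accession):
    """Download the InterPro entries for one UniProt accession (needs network)."""
    url = INTERPRO_API + "/entry/interpro/protein/uniprot/" + accession
    with urllib.request.urlopen(url, timeout=30) as response:
        body = response.read()
    # Treat an empty body as "no entries" instead of failing on JSON parsing
    return json.loads(body).get("results", []) if body else []


def has_unplaced_domain(results):
    """Return True if any domain entry has a protein match without locations."""
    for entry in results:
        if entry["metadata"]["type"] != "domain":
            continue
        for protein in entry.get("proteins") or []:
            if not protein.get("entry_protein_locations"):
                return True
    return False


def unsuccessful_accessions(accessions, fetch=fetch_results):
    """Return the accessions that have at least one domain without locations."""
    return [acc for acc in accessions if has_unplaced_domain(fetch(acc))]
```

Calling `unsuccessful_accessions(uniprot_IDs)` would issue one request per accession, so for a full domain family it may be worth adding a pause between requests.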
I believe this issue was closed when we handled protein domain ordering and hierarchy.
Description
When running the makeRefFile function from UniProt.py, I continue to receive the error stated below. I have tried tracing the error back through the InterPro.py and UniProt.py files, but have not been successful. I am attempting to run this function when employing CoDIAC for the PH and SH3 domains. A couple of months ago, it ran without errors and successfully produced the domain reference file as an output.
Screenshots
Files
To Reproduce
Steps to reproduce the behavior:
For reference, my versions are Python 3.8.8, pandas 1.2.4, and Biopython 1.79.
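For anyone comparing environments, the versions can be collected programmatically. This is a small sketch using only the standard library (`importlib.metadata` requires Python 3.8 or later); `installed_versions` is a hypothetical helper, and the distribution names passed to it are assumed to be the names the packages were installed under.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map 'python' and each package name to its version, or None if not installed."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment
            versions[name] = None
    return versions


print(installed_versions(["pandas", "biopython", "requests"]))
```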
Expected behavior
I expected the function to return a domain reference file.