Proteins lacking annotated domains prevent reference files from being generated. For most applications this does not cause problems. However, when a complete proteome is fetched, it leads to the following two issues:
An IndexError associated with the domain dictionaries. A fix for this was implemented, which then revealed the second issue.
A KeyError is thrown because certain data structures used for InterPro domain architectures are not populated with key-value pairs for these proteins. This issue traces back to the get_domains function, since domain_string_dict is the output of get_domains (see screenshot).
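The KeyError described above can be illustrated with a minimal, self-contained sketch. It does not call InterPro.py. The dictionary contents and the helper names `lookup_strict` and `lookup_tolerant` are hypothetical and only mimic a `domain_string_dict` that was never populated for proteins without domains.

```python
# Toy stand-in for domain_string_dict: only the protein with domains has an entry.
# The architecture string is a made-up placeholder, not real InterPro output.
domain_string_dict = {"P00533": "toy_domain_1|toy_domain_2"}
accessions = ["P00533", "Q0D2K0", "Q8TBZ9"]


def lookup_strict(acc):
    # Direct indexing raises KeyError for accessions that were never added.
    return domain_string_dict[acc]


def lookup_tolerant(acc):
    # dict.get falls back to an empty architecture instead of raising.
    return domain_string_dict.get(acc, "")


# Collect the accessions for which the strict lookup fails.
failed = []
for acc in accessions:
    try:
        lookup_strict(acc)
    except KeyError:
        failed.append(acc)

tolerant = {acc: lookup_tolerant(acc) for acc in accessions}
print("strict lookup failed for:", failed)
print("tolerant lookup:", tolerant)
```

Whether a tolerant lookup or populating the dictionary up front is the better fit depends on how the downstream code consumes domain_string_dict.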
Screenshots
Files
InterPro.py at the get_domains function and fetch_InterPro_json
Note: Although Q0D2K0 has no domain architecture, fetching its information from InterPro still returns a response, because the entry contains family and other annotations. The ID Q8TBZ9, however, has none of this information, which is also visible in the InterPro graphical interface (see below):
Diagnostic Code that helped pinpoint the problem
```python
# Assumes the InterPro module (InterPro.py) has already been imported.
prots = ['P00533', 'Q0D2K0', 'Q8TBZ9']
x = InterPro.fetch_InterPro_json(prots)
d_dict = {}
p_resolved = {}
for p in prots:
    inner_dict = {}
    d_resolved = []
    for i, entry in enumerate(x[p]['results']):
        # print(entry)  # Use this to see the actual entry values
        if entry['metadata']['type'] == 'domain':
            inner_dict[i] = InterPro.collect_data(entry)
    d_dict[p] = inner_dict
    values = list(inner_dict.keys())
    if values:
        # Start from the first domain, then resolve the remaining ones against it
        d_resolved += InterPro.return_expanded_domains(inner_dict[values[0]])
        for domain_num in values[1:]:
            d_resolved = InterPro.resolve_domain(d_resolved, inner_dict[domain_num])
    p_resolved[p] = d_resolved
    sorted_domain_list, domain_string_list, domain_arch = InterPro.sort_domain_list(d_resolved)
    print(domain_arch)
```
Code checking the different responses with respect to IDs lacking domain architectures:
Contains some annotations like protein family
No annotations result
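The three response shapes discussed here can be told apart with a small helper, sketched below on toy data. The helper name `classify_response` is hypothetical. The `results` / `metadata` / `type` keys follow the diagnostic code in this issue. Using `None` for an accession without any annotations is an assumption, since what fetch_InterPro_json actually stores in that case is not shown here.

```python
def classify_response(response):
    """Label a per-protein response as 'no_annotations', 'no_domains', or 'has_domains'."""
    # Assumption: a protein without any annotation yields None or an empty result list.
    if not response or not response.get("results"):
        return "no_annotations"
    types = [entry["metadata"]["type"] for entry in response["results"]]
    return "has_domains" if "domain" in types else "no_domains"


# Toy responses that only mimic the structure used in the diagnostic code.
toy_responses = {
    "P00533": {"results": [{"metadata": {"type": "domain"}},
                           {"metadata": {"type": "family"}}]},
    "Q0D2K0": {"results": [{"metadata": {"type": "family"}}]},
    "Q8TBZ9": None,
}

labels = {acc: classify_response(resp) for acc, resp in toy_responses.items()}
print(labels)
```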
Expected behavior
No error should prevent the reference file from being generated. Proteins lacking domain information should still be passed through with an empty domain architecture.
Tasks
[ ] Determine whether the behavior arises in the get_domains function or in the fetch_InterPro_json function
[ ] If the problem is in get_domains, ensure that an empty list is passed on for the accession ID. If the problem is in fetch_InterPro_json, change the except code block.
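One possible shape for the get_domains side of the fix is sketched below on toy data. It is not the actual get_domains implementation. The function name `collect_domain_ids`, the placeholder IDs, and the use of `None` for an unannotated protein are assumptions. The point is only that every requested accession ends up with an entry, which is empty when no domain is found.

```python
def collect_domain_ids(accessions, responses):
    """Return {accession: [domain IDs]}, with an empty list when no domain is present."""
    domain_ids = {}
    for acc in accessions:
        response = responses.get(acc) or {}       # missing or None becomes an empty dict
        results = response.get("results") or []   # missing or None becomes an empty list
        domain_ids[acc] = [
            entry["metadata"]["accession"]
            for entry in results
            if entry["metadata"]["type"] == "domain"
        ]
    return domain_ids


# Toy responses with made-up placeholder IDs, not real InterPro records.
toy_responses = {
    "P00533": {"results": [{"metadata": {"type": "domain", "accession": "TOY_DOMAIN_1"}},
                           {"metadata": {"type": "family", "accession": "TOY_FAMILY_1"}}]},
    "Q0D2K0": {"results": [{"metadata": {"type": "family", "accession": "TOY_FAMILY_2"}}]},
    "Q8TBZ9": None,
}

accessions = ["P00533", "Q0D2K0", "Q8TBZ9"]
domain_ids = collect_domain_ids(accessions, toy_responses)
print(domain_ids)
```

Because every accession has a key, downstream lookups no longer raise a KeyError, and the empty list can be turned into an empty domain architecture when the reference file is written.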