NaegleLab / CoDIAC

GNU General Public License v3.0

Proteins lacking domain architectures preventing reference file generation #57

Closed adshimpi closed 4 weeks ago

adshimpi commented 1 month ago

Description

Proteins that lack annotated domains prevent reference files from being generated. Most applications will not encounter this, but fetching a complete proteome leads to the following two issues:

  1. An IndexError associated with the domain dictionaries. A solution for this was generated, which revealed the second issue.
  2. A KeyError is thrown because certain data structures used for InterPro domain architectures are not populated with key-value pairs for these proteins. The error traces back to the get_domains function, since domain_string_dict is the output of get_domains (see screenshot).
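
The two failure modes can be illustrated without CoDIAC at all. The following is a minimal sketch, not the library's code: `first_domain` and `lookup_architecture` are hypothetical helpers, and the dictionary contents are placeholder strings rather than real InterPro output.

```python
def first_domain(domains):
    """Return the first domain, or None when the list is empty.

    Indexing an empty list (domains[0]) is what raises IndexError.
    """
    return domains[0] if domains else None


def lookup_architecture(domain_string_dict, accession):
    """Return the stored architecture, or '' when the accession has no key.

    domain_string_dict[accession] raises KeyError for a missing key;
    dict.get with a default does not.
    """
    return domain_string_dict.get(accession, "")


# Placeholder data: only one of the three accessions has an entry.
domain_string_dict = {"P00533": "domainA|domainB"}

for acc in ["P00533", "Q0D2K0", "Q8TBZ9"]:
    print(acc, repr(lookup_architecture(domain_string_dict, acc)))

print(first_domain([]))
```

Whether an empty string or an explicit marker is the better default depends on how downstream code reads the reference file; the empty string is used here only because the expected behavior asks for an empty domain architecture.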

Screenshots

image

Files

InterPro.py at the get_domains function and fetch_InterPro_json

To Reproduce

Steps to reproduce the behavior:

from CoDIAC import UniProt  # assumes CoDIAC is installed and importable under this name

prots = ['P00533', 'Q0D2K0', 'Q8TBZ9']
uniprot_df = UniProt.makeRefFile(prots, 'Debugging_Test.csv')

Note: Although Q0D2K0 has no domain architecture, fetching its information from InterPro still returns a response because the entry contains family and other annotations. The ID Q8TBZ9, however, has none of this information, which is also visible in the graphical interface of InterPro (see below):

image
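
The three cases described in the note (domains present, only non-domain annotations, nothing at all) could be told apart before any domain processing starts. This is a rough sketch under assumptions: `classify_response` is a hypothetical helper, the payloads are mock data, and the only structure relied on is the `['results']` list and `['metadata']['type']` key that the diagnostic code in this issue reads. How an empty InterPro response is actually represented is not confirmed here, so `None` and a missing `results` list are both treated as empty.

```python
def classify_response(payload):
    """Classify a per-protein payload into one of three cases.

    Assumes payload is None/empty or a dict with a 'results' list whose
    entries carry entry['metadata']['type'].
    """
    if not payload or not payload.get("results"):
        return "no_annotations"
    types = {entry["metadata"]["type"] for entry in payload["results"]}
    return "has_domains" if "domain" in types else "non_domain_only"


# Mock payloads (hypothetical, not real InterPro output).
mock_payloads = {
    "with_domain": {
        "results": [
            {"metadata": {"type": "domain"}},
            {"metadata": {"type": "family"}},
        ]
    },
    "family_only": {"results": [{"metadata": {"type": "family"}}]},
    "empty": None,
}

for label, payload in mock_payloads.items():
    print(label, classify_response(payload))
```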

Diagnostic Code that helped pinpoint the problem

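from CoDIAC import InterPro  # assumes CoDIAC is installed and importable under this name

prots = ['P00533', 'Q0D2K0', 'Q8TBZ9']
x = InterPro.fetch_InterPro_json(prots)
d_dict = {}      # per-protein dict of collected domain entries
p_resolved = {}  # per-protein list of resolved domains
for p in prots:
    inner_dict = {}
    d_resolved = []
    # Keep only the InterPro entries whose type is 'domain'
    for i, entry in enumerate(x[p]['results']):
        # print(entry)  # Use this to see the actual entry values
        if entry['metadata']['type'] == 'domain':
            inner_dict[i] = InterPro.collect_data(entry)
    d_dict[p] = inner_dict

    # Indices of the entries that were kept; empty when the protein has no domains
    values = list(inner_dict.keys())
    if values:
        d_resolved += InterPro.return_expanded_domains(inner_dict[values[0]])

    # Resolve each remaining domain entry against those already collected
    for domain_num in values[1:]:
        d_resolved = InterPro.resolve_domain(d_resolved, inner_dict[domain_num])
    p_resolved[p] = d_resolved
    # For proteins without domains, d_resolved is still empty at this point
    sorted_domain_list, domain_string_list, domain_arch = InterPro.sort_domain_list(d_resolved)
    print(domain_arch)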

Code checking the different responses with respect to IDs lacking domain architectures:

Contains some annotations like protein family

image

No annotations returned

image

Expected behavior

No error should prevent the reference file from being generated, and proteins lacking domain information should still be passed through with an empty domain architecture.
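
One way to picture this behavior is a mapping in which every requested accession gets an entry, with an empty string when no domains are found. The sketch below is illustrative only and does not call CoDIAC: `domain_architecture` and `build_architectures` are hypothetical helpers, the payloads are mock data, the `name` key under `metadata` is an assumption, and the `|` separator is an arbitrary choice.

```python
def domain_architecture(payload, sep="|"):
    """Join the names of 'domain' entries; return '' when there are none."""
    results = (payload or {}).get("results") or []
    names = [
        entry["metadata"]["name"]
        for entry in results
        if entry["metadata"]["type"] == "domain"
    ]
    return sep.join(names)


def build_architectures(accessions, responses):
    """Map every accession to an architecture string, even without a response."""
    return {acc: domain_architecture(responses.get(acc)) for acc in accessions}


# Mock responses (hypothetical): one protein with a domain, one with only a
# family annotation, and one with no response at all.
responses = {
    "P00533": {
        "results": [
            {"metadata": {"type": "domain", "name": "domainA"}},
            {"metadata": {"type": "family", "name": "familyX"}},
        ]
    },
    "Q0D2K0": {"results": [{"metadata": {"type": "family", "name": "familyY"}}]},
}

architectures = build_architectures(["P00533", "Q0D2K0", "Q8TBZ9"], responses)
print(architectures)
```

Because the result always contains one key per requested accession, later lookups by accession would not raise a KeyError for proteins without domains.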
