Micromeda / pygenprop

A Python library for programmatic usage of EBI InterPro Genome Properties.
http://pygenprop.rtfd.io/
Apache License 2.0

PrositePatterns have no e-value and cause a value error when they are inserted into Micromeda files. #74

Closed LeeBergstrand closed 3 years ago

LeeBergstrand commented 3 years ago

Problem Description

PrositePatterns is a pattern-matching protein database. It supports only exact matches, not partial matches. Because a pattern either matches or does not, the score (e-value) column of the InterProScan TSV is left blank for these hits. Pygenprop is not currently compatible with a blank e-value column in the InterProScan TSV.

Errors

If the e-value column contains no value, parsing fails because a blank e-value gets recorded as np.nan, and np.nan cannot be written to the Micromeda file's (SQLite) e-value column.

  File "/Users/lee/Dropbox/RandD/Repositories/pygenprop/pygenprop/results.py", line 721, in connect_step_assignments_to_interproscan_matches
    current_interproscan = unique_interproscan_dict[interpro_signature][protein_identifier][e_value]
KeyError: nan
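
The `KeyError: nan` above is consistent with how NaN behaves as a dictionary key in Python: NaN never compares equal to itself, so a lookup only succeeds when the very same NaN object is reused. The sketch below illustrates that general behaviour with plain floats. It is not Pygenprop code, and the names `lookup_succeeds`, `stored_key`, and `parsed_again` are hypothetical.

```python
def lookup_succeeds(mapping, key):
    """Return True if `key` can be found in `mapping`, False on KeyError."""
    try:
        mapping[key]
        return True
    except KeyError:
        return False


# A NaN produced while building the dictionary (hypothetical stand-in for
# a blank e-value that was parsed as NaN).
stored_key = float("nan")
matches = {stored_key: "match without an e-value"}

# A second NaN produced by a separate parse of the same blank cell.
parsed_again = float("nan")

# Same object: the dict's identity shortcut finds the entry.
same_object_found = lookup_succeeds(matches, stored_key)

# Different object: NaN != NaN, so the lookup raises KeyError.
different_object_found = lookup_succeeds(matches, parsed_again)
```

If this is what happens at line 721, any fix needs to either keep rows with missing e-values out of the lookup or key them on something other than the NaN itself.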

Problem Solution

Temporary Solution

Python script to sanitize PrositePatterns entries in InterProScan TSVs: https://gist.github.com/LeeBergstrand/d429041fa50698fec5a83ddb2a295ed0
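
For readers who cannot open the gist, here is a minimal sketch of one way such a sanitizer could work. It is not the linked script. It assumes the standard InterProScan TSV layout, in which the ninth tab-separated column holds the score (e-value), and it simply drops rows where that column is blank or a lone `-`. The function name and the column constant are hypothetical.

```python
# Assumption: the 9th tab-separated column (index 8) is the score / e-value.
E_VALUE_COLUMN = 8

# Assumption: values treated as "no e-value" in that column.
BLANK_MARKERS = ("", "-")


def drop_rows_without_e_value(lines):
    """Return only the TSV lines that carry a usable e-value."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= E_VALUE_COLUMN:
            continue  # row is too short to have an e-value column
        if fields[E_VALUE_COLUMN].strip() in BLANK_MARKERS:
            continue  # blank e-value, e.g. a pattern-only match
        kept.append(line)
    return kept


example_rows = [
    "P1\tmd5\t100\tPfam\tPF00001\tdesc\t1\t50\t1.2E-5\tT\t01-01-2020\n",
    "P1\tmd5\t100\tProSitePatterns\tPS00001\tdesc\t1\t50\t\tT\t01-01-2020\n",
]
cleaned_rows = drop_rows_without_e_value(example_rows)
```

Dropping the rows is the simplest choice, but it discards the pattern matches entirely. Substituting a placeholder e-value would keep them, at the cost of storing a number that InterProScan never reported.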

Long Term Solution

TODO - Edit Pygenprop to sanitize TSVs internally.

LeeBergstrand commented 3 years ago

Fixed by https://github.com/Micromeda/pygenprop/pull/75