Micromeda / pygenprop

A python library for programmatic usage of EBI InterPro Genome Properties.
http://pygenprop.rtfd.io/
Apache License 2.0

Pygenprop no longer compatible with the latest InterProScan TSV files #73

Closed LeeBergstrand closed 3 years ago

LeeBergstrand commented 3 years ago

Problem Description

Previous versions of InterProScan generated a regular TSV file in which null values were left blank. However, as of InterProScan version 5.48-83.0, these blank values have been replaced by -.

Before

| | protein_accession | MD5 | Sequence | Analysis | sig_accession | description | Start | Stop | Score | Status | Date | ipr_accession | ipr_annotations | GO | Pathways |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1297 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | ProSitePatterns | PS01099 | Respiratory-chain NADH dehydrogenase 24 Kd sub... | 112 | 130 | | T | 16-01-2021 | IPR002023 | NADH-quinone oxidoreductase subunit E-like | GO:0016491\|GO:0055114 | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |
| 1298 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Gene3D | G3DSA:3.40.30.10 | Glutaredoxin | 73 | 165 | 7.300000e-27 | T | 16-01-2021 | IPR036249 | Thioredoxin-like superfamily | | KEGG: 00053\|KEGG: 00073\|KEGG: 00190\|KEGG: 0027... |
| 1299 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | TIGRFAM | TIGR01958 | nuoE_fam: NADH-quinone oxidoreductase, E subunit | 11 | 154 | 1.500000e-53 | T | 16-01-2021 | IPR002023 | NADH-quinone oxidoreductase subunit E-like | GO:0016491\|GO:0055114 | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |
| 1300 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Pfam | PF01257 | Thioredoxin-like [2Fe-2S] ferredoxin | 12 | 154 | 7.100000e-54 | T | 16-01-2021 | | | | |
| 1301 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Gene3D | G3DSA:1.10.10.1590 | | 1 | 72 | 2.700000e-25 | T | 16-01-2021 | IPR041921 | NADH-quinone oxidoreductase subunit E, N-terminal | | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |

After

| | protein_accession | MD5 | Sequence | Analysis | sig_accession | description | Start | Stop | Score | Status | Date | ipr_accession | ipr_annotations | GO | Pathways |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1297 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | ProSitePatterns | PS01099 | Respiratory-chain NADH dehydrogenase 24 Kd sub... | 112 | 130 | - | T | 16-01-2021 | IPR002023 | NADH-quinone oxidoreductase subunit E-like | GO:0016491\|GO:0055114 | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |
| 1298 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Gene3D | G3DSA:3.40.30.10 | Glutaredoxin | 73 | 165 | 7.300000e-27 | T | 16-01-2021 | IPR036249 | Thioredoxin-like superfamily | - | KEGG: 00053\|KEGG: 00073\|KEGG: 00190\|KEGG: 0027... |
| 1299 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | TIGRFAM | TIGR01958 | nuoE_fam: NADH-quinone oxidoreductase, E subunit | 11 | 154 | 1.500000e-53 | T | 16-01-2021 | IPR002023 | NADH-quinone oxidoreductase subunit E-like | GO:0016491\|GO:0055114 | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |
| 1300 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Pfam | PF01257 | Thioredoxin-like [2Fe-2S] ferredoxin | 12 | 154 | 7.100000e-54 | T | 16-01-2021 | - | - | - | - |
| 1301 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Gene3D | G3DSA:1.10.10.1590 | - | 1 | 72 | 2.700000e-25 | T | 16-01-2021 | IPR041921 | NADH-quinone oxidoreductase subunit E, N-terminal | - | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |

From the InterProScan docs: "If a value is missing in a column, for example, the match has no InterPro annotation, a ‘-‘ is displayed."

Errors

If the e-value column of a row contains no value, parsing fails:

sqlalchemy.exc.StatementError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(builtins.ValueError) could not convert string to float: '-'
[SQL: INSERT INTO interproscan_matches (sequence_identifier, interpro_signature, expected_value) VALUES (?, ?, ?)]
[parameters: [{'sequence_identifier': 'CABVRV010000017.1_10', 'interpro_signature': 'PS01099', 'expected_value': '-'}]]
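
The ValueError in the traceback comes from Python's built-in float() being handed the placeholder string '-'. As a minimal sketch of a tolerant conversion (the function name parse_e_value is hypothetical and is not part of Pygenprop's API), a missing score could be mapped to None before it reaches the database layer:

```python
from typing import Optional


def parse_e_value(raw: str) -> Optional[float]:
    """Convert a TSV score field to a float.

    Hypothetical helper for illustration only. Blank fields (older
    InterProScan output) and '-' placeholders (newer output) are both
    treated as missing and returned as None.
    """
    text = raw.strip()
    if text in ("", "-"):
        return None
    # Real scores such as '7.300000e-27' contain a hyphen but are valid floats.
    return float(text)
```

Returning None rather than a sentinel number leaves it to the caller to decide how a missing e-value is stored.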

Problem Solution

The - placeholder must be removed before the TSV file's underlying data is integrated into Pygenprop. TSV values that consist only of - should be changed to an empty value. No other - should be removed.
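
The rule above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the code from the linked gist or from Pygenprop itself: the function names sanitize_line and sanitize_lines are hypothetical, and the input is assumed to be tab-delimited with one match per line.

```python
from typing import Iterable, Iterator


def sanitize_line(line: str) -> str:
    """Blank out fields that consist solely of '-' in one TSV line.

    Hypothetical helper: hyphens inside larger values (dates,
    descriptions, scientific notation) are left untouched.
    """
    body = line.rstrip("\r\n")
    ending = line[len(body):]  # preserve the original line ending
    fields = ["" if field == "-" else field for field in body.split("\t")]
    return "\t".join(fields) + ending


def sanitize_lines(lines: Iterable[str]) -> Iterator[str]:
    """Apply sanitize_line lazily to every line, e.g. from an open file."""
    for line in lines:
        yield sanitize_line(line)
```

Because only whole-field matches are replaced, a value such as 16-01-2021 or Thioredoxin-like passes through unchanged. A file could be processed by iterating over an open input handle and writing each yielded line to an output handle.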

Temporary Solution

Python script to sanitize newer TSV: https://gist.github.com/LeeBergstrand/d429041fa50698fec5a83ddb2a295ed0

Long Term Solution

TODO - Edit Pygenprop to sanitize TSVs internally.

LeeBergstrand commented 3 years ago

Fixed by https://github.com/Micromeda/pygenprop/pull/75