Problem Description
Previous versions of InterProScan generated a regular TSV file in which null values were left blank. However, as of InterProScan 5.48-83.0, the TSV output replaces these blank values with `-`.
Before

| | protein_accession | MD5 | Sequence | Analysis | sig_accession | description | Start | Stop | Score | Status | Date | ipr_accession | ipr_annotations | GO | Pathways |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1297 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | ProSitePatterns | PS01099 | Respiratory-chain NADH dehydrogenase 24 Kd sub... | 112 | 130 | | T | 16-01-2021 | IPR002023 | NADH-quinone oxidoreductase subunit E-like | GO:0016491\|GO:0055114 | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |
| 1298 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Gene3D | G3DSA:3.40.30.10 | Glutaredoxin | 73 | 165 | 7.300000e-27 | T | 16-01-2021 | IPR036249 | Thioredoxin-like superfamily | | KEGG: 00053\|KEGG: 00073\|KEGG: 00190\|KEGG: 0027... |
| 1299 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | TIGRFAM | TIGR01958 | nuoE_fam: NADH-quinone oxidoreductase, E subunit | 11 | 154 | 1.500000e-53 | T | 16-01-2021 | IPR002023 | NADH-quinone oxidoreductase subunit E-like | GO:0016491\|GO:0055114 | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |
| 1300 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Pfam | PF01257 | Thioredoxin-like [2Fe-2S] ferredoxin | 12 | 154 | 7.100000e-54 | T | 16-01-2021 | | | | |
| 1301 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Gene3D | G3DSA:1.10.10.1590 | | 1 | 72 | 2.700000e-25 | T | 16-01-2021 | IPR041921 | NADH-quinone oxidoreductase subunit E, N-terminal | | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |

After

| | protein_accession | MD5 | Sequence | Analysis | sig_accession | description | Start | Stop | Score | Status | Date | ipr_accession | ipr_annotations | GO | Pathways |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1297 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | ProSitePatterns | PS01099 | Respiratory-chain NADH dehydrogenase 24 Kd sub... | 112 | 130 | - | T | 16-01-2021 | IPR002023 | NADH-quinone oxidoreductase subunit E-like | GO:0016491\|GO:0055114 | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |
| 1298 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Gene3D | G3DSA:3.40.30.10 | Glutaredoxin | 73 | 165 | 7.300000e-27 | T | 16-01-2021 | IPR036249 | Thioredoxin-like superfamily | - | KEGG: 00053\|KEGG: 00073\|KEGG: 00190\|KEGG: 0027... |
| 1299 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | TIGRFAM | TIGR01958 | nuoE_fam: NADH-quinone oxidoreductase, E subunit | 11 | 154 | 1.500000e-53 | T | 16-01-2021 | IPR002023 | NADH-quinone oxidoreductase subunit E-like | GO:0016491\|GO:0055114 | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |
| 1300 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Pfam | PF01257 | Thioredoxin-like [2Fe-2S] ferredoxin | 12 | 154 | 7.100000e-54 | T | 16-01-2021 | - | - | - | - |
| 1301 | CABVRV010000017.1_10 | 2998ec28820954ffa7eae7fc4847da77 | 181 | Gene3D | G3DSA:1.10.10.1590 | - | 1 | 72 | 2.700000e-25 | T | 16-01-2021 | IPR041921 | NADH-quinone oxidoreductase subunit E, N-terminal | - | KEGG: 00190\|KEGG: 01100\|MetaCyc: PWY-3781\|Meta... |

From the InterProScan docs: "If a value is missing in a column, for example, the match has no InterPro annotation, a ‘-‘ is displayed."
Errors
If an e-value column has no value (and therefore contains `-`), parsing fails:
sqlalchemy.exc.StatementError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(builtins.ValueError) could not convert string to float: '-'
[SQL: INSERT INTO interproscan_matches (sequence_identifier, interpro_signature, expected_value) VALUES (?, ?, ?)]
[parameters: [{'sequence_identifier': 'CABVRV010000017.1_10', 'interpro_signature': 'PS01099', 'expected_value': '-'}]]
Problem Solution
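The `-` placeholders must be removed before the TSV file's underlying data is integrated into Pygenprop. TSV values that contain only `-` should be changed to an empty value. No other `-` should be removed; for example, the hyphens in `Thioredoxin-like` or `7.300000e-27` must stay.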
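The rule above could be sketched roughly as follows. This is only an illustration, not the linked gist and not Pygenprop code: `sanitize_line` and `sanitize_tsv` are hypothetical names, and the sketch assumes plain tab-separated lines with no quoting.

```python
MISSING = "-"  # placeholder that newer InterProScan versions write for a missing value


def sanitize_line(line):
    """Blank out fields that consist solely of the placeholder.

    Hyphens inside longer values (descriptions, e-values) are left untouched
    because only an exact match on the whole field is replaced.
    """
    fields = line.rstrip("\r\n").split("\t")
    return "\t".join("" if field == MISSING else field for field in fields)


def sanitize_tsv(in_handle, out_handle):
    """Copy a TSV from in_handle to out_handle, sanitizing each line."""
    for line in in_handle:
        out_handle.write(sanitize_line(line) + "\n")
```

It could be used by opening the InterProScan output and a new file, then calling `sanitize_tsv(source, destination)` before handing the sanitized file to Pygenprop.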
Temporary Solution
Python script to sanitize newer TSV files: https://gist.github.com/LeeBergstrand/d429041fa50698fec5a83ddb2a295ed0
Long Term Solution
TODO - Edit Pygenprop to sanitize TSVs internally.
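One possible shape for the internal fix is to treat the placeholder as missing at the point where the score is converted, since the `ValueError` above comes from calling `float('-')`. A minimal sketch, assuming a missing e-value may be stored as `None`; `parse_e_value` is a hypothetical helper, not an existing Pygenprop function:

```python
from typing import Optional

# Values treated as "no score": blank (older TSVs) or "-" (newer TSVs).
MISSING_VALUES = {"", "-"}


def parse_e_value(raw: str) -> Optional[float]:
    """Convert a TSV score field to a float, or None when it is missing."""
    value = raw.strip()
    if value in MISSING_VALUES:
        return None
    return float(value)
```

Handling both the blank and the `-` form in one place would let the parser accept TSVs from older and newer InterProScan versions alike.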