ebi-pf-team / genome-properties

GNU General Public License v3.0

Recreating InterProScan output #52

Open jjkoehorst opened 5 years ago

jjkoehorst commented 5 years ago

Since we store the InterProScan output directly in an RDF database, what would be the minimal requirements for creating the TSV input file? I played with the test set, and it looks like the following is probably the minimal information. Are the length, scores, etc. not used?

id1 id11            PF00166                         IPR020818       GO:0005737|GO:0006457   
id2 id22            TIGR02348                           IPR001844       GO:0005737|GO:0042026   Reactome: R-HSA-1268020|Reactome: R-HSA-8869496

LeeBergstrand commented 5 years ago

@jjkoehorst I believe you need either just the Signature Accession column (#5, e.g. PF09103 / G3DSA:2.40.50.140), or both it and the Protein Accession column (#1, e.g. P51587). It doesn't use the InterPro annotation accession column (#12).

See https://github.com/ebi-pf-team/interproscan/wiki/OutputFormats
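As a rough sketch of what such a minimal file could look like, the Python below writes rows in which only the Protein Accession (#1) and Signature Accession (#5) columns are filled and every other column is left empty. It assumes the 15-column TSV layout described on the wiki page above, and it assumes the parser tolerates empty fields, which has not been confirmed here. The names `minimal_row` and `write_minimal_tsv` are hypothetical helpers, not part of InterProScan or genome-properties.

```python
import csv
import io

# Assumption: the InterProScan TSV layout has 15 columns, with the
# protein accession in column 1 and the signature accession in column 5.
N_COLUMNS = 15
PROTEIN_COL = 0    # column #1, zero-based
SIGNATURE_COL = 4  # column #5, zero-based


def minimal_row(protein_acc, signature_acc):
    """Build one TSV row with only the two assumed-required columns filled."""
    row = [""] * N_COLUMNS
    row[PROTEIN_COL] = protein_acc
    row[SIGNATURE_COL] = signature_acc
    return row


def write_minimal_tsv(matches, handle):
    """Write (protein accession, signature accession) pairs as TSV rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for protein_acc, signature_acc in matches:
        writer.writerow(minimal_row(protein_acc, signature_acc))


# Example using the identifiers from the sample rows in this thread.
matches = [("id1", "PF00166"), ("id2", "TIGR02348")]
buffer = io.StringIO()
write_minimal_tsv(matches, buffer)
tsv_text = buffer.getvalue()
```

In practice the pairs would come from the RDF store rather than a hard-coded list, and `handle` would be an open file instead of an in-memory buffer.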

@rdfinn @happy-lorna Do InterPro member database matches need to be on the same protein?

I'm having some trouble going through your code.

Just wanted to confirm: must the Signature Accessions (e.g. PF09103 / G3DSA:2.40.50.140) used as step evidence be on the same protein?

Or can you just dedupe the Signature Accession column to a unique set and feed that in?

LornaMGnify commented 5 years ago

"Do InterPro member database matches need to be on the same protein?" I assume you mean member database matches listed as non-sufficient evidence for a single step? I am fairly sure (@rdfinn will correct me if I am mistaken) that no, they do not need to be on the same protein. So in principle, yes, you could just feed in the unique set of signature accession matches.
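
A minimal sketch of that deduplication, assuming a tab-separated InterProScan file with the signature accession in column #5: it collects the unique set of signature accessions regardless of which protein they matched. The function name `unique_signature_accessions` is hypothetical and not part of the genome-properties code.

```python
import io


def unique_signature_accessions(handle, column=4):
    """Collect unique signature accessions from a tab-separated handle.

    Assumption: `column` is the zero-based index of the Signature
    Accession field (column #5 in the InterProScan TSV layout).
    """
    accessions = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Skip short or blank lines that lack a signature accession.
        if len(fields) > column and fields[column].strip():
            accessions.add(fields[column].strip())
    return accessions


# Two proteins share PF00166; the result keeps it only once.
sample = io.StringIO(
    "id1\tmd5a\t\t\tPF00166\n"
    "id2\tmd5b\t\t\tTIGR02348\n"
    "id3\tmd5c\t\t\tPF00166\n"
)
signatures = unique_signature_accessions(sample)
```

If it turns out that some evidence does need to be on the same protein, the protein accession column would have to be kept alongside each signature instead of being discarded.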