Open jjkoehorst opened 5 years ago
@jjkoehorst I believe you need either just the Signature Accession column (#5, e.g. PF09103 / G3DSA:2.40.50.140), or both that and the Protein Accession column (#1, e.g. P51587). It doesn't use the InterPro annotations accession column (#12).
See https://github.com/ebi-pf-team/interproscan/wiki/OutputFormats
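In case it helps, here is a minimal sketch of pulling out just those two columns. It assumes the standard tab-separated InterProScan output with no header row; the function name `extract_pairs` and the sample rows (every field other than the accessions is a `-` placeholder) are made up for illustration and are not part of any tool discussed here.

```python
import csv
import io

PROTEIN_COL = 0    # column #1: Protein Accession
SIGNATURE_COL = 4  # column #5: Signature Accession


def extract_pairs(tsv_text):
    """Return (protein accession, signature accession) pairs, one per match row."""
    # QUOTE_NONE so quote characters inside description fields are kept literally.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) <= SIGNATURE_COL:
            continue  # skip blank or truncated lines
        pairs.append((row[PROTEIN_COL], row[SIGNATURE_COL]))
    return pairs


def make_row(protein, analysis, signature):
    """Build an illustrative 13-column row; unused fields are '-' placeholders."""
    fields = [protein, "-", "-", analysis, signature] + ["-"] * 8
    return "\t".join(fields)


# Illustrative input only, built from the example accessions in this thread.
SAMPLE = "\n".join([
    make_row("P51587", "Pfam", "PF09103"),
    make_row("P51587", "Gene3D", "G3DSA:2.40.50.140"),
])

for protein, signature in extract_pairs(SAMPLE):
    print(protein, signature, sep="\t")
```

Dropping the first element of each pair leaves only the Signature Accession column, if that turns out to be all that is needed.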
@rdfinn @happy-lorna Do InterPro member database matches need to be on the same protein?
I'm having some trouble going through your code.
Just wanted to confirm: must the Signature Accessions (e.g. PF09103 / G3DSA:2.40.50.140) used as step evidence be on the same protein?
Or can you just dedupe the Signature Accession column to a unique set and feed that in?
"Do InterPro member database matches need to be on the same protein?" I assume you mean member database matches listed as non-sufficient evidence for a single step? I am fairly sure (@rdfinn will correct me if I am mistaken) that no, they do not need to be on the same protein, so in principle, yes, you could just feed in the unique set of signature accession matches.
Since we store the InterProScan output directly in an RDF database, what would be the minimal requirements for creating the TSV input file? I played with the test set and noticed that this should probably be the minimal information? The length, scores, etc. are not used?
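For the dedupe idea discussed earlier in the thread, a rough sketch of collapsing column #5 to a unique set could look like the following. It assumes matches do not need to be on the same protein (as suggested above, pending confirmation) and that the input lines are tab-separated with the Signature Accession in the fifth field. The helper name `unique_signatures` and the `protein_A` / `protein_B` rows are hypothetical.

```python
def unique_signatures(lines):
    """Collect the distinct Signature Accessions (column #5) across all proteins."""
    seen = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Ignore lines that are too short or have an empty signature field.
        if len(fields) >= 5 and fields[4]:
            seen.add(fields[4])
    return sorted(seen)


# Hypothetical rows: two proteins sharing one signature; '-' marks unused fields.
rows = [
    "protein_A\t-\t-\tPfam\tPF09103",
    "protein_B\t-\t-\tPfam\tPF09103",
    "protein_B\t-\t-\tGene3D\tG3DSA:2.40.50.140",
]

signatures = unique_signatures(rows)
print("\n".join(signatures))
```

Sorting is only there to make the output stable between runs; nothing in this thread suggests the order matters.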