Open saramsey opened 4 years ago
@saramsey Is there currently a spot in KG2 to store ECO codes/evidence codes/confidence scores/etc? I'm assuming it would go in the publications info dictionary, but I don't know what the key would be? (GO -- RTXteam/RTX#838 -- has ECO codes and evidence codes; CTD -- RTXteam/RTX-KG2#39 -- has confidence scores)
I am not sure; I will ask the Data Modeling Group.
OK, I have posted about this on the Translator Slack, in the #datamodeling channel.
Here is Deepak's reply:
Apparently we should use the `evidence` attribute.
In KG2, let's store the GO evidence code in an edge slot `evidence`. I think we should store the GO evidence code as a CURIE ID, like this: `GO.EC:IDA`. In `curies-to-urls-map.yaml`, we can map the CURIE prefix `GO.EC` to the base URL `http://www-legacy.geneontology.org/GO.evidence.shtml#`.
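A minimal sketch of the CURIE scheme proposed above, assuming the prefix-to-URL mapping is available as a plain dict. The helper names `make_evidence_curie` and `expand_curie` and the dict `CURIE_PREFIX_TO_BASE_URL` are hypothetical and are not part of KG2; the dict only stands in for the relevant entry of `curies-to-urls-map.yaml`.

```python
# Hypothetical stand-in for the GO.EC entry proposed for curies-to-urls-map.yaml.
CURIE_PREFIX_TO_BASE_URL = {
    'GO.EC': 'http://www-legacy.geneontology.org/GO.evidence.shtml#',
}


def make_evidence_curie(go_evidence_code: str) -> str:
    # Turn a bare GO evidence code such as "IDA" into the CURIE "GO.EC:IDA".
    return 'GO.EC:' + go_evidence_code.strip()


def expand_curie(curie: str, prefix_map: dict) -> str:
    # Split on the first colon only, so local IDs containing colons survive.
    prefix, local_id = curie.split(':', 1)
    return prefix_map[prefix] + local_id


curie = make_evidence_curie('IDA')
url = expand_curie(curie, CURIE_PREFIX_TO_BASE_URL)
```

Here `curie` is `GO.EC:IDA` and `url` is the base URL with `IDA` appended as the fragment.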
Hi @saramsey, will both GO evidence codes and ECO evidence codes go in this field? If so, should it be a list? Proposed change to kg2_util:

    def make_edge(subject_id: str,
                  object_id: str,
                  relation_curie: str,
                  predicate_label: str,
                  provided_by: str,
                  update_date: str = None):
        return {'subject': subject_id,
                'object': object_id,
                'edge_label': predicate_label,
                'relation': relation_curie,
                'negated': False,
                'publications': [],
                'publications_info': {},
                'update_date': update_date,
                'provided_by': provided_by,
                'evidence': []}

`evidence` can be manipulated within each ETL script as necessary and added to an edge via `edge['evidence'] = some_list`.
Yes, let's go with a list of CURIE IDs. Thank you.
Hi @saramsey, I saw this graphic in the All Things Provenance breakout group:
Does that mean the entry should be `has_evidence` rather than `evidence`?
Good catch. Sure, we can use `has_evidence`. It may make our life simpler (specifically, export to TSV and import into Neo4j and mediKanren) if every KG2 edge dict has a `has_evidence` list, which by default will be empty. Does that make sense?
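To illustrate why an always-present `has_evidence` list could simplify export, here is a rough sketch. It adapts the `make_edge` proposal from earlier in this thread to the new key. The function `edge_to_tsv_row`, the `;` list delimiter, and the `EX:` placeholder identifiers are assumptions made for illustration and do not describe the actual KG2 TSV export.

```python
from typing import Optional


def make_edge(subject_id: str,
              object_id: str,
              relation_curie: str,
              predicate_label: str,
              provided_by: str,
              update_date: Optional[str] = None) -> dict:
    # A fresh empty list is created on every call, so edges never share it.
    return {'subject': subject_id,
            'object': object_id,
            'edge_label': predicate_label,
            'relation': relation_curie,
            'negated': False,
            'publications': [],
            'publications_info': {},
            'update_date': update_date,
            'provided_by': provided_by,
            'has_evidence': []}


def edge_to_tsv_row(edge: dict, columns: list, list_delimiter: str = ';') -> str:
    # Hypothetical exporter: every column is assumed present, so no key checks.
    cells = []
    for column in columns:
        value = edge[column]
        if isinstance(value, list):
            cells.append(list_delimiter.join(value))
        else:
            cells.append('' if value is None else str(value))
    return '\t'.join(cells)


# Placeholder identifiers, not real KG2 data.
edge = make_edge('EX:subject', 'EX:object', 'EX:relation', 'example_predicate', 'EX:source')
edge['has_evidence'] = ['GO.EC:IDA', 'ECO:0000303']
row = edge_to_tsv_row(edge, ['subject', 'object', 'has_evidence'])
```

Because the key always exists, an edge with no evidence simply yields an empty cell rather than needing a special case in the exporter.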
Hi @saramsey, that makes sense. I will commit that change to a branch so that you can review it, if that works for you.
@ericawood thinks that DrugBank has some evidence-code-like fields that could be put in the `has_evidence` property.
We need to audit the KG2 code base to make sure that in every module where we create an edge, we either do so via `kg2_util.make_edge` or add the new edge property `has_evidence` some other way.
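One way such an audit could be backed up at run time is a small check over the edge dicts a module produces. This is only a sketch: `find_edges_missing_has_evidence` and `ensure_has_evidence` are hypothetical helpers, not existing `kg2_util` functions.

```python
def find_edges_missing_has_evidence(edges: list) -> list:
    # Return the indices of edges whose has_evidence is absent or not a list.
    return [index for index, edge in enumerate(edges)
            if not isinstance(edge.get('has_evidence'), list)]


def ensure_has_evidence(edge: dict) -> dict:
    # Add an empty has_evidence list only if the key is missing.
    edge.setdefault('has_evidence', [])
    return edge


# Placeholder edges, not real KG2 data.
edges = [
    {'subject': 'EX:a', 'object': 'EX:b', 'has_evidence': ['GO.EC:IDA']},
    {'subject': 'EX:c', 'object': 'EX:d'},
]
missing_before = find_edges_missing_has_evidence(edges)
for edge in edges:
    ensure_has_evidence(edge)
missing_after = find_edges_missing_has_evidence(edges)
```

In this example only the second edge is reported as missing the property, and after the fix-up pass none are.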
This issue seems like a good one to bring back to the front-burner.
UniprotKB has a bunch of evidence codes too; some are associated with the names (which I moved to the description as part of RTXteam/RTX#1171), and others are associated with gene synonyms (isolated but not used anywhere as a part of RTXteam/RTX#1259)
Example of an evidence code associated with a gene synonym (UniProtKB:Q9Y4F9):

    GN Name=RIPOR2;
    GN Synonyms=C6orf32, DIFF48, FAM65B, KIAA0386,
    GN PL48 {ECO:0000303|PubMed:9055809};

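As a rough illustration of how the ECO codes in such `GN` lines could be pulled out for `has_evidence`, the sketch below uses a regular expression over the braces. The function name `extract_eco_curies` is hypothetical, and the parsing assumes evidence tags are written as `{ECO:...|source, ...}`; it is not how the current UniProtKB ETL handles these lines.

```python
import re

# Matches the contents of each {...} evidence tag.
EVIDENCE_TAG_RE = re.compile(r'\{([^}]*)\}')


def extract_eco_curies(gn_lines: list) -> list:
    # Join the lines first, since a tag may follow a value that wrapped.
    text = ' '.join(gn_lines)
    curies = []
    for tag in EVIDENCE_TAG_RE.findall(text):
        for item in tag.split(','):
            # Keep the ECO code and drop the "|PubMed:..." source part.
            eco = item.split('|', 1)[0].strip()
            if eco.startswith('ECO:') and eco not in curies:
                curies.append(eco)
    return curies


gn_lines = [
    'GN Name=RIPOR2;',
    'GN Synonyms=C6orf32, DIFF48, FAM65B, KIAA0386,',
    'GN PL48 {ECO:0000303|PubMed:9055809};',
]
eco_curies = extract_eco_curies(gn_lines)
```

For the record above, `eco_curies` is `['ECO:0000303']`, which is already in CURIE form and could be appended to an edge's `has_evidence` list as-is.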
Types of evidence codes are documented here
Sources that do NOT appear to have evidence codes (this will be a long, edited comment as I gather more information):
requested by UAB/PMI team