RobokopU24 / ORION

Code that parses datasets from various sources and converts them to load graph databases.
MIT License

Add ability to record "original_object" and "original_subject" to every association #249

Open DnlRKorn opened 2 months ago

DnlRKorn commented 2 months ago

MonarchKG records the following properties on each of its associations.

[image: screenshot of the properties, not preserved]

This should be fairly trivial to do by changing the following lines. https://github.com/RobokopU24/ORION/blob/8d8b643284e70e23c6bb5e2bb48425c9bc949ee4/Common/loader_interface.py#L28-31 becomes:

    def __init__(self, test_mode: bool = False, audit_mode: bool = False, source_data_dir: str = None):
        """Initialize with the option to run in testing mode."""
        self.test_mode: bool = test_mode
        self.audit_mode: bool = audit_mode

and

https://github.com/RobokopU24/ORION/blob/8d8b643284e70e23c6bb5e2bb48425c9bc949ee4/Common/kgx_file_writer.py#L138-144 becomes:

    def write_kgx_edge(self, edge: kgxedge):
        edge_properties = edge.properties
        if self.audit_mode:
            # Copy first so the edge's own properties dict is not mutated.
            edge_properties = dict(edge_properties or {})
            # Record the ids exactly as they are at write time.
            edge_properties["original_object"] = edge.objectid
            edge_properties["original_subject"] = edge.subjectid
        self.write_edge(subject_id=edge.subjectid,
                        object_id=edge.objectid,
                        predicate=edge.predicate,
                        primary_knowledge_source=edge.primary_knowledge_source,
                        aggregator_knowledge_sources=edge.aggregator_knowledge_sources,
                        edge_properties=edge_properties)

EvanDietzMorris commented 1 month ago

Is the idea that the "original" ids are just pre-normalization, or is this something coming from the source upstream?

If the former, it might make sense to add them during the normalization phase. That could easily be incorporated into the NormalizationScheme, which would let us specify in Graph Specs whether we want them or not.
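As a rough illustration of the normalization-phase option, here is a minimal sketch. It assumes edges are plain dicts and that normalization is a simple id lookup. `SchemeSketch`, `record_original_ids`, `normalize_edge`, and the `EX:`/`NORM:` ids are all hypothetical names made up for this example; they are not ORION's actual NormalizationScheme or edge types.

```python
from dataclasses import dataclass


@dataclass
class SchemeSketch:
    """Hypothetical stand-in for a normalization scheme (not ORION's class)."""
    record_original_ids: bool = False  # hypothetical option name


def normalize_edge(edge: dict, id_map: dict, scheme: SchemeSketch) -> dict:
    """Return a normalized copy of `edge`, optionally keeping the pre-normalization ids."""
    normalized = dict(edge)  # leave the caller's edge untouched
    normalized["subject"] = id_map.get(edge["subject"], edge["subject"])
    normalized["object"] = id_map.get(edge["object"], edge["object"])
    if scheme.record_original_ids:
        # At this point the pre-normalization ids are still known for certain.
        normalized["original_subject"] = edge["subject"]
        normalized["original_object"] = edge["object"]
    return normalized


# Made-up ids for illustration only.
id_map = {"EX:1": "NORM:1"}
edge = {"subject": "EX:1", "predicate": "biolink:related_to", "object": "EX:2"}

with_ids = normalize_edge(edge, id_map, SchemeSketch(record_original_ids=True))
without_ids = normalize_edge(edge, id_map, SchemeSketch())
```

Because the flag lives on the scheme object rather than on the writer, the original ids are only ever recorded at the one step where they are actually the pre-normalization values.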

I worry about altering the kgx file writer with a mode-based switch like that, because someone might, for example, call the write_kgx_edge function on edges whose node ids were already normalized without realizing what it would do, creating bogus original ids.
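The concern can be reproduced with a toy example. This is a sketch under stated assumptions: `WriterSketch` is a hypothetical, minimal stand-in for a writer with a mode flag, not ORION's kgx file writer, and the ids are made up.

```python
class WriterSketch:
    """Hypothetical minimal writer with a mode flag (not ORION's writer)."""

    def __init__(self, audit_mode: bool = False):
        self.audit_mode = audit_mode
        self.written = []  # collected in memory instead of written to a file

    def write_edge(self, subject_id: str, object_id: str, properties: dict) -> None:
        props = dict(properties)  # copy so the caller's dict is not mutated
        if self.audit_mode:
            # The writer cannot tell whether these ids are pre- or post-normalization.
            props["original_subject"] = subject_id
            props["original_object"] = object_id
        self.written.append({"subject": subject_id, "object": object_id, **props})


writer = WriterSketch(audit_mode=True)
# Suppose the source ids were "EX:1" and "EX:2", but the edge was normalized
# to "NORM:1" and "NORM:2" before it reached the writer.
writer.write_edge("NORM:1", "NORM:2", {})
bogus = writer.written[0]
```

Here `bogus["original_subject"]` holds the normalized id `NORM:1` rather than the source id, which is the kind of silently wrong value described above.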

We used to have original ids on every edge, and in many cases they can help with quicker troubleshooting, but we removed them when we started saving normalization maps for every run. We could possibly just implement this for every edge again and not worry about a mode or configuration. What do you think, @cbizon?