RobokopU24 / ORION

Code that parses datasets from various sources and converts them to load graph databases.
MIT License

get_latest_source_version() being called for data sources that aren't needed for graph being built. #236

Open DnlRKorn opened 2 months ago

DnlRKorn commented 2 months ago

If you run build_manager on Testing_Baseline (python build_manager.py Testing_Baseline) with a slightly modified version of testing-graph-spec.yml, we see the following output:

2024-07-23 11:49:46,328 - get_latest_source_version(): Retrieving latest source version for CTD...
2024-07-23 11:49:46,518 - get_latest_source_version(): Found latest source version for CTD: June_2024
2024-07-23 11:49:46,623 - get_latest_source_version(): Retrieving latest source version for GtoPdb...
2024-07-23 11:49:47,735 - get_latest_source_version(): Found latest source version for GtoPdb: 2024.2
2024-07-23 11:49:47,736 - build_graph(): Building graph Testing_Baseline. Checking dependencies...
2024-07-23 11:49:47,738 - build_graph(): Building graph Testing_Baseline. Dependencies are ready...

We see that GtoPdb gets its latest source version established despite not being used for the Testing_Baseline graph_spec. Digging deeper, here is a traceback from when get_latest_source_version() is called on GtoPdb:

    graph_builder = GraphBuilder()
  File "/home/dkorn/BUILD_COMPARE/ORION/build_manager.py", line 41, in __init__
    self.graph_specs = self.load_graph_specs()  # list of graphs to build (GraphSpec objects)
  File "/home/dkorn/BUILD_COMPARE/ORION/build_manager.py", line 314, in load_graph_specs
    return self.parse_graph_spec(graph_spec_yaml)
  File "/home/dkorn/BUILD_COMPARE/ORION/build_manager.py", line 339, in parse_graph_spec
    data_sources = [self.parse_data_source_spec(data_source) for data_source in graph_yaml['sources']] \
  File "/home/dkorn/BUILD_COMPARE/ORION/build_manager.py", line 426, in parse_data_source_spec
    else self.source_data_manager.get_latest_source_version(source_id)
  File "/home/dkorn/BUILD_COMPARE/ORION/Common/load_manager.py", line 129, in get_latest_source_version
    if source_id in self.latest_source_version_lookup:

If a graph_spec is in the YAML file read by build_manager.py, it will be parsed (self.graph_specs = self.load_graph_specs(), then return self.parse_graph_spec(graph_spec_yaml)). The potential issue is that get_latest_source_version() can be called (else self.source_data_manager.get_latest_source_version(source_id)) even when it isn't needed for the specific graph_id being built. Making this call is probably out of scope for the parser, as it often requires downloading the current version of the file.
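One possible direction, sketched below, is to defer the version lookup until a source's version is actually read, so that parsing every graph in the spec stays cheap. This is only an illustration and not ORION's actual code: LazySourceSpec, parse_graph_specs, fake_latest_version, and the 'graphs', 'graph_id', 'source_id', and 'source_version' keys are all assumptions made for the example. Only the 'sources' key and the get_latest_source_version() call appear in the traceback above.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LazySourceSpec:
    """Hypothetical stand-in for a parsed data source entry (not an ORION class)."""
    source_id: str
    pinned_version: Optional[str]
    # Callable that performs the possibly expensive latest-version lookup.
    version_lookup: Callable[[str], str]
    _resolved: Optional[str] = field(default=None, init=False, repr=False)

    @property
    def source_version(self) -> str:
        # Resolve on first access only, then reuse the cached value.
        if self._resolved is None:
            self._resolved = self.pinned_version or self.version_lookup(self.source_id)
        return self._resolved


def parse_graph_specs(graph_spec_yaml: dict, version_lookup: Callable[[str], str]) -> dict:
    """Parse every graph in the spec without triggering any version lookups."""
    return {
        graph['graph_id']: [
            LazySourceSpec(s['source_id'], s.get('source_version'), version_lookup)
            for s in graph.get('sources', [])
        ]
        for graph in graph_spec_yaml['graphs']
    }


# Record which sources were actually looked up (stands in for a download).
lookups = []


def fake_latest_version(source_id: str) -> str:
    lookups.append(source_id)
    return f"latest-{source_id}"


# Assumed spec layout: two graphs, only one of which is being built.
spec = {
    'graphs': [
        {'graph_id': 'Testing_Baseline', 'sources': [{'source_id': 'CTD'}]},
        {'graph_id': 'Other_Graph', 'sources': [{'source_id': 'GtoPdb'}]},
    ]
}

graphs = parse_graph_specs(spec, fake_latest_version)
# Only the sources of the graph being built have their versions resolved.
versions = [s.source_version for s in graphs['Testing_Baseline']]
```

In this sketch GtoPdb is parsed but never looked up, because nothing reads its source_version. An alternative with the same effect would be to filter the spec down to the requested graph_id before parsing any sources.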

EvanDietzMorris commented 2 months ago

This is point number 3 of #227. It's definitely nasty and would be easy to fix, but I haven't gotten around to it.