RobokopU24 / ORION

Code that parses datasets from various sources and converts them to load graph databases.
MIT License

Add beginning of RDF file writer. #165

Open balhoff opened 1 year ago

balhoff commented 1 year ago

I took a stab at implementing an RDF file writer (just for edges, not nodes, at the moment; I don't think we want duplicate node metadata in different RDF datasets). @EvanDietzMorris I have not actually run this. Could you let me know if I'm on the right track, and what else needs to be done to output some Turtle files in the ORION build?

EvanDietzMorris commented 1 year ago

I'm not sure I understand the issue with nodes; we may want to chat about that. This looks like it's on the right track, though.

I think the fastest/cleanest way would be to add a condition for RDF here: https://github.com/RobokopU24/ORION/blob/81b2988c2f3a1174d461ec908f4e17efa76d81c5/Common/build_manager.py#L108

It could read the JSONL nodes and edges files that the previous merging step produced as a completed graph (graph_output_dir/NODES_FILENAME and EDGES_FILENAME) and write them out as RDF. It might be nice to just make a file conversion helper, like the one kgx_file_converter.py has for JSONL to CSV.
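To make the idea concrete, here is a rough sketch of what an edges-only conversion helper could look like, using only the standard library. It is not the code in this PR or anything that exists in ORION: `PREFIXES`, `expand_curie`, and `edges_jsonl_to_ntriples` are hypothetical names, the prefix map is a tiny illustrative subset, and it assumes each KGX edge line carries `subject`, `predicate`, and `object` CURIEs. Other edge properties are dropped. It writes N-Triples, which is a subset of Turtle, so the output should also parse as Turtle.

```python
import io
import json

# Hypothetical, deliberately tiny prefix map (an assumption for illustration);
# a real build would need a complete CURIE-to-IRI mapping.
PREFIXES = {
    "biolink": "https://w3id.org/biolink/vocab/",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand a CURIE such as 'MONDO:0005148' into a full IRI."""
    prefix, _, local_id = curie.partition(":")
    if prefix not in prefixes:
        raise ValueError(f"Unknown CURIE prefix: {prefix!r}")
    return prefixes[prefix] + local_id


def edges_jsonl_to_ntriples(jsonl_in, nt_out):
    """Write one N-Triples line per KGX edge and return the triple count.

    Only subject, predicate, and object are written; every other edge
    property is ignored in this sketch.
    """
    count = 0
    for line in jsonl_in:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        edge = json.loads(line)
        s = expand_curie(edge["subject"])
        p = expand_curie(edge["predicate"])
        o = expand_curie(edge["object"])
        nt_out.write(f"<{s}> <{p}> <{o}> .\n")
        count += 1
    return count
```

Both arguments are file-like objects, so the helper could be pointed at the merged edges file and an output file opened with `open(...)`, or at `io.StringIO` buffers for testing.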

Then we could specify RDF as the output format for a graph, like here: https://github.com/RobokopU24/ORION/blob/81b2988c2f3a1174d461ec908f4e17efa76d81c5/graph_specs/default-graph-spec.yml#L15
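Roughly, the condition could boil down to a dispatch on the requested output format once the merged JSONL files exist. The sketch below is only an illustration of that shape, not the actual build_manager.py logic: `convert_graph_output`, the `converters` mapping, and the format names (including treating `jsonl` as "nothing more to do") are all assumptions.

```python
def convert_graph_output(graph_output_dir, output_format, converters):
    """Run the converter registered for output_format, if one is needed.

    `converters` maps a lowercase format name to a callable that takes the
    graph output directory. All names here are hypothetical, not ORION's API.
    """
    fmt = output_format.strip().lower()
    if fmt == "jsonl":
        # Assumption: the merge step already wrote JSONL, so no conversion.
        return None
    if fmt not in converters:
        raise ValueError(f"Unsupported output format: {output_format!r}")
    return converters[fmt](graph_output_dir)
```

With this shape, supporting RDF would mean registering one more entry, for example `{"rdf": some_rdf_converter}`, where `some_rdf_converter` is whatever helper ends up doing the JSONL to RDF conversion.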

This approach has a downside: if RDF is the only output you care about, it still merges the sources and writes them to KGX JSONL files first, for no great reason. We could also incorporate the RDF output further upstream to avoid that, but I haven't had time to think about how we might want to do that.