RobokopU24 / ORION

Code that parses datasets from various sources and converts them to load graph databases.
MIT License
12 stars 13 forks source link

DrugMechDB parsing of full database #219

Closed beasleyjonm closed 5 months ago

beasleyjonm commented 6 months ago

"Updated parser to consolidate all relevant drugmechdb path ids into list on unique edges."

EvanDietzMorris commented 6 months ago

Aside from the concerns we discussed yesterday (manual node mappings/normalization should not be part of parsers), we can use this as is, but there are some efficiency issues which will make this slower and use more memory than necessary:

  1. The predicate and node mapping files could just be dictionaries that are imported instead of reading them as files.

  2. It's writing an intermediate csv file and then extracting from it, which is slower and takes up more storage space than writing directly to the output files.

  3. Unless I'm missing something, there is no need to store everything in memory, we can just iterate through the graphs in the source, convert the ids and predicates, and write the nodes and edges directly to file. As far as I can tell, the only part where collecting them in memory first helps is accumulating the drugmech ids on matching edges, but matching edges will be merged and properties that are lists will be concatenated during the normalization phase, so writing out multiple edges that will later be merged is no problem, represents the original source more accurately, and allows us to track how many mergers occurred.

  4. From my perspective the usage of pandas is unnecessary.

Let's go ahead and merge this so we can use it right now but let's refactor it soon if you agree.

EvanDietzMorris commented 6 months ago

I'd add that this source is very small so these things are not really a huge deal, I'm mainly being picky because I think these are concepts we want to adopt for all parsers.

beasleyjonm commented 6 months ago

I agree with all of these points. I was writing the "target_for" edges to a CSV because I was giving these Drug,Target,Disease triples to Elvin as a training/test dataset for his 2023 summer project, but this could be replaced in the future since the same dataset could just be created as a KG query now. Also the reason I did what you mention in point #3 is because normalization was overwriting the duplicate edges with unique DrugMechDB IDs instead of merging the separate IDs in a list on a single edge. I'm not sure why this was the case. Is there a parameter in "merge strategy" in the graph spec that needs to be specified?

EvanDietzMorris commented 6 months ago

Did you have them as lists or single value properties? The merging code by default concatenates lists but it doesn't turn single value properties into lists unless they are special cases. If they were lists and weren't getting merged correctly, either the edges weren't considered the same (the criteria is subject-predicate-object-primary_knowledge_source) or it's possibly a bug. Note that there is also a way to specify that a property that is a list should be a set, but unless there are multiple graphs with the same drugmech id I don't think we should have duplicates here.

beasleyjonm commented 6 months ago

Ahhh they were just single values. That definitely makes sense, so this could be easily changed!