frostyfan109 / tranql

A Translator Query Language
https://researchsoftwareinstitute.github.io/data-translator/apps/tranql
MIT License

Memory Leak #113

Closed frostyfan109 closed 5 years ago

frostyfan109 commented 5 years ago

Currently, running the following query (Workflow 5 v3) causes what appears to be a memory leak. It is likely caused by a recent change to the ast.

SELECT population_of_individual_organisms->chemical_substance->gene->biological_process_or_activity<-phenotypic_feature
  FROM "/schema"
 WHERE icees.table = 'patient'
   AND icees.year = 2010
   AND icees.cohort_features.AgeStudyStart = '0-2'
   AND icees.feature.EstResidentialDensity < 1
   AND icees.maximum_p_value = 1
   AND chemical_substance !=~ '^(SCTID.*|rxcui.*|CAS.*|SMILES.*|umlscui.*)$'
frostyfan109 commented 5 years ago

I poked around a little, and it appears that the deep_merge method, specifically when merging edges, is the culprit, but it isn't actually a memory leak.

I believe there is likely a self-referencing array or dict somewhere. It shouldn't be possible for a self-referencing array to exist unless the ast itself is creating it rather than the reasoner.

Since deep_merge attempts to join the items within both elements' respective arrays, it recurses infinitely on such a structure, or at least until it hits the maximum recursion depth, which would consume a lot of memory. See the sketch below.
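A minimal sketch (not TranQL's actual deep_merge implementation) of a recursive merge that joins the items of both arrays, to illustrate why a self-referencing structure would recurse until Python's recursion limit is hit:

def deep_merge(a, b):
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            merged[key] = deep_merge(merged[key], value) if key in merged else value
        return merged
    if isinstance(a, list) and isinstance(b, list):
        # Merging items pairwise is where a list that (directly or
        # indirectly) contains itself never bottoms out.
        return [deep_merge(x, y) for x, y in zip(a, b)] + a[len(b):] + b[len(a):]
    return b

# A list that contains itself would recurse forever:
# cyclic = []
# cyclic.append(cyclic)
# deep_merge(cyclic, cyclic)  # RecursionError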

frostyfan109 commented 5 years ago

I now believe this is not actually the problem. deep_merge is simply merging the edge_attributes of many, many duplicate nodes, and since edge attributes from ICEES contain things like rows, an array of dicts, every node's properties are retained. This quickly gets extremely large.

One option could be to merge only top-level arrays, or to merge at the top level only (rough sketch below).
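A hedged sketch of that suggestion (a hypothetical helper, not the actual TranQL code): merge only at the top level, so nested structures such as the rows inside edge_attributes are not recursively combined.

def shallow_merge(a: dict, b: dict) -> dict:
    merged = dict(a)
    for key, value in b.items():
        if key in merged and isinstance(merged[key], list) and isinstance(value, list):
            # Top-level arrays are concatenated once; their items are
            # not merged into one another.
            merged[key] = merged[key] + value
        else:
            merged[key] = value
    return merged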

Edit: For clarity, it gets very large very quickly because the two edges are merged into each other. The massive amount of edge_attributes already merged into the winning node is first merged into the losing node, and then merged back into the winning node, which already has them. The time complexity of the algorithm was ridiculous.
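A toy illustration (not TranQL code) of why merging in both directions blows up: if each duplicate merge concatenates the accumulated rows into the incoming edge and then back into the winner, the winner's list roughly doubles per duplicate instead of growing by one entry.

winner_rows = [{"row": 0}]
for i in range(1, 6):
    loser_rows = [{"row": i}]
    loser_rows = loser_rows + winner_rows   # winner merged into loser first
    winner_rows = winner_rows + loser_rows  # then loser merged back into winner
    print(len(winner_rows))                 # 3, 7, 15, 31, 63 -> exponential growth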