RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

Speed enhancement: Make Removing general concepts from query graph faster? #2388

Closed edeutsch closed 1 month ago

edeutsch commented 1 month ago

Based on profiling tests in #2372 and #2373, it seems that the step that removes general concepts from the knowledge graph seems to take much longer than seems naively expected. In #2373, it look 2.6 seconds. I'm thinking it should take <0.1 seconds, so I wonder if there is some easily addressable gross inefficiency that could be remedied?

Anyone willing to put this up on the lift and isolate where it is taking the most time and computation?

dkoslicki commented 1 month ago

Is this the same as the blocklist filter? If so, I think @kvnthomas98 could take a look at speeding it up

edeutsch commented 1 month ago

I don't actually know what it does exactly. I'm just thinking it seems very slow compared to other parts of the system for what I imagine it might be doing.

amykglen commented 1 month ago

I'm pretty sure it is indeed the blocklist filter step

dkoslicki commented 1 month ago

@kvnthomas98 Do you mind adding this to the Translator GH project page so you (or another one of us PSU people) has it on their docket?

kvnthomas98 commented 1 month ago

Sure, I've added it to the backlog

kvnthomas98 commented 1 month ago

Regular Expression Matching was slowing this process down. Not a lot of our text matching for node names are regular expressions. So I created a separate field in the general_concepts.json file to specify regular expressions. And we do regular expression matching only for these fields and for the others we do a string comparision. In the below query, the time taken reduces from 0.5 to 0.15. Before Changes:

image

After changes:

image
kvnthomas98 commented 1 month ago

query given above now takes 0.18 seconds at the removing general concepts step. Good to close @edeutsch ?

edeutsch commented 1 month ago

Sure, fine to close.

I will say I am a bit surprised that we're doing string matching and regular expression matching. I would have expected that our graph was NodeSynonymized and our blocklist would be a set of NodeSynonymized CURIEs, and thus the matching should all be very fast dictionary lookups of CURIEs. And thus the process should take < 0.018 seconds, rather than 0.18 seconds. But anyway, 0.18 seconds is sufficient I think, thanks.