Closed edeutsch closed 1 month ago
Is this the same as the blocklist filter? If so, I think @kvnthomas98 could take a look at speeding it up
I don't actually know what it does exactly. I'm just thinking it seems very slow compared to other parts of the system for what I imagine it might be doing.
I'm pretty sure it is indeed the blocklist filter step
@kvnthomas98 Do you mind adding this to the Translator GH project page so you (or another one of us PSU people) has it on their docket?
Sure, I've added it to the backlog
Regular Expression Matching was slowing this process down. Not a lot of our text matching for node names are regular expressions. So I created a separate field in the general_concepts.json file to specify regular expressions. And we do regular expression matching only for these fields and for the others we do a string comparision. In the below query, the time taken reduces from 0.5 to 0.15. Before Changes:
After changes:
query given above now takes 0.18 seconds at the removing general concepts step. Good to close @edeutsch ?
Sure, fine to close.
I will say I am a bit surprised that we're doing string matching and regular expression matching. I would have expected that our graph was NodeSynonymized and our blocklist would be a set of NodeSynonymized CURIEs, and thus the matching should all be very fast dictionary lookups of CURIEs. And thus the process should take < 0.018 seconds, rather than 0.18 seconds. But anyway, 0.18 seconds is sufficient I think, thanks.
Based on profiling tests in #2372 and #2373, it seems that the step that removes general concepts from the knowledge graph seems to take much longer than seems naively expected. In #2373, it look 2.6 seconds. I'm thinking it should take <0.1 seconds, so I wonder if there is some easily addressable gross inefficiency that could be remedied?
Anyone willing to put this up on the lift and isolate where it is taking the most time and computation?