RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Idea for decorating ALL nodes/edges only at very end of ARAX query? #1375

Closed: amykglen closed this issue 3 months ago

amykglen commented 3 years ago

with #1359 and #1370 in mind, I've been playing with this idea: might it make sense to decorate nodes/edges involved in an ARAX query answer with the nice 'additional' attributes provided by KG2/c (description, iri, publications, provided_by, etc.) only at the very end of an ARAX query, after the results have been filtered down and everything?

I believe ARAX modules downstream of expand don't use this information(?), and most of the time a huge chunk (99% for medium/large queries!) of those nodes/edges are thrown out at the end of ARAX's processing anyway, when it filters results.

meaning, the system could look something like this:

this would eliminate the plover 'postprocessing' step (which can be time consuming for large queries, when it has to look up thousands/millions of nodes/edges in sqlite). and the number of nodes/edges that ARAX would need to decorate would be quite small really, since ARAX automatically filters results down to 100 (for JSON queries at least). so looking up these nodes/edges in sqlite could happen very quickly. (maybe we wouldn't even need Redis.)
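as a rough sketch of that final lookup: the `nodes` table, its columns, and the `EX:` IDs below are made up for illustration (this is not the actual KG2c sqlite schema), and a single `IN (...)` query is only assumed to be reasonable because the final result set is small.

```python
import json
import sqlite3


def decorate_nodes(conn, nodes):
    """Add stored attributes to the few nodes that survived result filtering."""
    node_ids = list(nodes)
    if not node_ids:
        return nodes
    # One parameterized IN (...) query; fine for ~100 IDs, but a much larger
    # set would need to be looked up in chunks.
    placeholders = ",".join("?" for _ in node_ids)
    query = f"SELECT id, attributes FROM nodes WHERE id IN ({placeholders})"
    for node_id, attributes_json in conn.execute(query, node_ids):
        nodes[node_id].update(json.loads(attributes_json))
    return nodes


# Toy in-memory database standing in for the real sqlite file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, attributes TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?)",
    [
        ("EX:1", json.dumps({"description": "toy node one", "iri": "http://example.org/1"})),
        ("EX:2", json.dumps({"description": "toy node two", "iri": "http://example.org/2"})),
    ],
)

# Only the nodes left in the final results get looked up.
final_nodes = {"EX:1": {"name": "node one", "category": "biolink:NamedThing"}}
decorate_nodes(conn, final_nodes)
```

the point is just that the lookup cost scales with the number of nodes in the final results, not with everything expand pulled in.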

I think this would result in a big reduction in memory consumption for large queries as well, and might help with #1370.

if we were worried about the fact that our additional attributes wouldn't be available for people who hit up the KG2 API directly, maybe we could add some sort of special parameter that specified whether to decorate or not decorate (so ARAX would just always tell the KG2 API to 'not decorate').

any thoughts?

amykglen commented 3 years ago

realized perhaps the ranker needs access to provided_by and publications for edges. so the idea might require one little tweak there.

edeutsch commented 3 years ago

I think this is a terrific idea! Three thumbs up! I can add a query option to control whether this happens to the API. I'm thinking the default should be to INclude metadata and have an option to turn it off? Maybe something like minimal_metadata = True? But feel free to suggest alternatives, I don't feel strongly. We could allow the option for ARAX, too.

amykglen commented 3 years ago

nice, that sounds great to me!

I could add a class ARAXDecorator (or something like that), that would just need to be plugged into ARAXQuery I guess (automatically called after ranking and everything is done, depending on the minimal_metadata option?)
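a minimal sketch of how that could hang together, with lots of assumptions: `DecoratorSketch`, `finish_query`, and the `lookup` callable are made-up names (not the real ARAXDecorator or ARAXQuery code), and keeping the first 100 nodes is just a stand-in for real ranking/filtering.

```python
from itertools import islice


class DecoratorSketch:
    """Stand-in for the proposed decorator; `lookup` is a hypothetical callable."""

    def __init__(self, lookup):
        self.lookup = lookup
        self.calls = 0  # counts lookups, to show how few are needed

    def decorate(self, nodes):
        for node_id, node in nodes.items():
            self.calls += 1
            node.update(self.lookup(node_id))


def finish_query(nodes, decorator, minimal_metadata=False, max_results=100):
    """Filter first, then decorate only what is left (unless told not to)."""
    # Placeholder for ranking/filtering: keep the first max_results nodes.
    # Each node dict is copied so the expanded graph itself is not modified.
    kept = {node_id: dict(node) for node_id, node in islice(nodes.items(), max_results)}
    if not minimal_metadata:
        decorator.decorate(kept)
    return kept


# 1000 nodes come out of "expand", but only 100 survive to be decorated.
expanded = {f"EX:{i}": {"name": f"node {i}"} for i in range(1000)}
decorator = DecoratorSketch(lambda node_id: {"description": f"about {node_id}"})

decorated = finish_query(expanded, decorator)
bare = finish_query(expanded, decorator, minimal_metadata=True)
```

with `minimal_metadata=True` the decorator is never called, so callers that only want the core properties skip the lookup cost entirely.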

dkoslicki commented 3 years ago

One thing to note: the ranker in resultify() does use some of the additional properties. One example is the number of publications, which informs literature-based rankings. Others, which I don't think would be affected by what you are proposing (but thought I'd mention it), are things like chi-square values, probabilities, log ratios, etc., which are added by the overlay step.

amykglen commented 3 years ago

ok, changes for the bulk of this issue and #1359 are in master and are ready to be rolled out to ARAX and the KG2 API.

large queries using plover seem to be moving much faster now that there's almost no 'postprocessing' step. I'm seeing about 30 seconds for the plover-related portion of the second hop in the query in #1370 vs. about 120 seconds before. my internet isn't super fast though, and the vast majority of that 30 seconds is spent just receiving plover's response (which is fairly large now since edge/node objects are returned instead of just IDs), so I'm expecting the time to be better than that for arax.ncats.io.

I made it so that plover returns edges in full (including publications, provided_by) but nodes only with their 'core' properties (name, category). nodes are then decorated with additional attributes at the very end of Expand (by calling ARAXDecorator). I think this node decoration should ultimately happen at the end of ARAXQuery instead, after results have been filtered (I just wasn't totally sure where to plug it in). figuring that can be done as time allows.

also, if/when the minimal_metadata parameter is set up that will be very easy to plug into expand.

edeutsch commented 3 years ago

I have just rolled out master to all deployments, including production and /kg2. Please test.

amykglen commented 3 years ago

awesome, thanks! looking good to me! I suppose I'll leave this issue open until we get the call to ARAXDecorator moved to ARAXQuery (which should help free up memory/save further time) and get the minimal_metadata flag set up

edeutsch commented 4 months ago

@amykglen what's the state of this issue?

amykglen commented 3 months ago

I think we should close this one - there hasn't really been demand for it and it could be complicated to implement