Flowminder / FlowKit

FlowKit: Flowminder CDR analytics toolkit
https://flowminder.github.io/FlowKit/
Mozilla Public License 2.0
86 stars 21 forks source link

Allow CustomQuery to take an optional dict of Query objects #814

Open jc-harrison opened 5 years ago

jc-harrison commented 5 years ago

There have been discussions recently regarding the desire for users who prefer to write their own SQL queries to be able to make more use of FlowMachine's caching.

It seems to me that we could go some way towards supporting this use case by making it possible to pass query objects as sub-queries of a CustomQuery.

We could do this by adding a new, optional argument to CustomQuery (e.g. dependencies) which accepts a dict of Query objects, which could be matched to placeholders in the SQL string.

Something like:

CustomQuery(
    sql="""
        SELECT ...
        FROM {another_custom_query}
        INNER JOIN {daily_location}
        ON ...
    """,
    dependencies={
        "another_custom_query": CustomQuery(...),
        "daily_location": daily_location(...)
    },
    columns=...,
)

The dependencies could then be included in query.dependencies(), and query._make_query would return something along the lines of

self.sql.format({name: query.get_query() for name, query in dependencies.items()})

which would then make use of the cached dependencies where they exist.

It woul be good to hear from FlowMachine users / potential users as to whether this is something that would be useful in practice.

greenape commented 5 years ago

A legit reason to use **kwargs? 😲

I like this idea but I think it would break the hashing of custom query, so we'd need a different solution there.

greenape commented 2 years ago

Expanding on the above - custom query hashing is dependent on being able to normalise the input SQL, which we do by parsing it against the PostgreSQL parser. Obviously that's going to fail if we give it a template string instead of valid SQL, but waiting on the string to be generated is a blocking op, which we don't want to do in the __init__ - at that point, you might as well just generate the sql and pass it in to the existing class.

Maybe one could fill the gaps for hashing purposes with valid-but-meaningless SQL? e.g. sub in (select 1) as another_custom_query?

jc-harrison commented 2 years ago

We could perhaps fill the gaps with select * from {another_custom_query.fully_qualified_table_name} for hashing (i.e. for hashing purposes, assume all sub-queries are already cached)?