Closed ablack3 closed 2 years ago
Seems reasonable. Could you create a PR with the required translation rules + unit tests?
I'm still not sure if this is possible and I don't think it is as simple as replacing the name of the function. I'll start by doing some experiments/unit tests and posting the results here.
Here is the same query using listagg from the OHDSI Oncology regimen finder in four dialects. Currently we are using separate sets of SQL files, one for each dbms.
# sqlite
with cte0 as (
select distinct person_id, ingredient_start_date , concept_name from @writeDatabaseSchema.@regimenTable
),
cte as (
select r.person_id, r.ingredient_start_date as regimen_start_date,
group_concat(lower(r.concept_name), ',') as regimen
from cte0 r
group by r.person_id, r.ingredient_start_date
ORDER BY lower(r.concept_name)
)
...
# postgres
with cte as (
select r.person_id, r.ingredient_start_date as regimen_start_date,
STRING_AGG(DISTINCT lower(r.concept_name), ','
ORDER BY lower(r.concept_name)) as regimen
from @writeDatabaseSchema.@regimenTable r
group by r.person_id, r.ingredient_start_date
)
...
# bigquery
with cte as (
select r.person_id, r.ingredient_start_date as regimen_start_date,
STRING_AGG(DISTINCT lower(r.concept_name), ','
ORDER BY lower(r.concept_name)) as regimen
from @writeDatabaseSchema.@regimenTable r
group by r.person_id, r.ingredient_start_date
)
...
# redshift
with cte as (
select r.person_id, r.ingredient_start_date as regimen_start_date,
LISTAGG(DISTINCT lower(r.concept_name), ',')
WITHIN GROUP (ORDER BY lower(r.concept_name)) as regimen
from @writeDatabaseSchema.@regimenTable r
group by r.person_id, r.ingredient_start_date
)
...
I don't think the syntax differences shown above can be handled by the current find and replace approach used in SqlRender. If we had a true OHDSI-SQL a parser and could translate the parsed abstract syntax tree then maybe it could work.
A few interesting links: Sql grammar with syntax diagrams: https://www.jooq.org/doc/latest/manual/sql-building/sql-parser/sql-parser-grammar/ Sql parser in Java: https://github.com/JSQLParser/JSqlParser/wiki Sql parser in R: https://github.com/ianmcook/queryparser
I've explored full SQL parsing a while ago, but eventually decided against it because of the large amount of complexity it added (including how to formulate the translation rules). You're welcome to explore it again of course.
Do we really need LISTAGG()
? Could you provide a use case where this is needed and cannot be solved with the currently supported SQL statements?
Makes sense. The use case I'm looking at is in the OncologyRegimenFinder package here. This is how we are currently identifying oncology drug regimens. I think this algorithm could be improved or at least cleaned up and I'm not sure using concatenated ingredient names as regimen identifiers is a good idea but it's what we have right now. This package only works on redshift primarily because of the use of listagg.
Something like MATCH_RECOGNIZE might work better for finding regimens but I think this function is not implemented across all platforms.
Do you think the translation examples above could be implemented using the current approach? If not I think I'll close the issue and explore alternatives for the oncology work. I might try some experiments with full sql parsing and share the results.
I don't think the examples you have can be supported with the current SqlRender approach of translation rules.
I don't like concatenating strings server-side to capture drug regimens. I would probably create a (temp) table for regimens, with multiple rows per regimen (maybe one per drug. I don't know how they're defining regimens here). You can then pull that table to R, and manipulate it any way you want, including concatenating names to create a single 'human readable' identifier.
Would it be possible to add listagg as a supported cross database function? It seems like most databases support it a little differently.
Oracle: listagg https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030 Postgres: string_agg https://www.postgresqltutorial.com/postgresql-aggregate-functions/postgresql-string_agg-function/ Sqlite: group_concat https://www.sqlite.org/search?q=group_concat Bigquery: string_agg https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg Redshift: listagg https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html SqlServer: string_agg https://docs.microsoft.com/en-us/sql/t-sql/functions/string-agg-transact-sql?view=sql-server-ver15 Hive: concat_ws() + COLLECT_LIST()/COLLECT_SET() https://stackoverflow.com/questions/51520912/hive-equivalent-of-listagg