OHDSI / SqlRender

This is an R package and Java library for rendering parameterized SQL, and translating it to different SQL dialects.
https://ohdsi.github.io/SqlRender
Other
82 stars 77 forks source link

Add listagg()/stringagg() to OHDSI-SQL #268

Closed ablack3 closed 2 years ago

ablack3 commented 2 years ago

Would it be possible to add listagg as a supported cross database function? It seems like most databases support it a little differently.

Oracle: listagg https://docs.oracle.com/cd/E11882_01/server.112/e41084/functions089.htm#SQLRF30030 Postgres: string_agg https://www.postgresqltutorial.com/postgresql-aggregate-functions/postgresql-string_agg-function/ Sqlite: group_concat https://www.sqlite.org/search?q=group_concat Bigquery: string_agg https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#string_agg Redshift: listagg https://docs.aws.amazon.com/redshift/latest/dg/r_LISTAGG.html SqlServer: string_agg https://docs.microsoft.com/en-us/sql/t-sql/functions/string-agg-transact-sql?view=sql-server-ver15 Hive: concat_ws() + COLLECT_LIST()/COLLECT_SET() https://stackoverflow.com/questions/51520912/hive-equivalent-of-listagg

schuemie commented 2 years ago

Seems reasonable. Could you create a PR with the required translation rules + unit tests?

ablack3 commented 2 years ago

I'm still not sure if this is possible and I don't think it is as simple as replacing the name of the function. I'll start by doing some experiments/unit tests and posting the results here.

ablack3 commented 2 years ago

Here is the same query using listagg from the OHDSI Oncology regimen finder in four dialects. Currently we are using separate sets of SQL files, one for each dbms.

# sqlite
with cte0 as (
  select distinct person_id, ingredient_start_date , concept_name from @writeDatabaseSchema.@regimenTable
),
cte as (
  select r.person_id, r.ingredient_start_date as regimen_start_date,
  group_concat(lower(r.concept_name), ',')  as regimen
  from cte0 r
  group by r.person_id, r.ingredient_start_date
  ORDER BY lower(r.concept_name)
)
...

# postgres
with cte as (
  select r.person_id, r.ingredient_start_date as regimen_start_date,
  STRING_AGG(DISTINCT lower(r.concept_name), ','
             ORDER BY lower(r.concept_name)) as regimen
  from @writeDatabaseSchema.@regimenTable r
  group by r.person_id, r.ingredient_start_date
)
...

# bigquery
with cte as (
  select r.person_id, r.ingredient_start_date as regimen_start_date,
  STRING_AGG(DISTINCT lower(r.concept_name), ','
             ORDER BY lower(r.concept_name)) as regimen
  from @writeDatabaseSchema.@regimenTable r
  group by r.person_id, r.ingredient_start_date
)
...

# redshift
with cte as (
  select r.person_id, r.ingredient_start_date as regimen_start_date,
  LISTAGG(DISTINCT lower(r.concept_name), ',')
  WITHIN GROUP (ORDER BY lower(r.concept_name)) as regimen
  from @writeDatabaseSchema.@regimenTable r
  group by r.person_id, r.ingredient_start_date
)
...

I don't think the syntax differences shown above can be handled by the current find and replace approach used in SqlRender. If we had a true OHDSI-SQL a parser and could translate the parsed abstract syntax tree then maybe it could work.

A few interesting links: Sql grammar with syntax diagrams: https://www.jooq.org/doc/latest/manual/sql-building/sql-parser/sql-parser-grammar/ Sql parser in Java: https://github.com/JSQLParser/JSqlParser/wiki Sql parser in R: https://github.com/ianmcook/queryparser

schuemie commented 2 years ago

I've explored full SQL parsing a while ago, but eventually decided against it because of the large amount of complexity it added (including how to formulate the translation rules). You're welcome to explore it again of course.

Do we really need LISTAGG()? Could you provide a use case where this is needed and cannot be solved with the currently supported SQL statements?

ablack3 commented 2 years ago

Makes sense. The use case I'm looking at is in the OncologyRegimenFinder package here. This is how we are currently identifying oncology drug regimens. I think this algorithm could be improved or at least cleaned up and I'm not sure using concatenated ingredient names as regimen identifiers is a good idea but it's what we have right now. This package only works on redshift primarily because of the use of listagg.

Something like MATCH_RECOGNIZE might work better for finding regimens but I think this function is not implemented across all platforms.

Do you think the translation examples above could be implemented using the current approach? If not I think I'll close the issue and explore alternatives for the oncology work. I might try some experiments with full sql parsing and share the results.

schuemie commented 2 years ago

I don't think the examples you have can be supported with the current SqlRender approach of translation rules.

I don't like concatenating strings server-side to capture drug regimens. I would probably create a (temp) table for regimens, with multiple rows per regimen (maybe one per drug. I don't know how they're defining regimens here). You can then pull that table to R, and manipulate it any way you want, including concatenating names to create a single 'human readable' identifier.