Non-deterministic use of `row_number()`

lawrenceadams commented 1 month ago

row_number is sprinkled out the repository, however it is used in various different ways which are likely to give unexpected/non-deterministic behaviour between runs. We usually will have a logical means of ordering them - if not date, then an ID of some sort. They tend to be used as:

row_number() over ()
row_number() over(order by (select null)

The variation is likely to be from a mixture of copy-pasted sources (e.g. T-SQL doesn't allow ...OVER () but Postgres/DuckDB do). The main two offenders I can find so far:

https://github.com/OHDSI/dbt-synthea/blob/ae791145d50c9e0693880ff9ed37d60f7cc0195d/models/omop/location.sql#L2

https://github.com/OHDSI/dbt-synthea/blob/75a1e12ae218f1fbaa97cda301423f8113508b25/models/omop/provider.sql#L2

Although this is probably has low impact downstream, it may be causing unexpected behaviour e.g. #47

lawrenceadams commented 1 month ago

Interestingly the former has been solved upstream:

https://github.com/OHDSI/ETL-Synthea/pull/200/files

katy-sadowski commented 1 month ago

I definitely agree we should choose a consistent and deterministic way of doing this, and I like ROW_NUMBER() OVER() ... are you sure SQL Server doesn't support it? My Googling says yes but I've never actually used SQL Server.

lawrenceadams commented 1 month ago

I definitely agree we should choose a consistent and deterministic way of doing this, and I like ROW_NUMBER() OVER() ... are you sure SQL Server doesn't support it? My Googling says yes but I've never actually used SQL Server.

Sorry I made my point badly - you're absolutely right it does support it, but it does not support an empty OVER () clause (like my first example above), instead you need to provide an order by sequence or do OVER (SELECT NULL) (which I think we should avoid)

https://stackoverflow.com/questions/44105691/row-number-without-order-by

katy-sadowski commented 1 month ago

Ahh OK! That makes more sense 🙃 and agree, we should choose some ordering key for consistency even if there are dupes in a table (which will happen in OMOP).

OHDSI / dbt-synthea

Non-deterministic use of `row_number()` #77