Row_number logic in cohort table query

SSMK-wq commented 4 years ago

Hi,

May I kindly request you to help me understand the logic of generating row_number and what does it do? I mean I understand that you concat a random number with the current timestamp but why can't a normal row number be used?

Why is Checsum and MD5 is used? I understand these functions/utilities are used during file transfer to verify the integrity of files.

select co.*, p.*,
      row_number() over (order by ABS(CHECKSUM(MD5(RANDOM()::TEXT || CLOCK_TIMESTAMP()::TEXT))) % 123456789) rn
    from s1.depat co
    join s2.person p
      on co.subject_id = p.person_id

I can understand that it is not about just generating sequential row_numbers but unfortunately postgresql doesn't have these functions like checksum and it all fails.

Is there any generalized way to write this row_number which can work in postgresql as well ? Will it create a problem if I generate random row numbers without checksum or abs function. Just like as shown below

MD5(RANDOM()::TEXT || CLOCK_TIMESTAMP()::TEXT) #this works in postgresql

(RANDOM()::TEXT || CLOCK_TIMESTAMP()::TEXT) # this works as well

ABS cannot be used because checksum fails in postgresql which in turn goes on to explain that division by 123456789 cannot be done as well

But I see that this rn (row_number) is being used in several where clauses in the query file CreateCohortsV6.sql. So can help me understand this?

Do you have any suggestions here? Or Am I making any mistakes here?

jswerdel commented 4 years ago

Hello I have made changes in the package that should correct the issues with the sql files and postgres. Please let me know if there is still a problem with postgres

SSMK-wq commented 4 years ago

Hi @jswerdel,

I did try executing the package. Few questions

1) May I know why there should be a difference of 365 or more days as shown below. Was it for any specific study which is left here by mistake? or it's criteria to use PheValuator?

and (CAST(co.COHORT_START_DATE AS DATE) - CAST(o.observation_period_start_date AS DATE)) >= 365

2) Still there is a postgres compatibility issue as shown below

o.observation_period_start_date between cast(19000101 as varchar) and cast(39000101 as varchar)

This when modified like below, the query works (Look for the quotes and DATE keyword)

o.observation_period_start_date between cast('19000101' as DATE) and cast('39000101' as DATE)

3) Another postgres compatibility issue as shown below

having minObsStart between cast(19000101 as varchar) and cast(39000101 as varchar)

the above when modified as shown below, it works.

having minObsStart between cast('19000101' as DATE) and cast('39000101' as DATE) 4) In addition, I also think that the error report.txt is not updated yet. Because when there was a date issue (issue 2) and when I opened the error report, I was able to still see that MD5(RANDOM::TEXT) thingy in error report.

Though I fixed these issues locally, I see throughout it's the same date issues (VARCHAR)

Can help me with this 4 questions please?

jswerdel commented 4 years ago

1) we want to be sure there is at least 365 days prior to the index date (start of health outcome) so that there is enough prior information to inform the model 2) This has now been corrected 3) This has now been corrected 4) If this is still occurring, please include the error report in the next issue report.

Please re-install the latest version of PheValuator

Thanks for your work in helping us make PheValuator compatible with postgres.

OHDSI / PheValuator

Row_number logic in cohort table query #12