cladteam / CCDA_OMOP_by_Python

2 stars 3 forks source link

use correct hash function #125

Open chrisroederucdenver opened 3 days ago

chrisroederucdenver commented 3 days ago

Steph and co. in N3C use an md5 hash of semicolon concatenated fields, with further work to get an integer? Dan was using sha256() and mine Sha2(x,256).

Need to agree on the fields, the delimiter, pipe was suggested, the function, some business to cut down to the first 15 characters and the int conversion. Should also consider how the source string is encoded. UTF-8?

The SQL code casts to bigint. What is the equivalent in Python?

chrisroederucdenver commented 3 days ago

from N3C

cms_condition_occurrence as ( select distinct md5(concat_ws( ';' , COALESCE(co.BID, '') , COALESCE(co.provider_id, '') , COALESCE(co.condition_start_date, '') , COALESCE(co.condition_end_date, '') , COALESCE(co.condition_concept_id, '') , COALESCE(co.condition_source_value, '') , COALESCE(co.visit_concept_id, '' ) , COALESCE(co.source_domain, '' ) , COALESCE(co.condition_status_concept_id, '') , COALESCE(co.condition_status_source_value, '') , COALESCE(co.visit_occurrence_id, '') , COALESCE('Medicare', '') )) as cms_hashed_condition_occurrence_id , CAST( co.BID AS long ) as person_id , cast(condition_concept_id as int) as condition_concept_id , cast( condition_start_date as date) as condition_start_date , cast( condition_start_date as timestamp ) as condition_start_datetime , cast( condition_end_date as date) as condition_end_date , cast( condition_end_date as timestamp ) as condition_end_datetime , cast( 32810 as int) as condition_type_concept_id , cast( condition_status_concept_id as int) condition_status_concept_id -- need to reference the original column names in order to determine the condition status , cast(null as string ) as stop_reason , cast( co.provider_id as long) as provider_id , cast( co.visit_occurrence_id as bigint) as visit_occurrence_id , cast(visit_detail_id as bigint) as visit_detail_id , cast( co.condition_source_value as string) as condition_source_value , cast( co.condition_source_concept_id as int) as condition_source_concept_id , cast( co.condition_status_source_value as string ) as condition_status_source_value --- value for the condition_status_concept_id- how dx was givne -build using the column and the seqnum , co.source_domain from final_co co left join/UNITE/[PPRL] CMS Data & Repository/pipeline/transform/04 - domain_mapping/person` p on co.BID = p.person_id -- match visit on bid, provider_id, visit start_date and end_date, visit_concept_id( that are all unique based on the source domain, 32810= visit_type_concept_id is used for claims data. / TODO: join when person is ready person is empty/ where condition_concept_id is not null )

SELECT
  *
-- 2251799813685247 = ((1 << 51) - 1) - bitwise AND gives you the first 51 bits
--cast(conv(substr(hashed_id, 1, 15), 16, 10) as bigint) & 2251799813685247 as drug_exposure_id_51_bit
--, cast(conv(substr(cms_hashed_condition_occurrence_id, 1, 15), 16, 10) as bigint) & 2251799813685247 as condition_occurrence_id
, cast(conv(substr(cms_hashed_condition_occurrence_id, 1, 15), 16, 10) as bigint) as condition_occurrence_id
FROM cms_condition_occurrence

`