Closed chrisroederucdenver closed 2 weeks ago
from N3C
cms_condition_occurrence as ( select distinct md5(concat_ws( ';' , COALESCE(co.BID, '') , COALESCE(co.provider_id, '') , COALESCE(co.condition_start_date, '') , COALESCE(co.condition_end_date, '') , COALESCE(co.condition_concept_id, '') , COALESCE(co.condition_source_value, '') , COALESCE(co.visit_concept_id, '' ) , COALESCE(co.source_domain, '' ) , COALESCE(co.condition_status_concept_id, '') , COALESCE(co.condition_status_source_value, '') , COALESCE(co.visit_occurrence_id, '') , COALESCE('Medicare', '') )) as cms_hashed_condition_occurrence_id , CAST( co.BID AS long ) as person_id , cast(condition_concept_id as int) as condition_concept_id , cast( condition_start_date as date) as condition_start_date , cast( condition_start_date as timestamp ) as condition_start_datetime , cast( condition_end_date as date) as condition_end_date , cast( condition_end_date as timestamp ) as condition_end_datetime , cast( 32810 as int) as condition_type_concept_id , cast( condition_status_concept_id as int) condition_status_concept_id -- need to reference the original column names in order to determine the condition status , cast(null as string ) as stop_reason , cast( co.provider_id as long) as provider_id , cast( co.visit_occurrence_id as bigint) as visit_occurrence_id , cast(visit_detail_id as bigint) as visit_detail_id , cast( co.condition_source_value as string) as condition_source_value , cast( co.condition_source_concept_id as int) as condition_source_concept_id , cast( co.condition_status_source_value as string ) as condition_status_source_value --- value for the condition_status_concept_id- how dx was givne -build using the column and the seqnum , co.source_domain from final_co co left join
/UNITE/[PPRL] CMS Data & Repository/pipeline/transform/04 - domain_mapping/person` p on co.BID = p.person_id
-- match visit on bid, provider_id, visit start_date and end_date, visit_concept_id( that are all unique based on the source domain, 32810= visit_type_concept_id is used for claims data.
/ TODO: join when person is ready person is empty/
where condition_concept_id is not null
)
SELECT
*
-- 2251799813685247 = ((1 << 51) - 1) - bitwise AND gives you the first 51 bits
--cast(conv(substr(hashed_id, 1, 15), 16, 10) as bigint) & 2251799813685247 as drug_exposure_id_51_bit
--, cast(conv(substr(cms_hashed_condition_occurrence_id, 1, 15), 16, 10) as bigint) & 2251799813685247 as condition_occurrence_id
, cast(conv(substr(cms_hashed_condition_occurrence_id, 1, 15), 16, 10) as bigint) as condition_occurrence_id
FROM cms_condition_occurrence
`
13 digits can be repeated in SQL, cooperating with Dan in Claims, though in other news we may not integrate with that work.
def create_hash(input_string): """ matches common SQL code when that code also truncates to 13 characters SQL: cast(conv(substr(md5(test_string), 1, 15), 16, 10) as bigint) as hashed_value """ hash_value = hashlib.md5(input_string.encode('utf-8')) truncated_hash = hash_value.hexdigest()[0:13] int_trunc_hash_value = int(truncated_hash, 16) return int_trunc_hash_value
As above. We're ready to interoperate if necessary, but it might not be an issue.
Working with Matt and Dan on this. Latest is that we should use SHA2 256. Discussion to continue at the DPC meetings Th. mornings 10am ET.
Up for discussion at the DPC meeting Thursday Nov. 7, 2024. 8am MT
discussed here https://docs.google.com/document/d/1dGsC_g9mDZD2s712-Td2teq7W0GAJix_dIPDKgHqKc4/edit?tab=t.0 closing as duplicate of #40
Steph and co. in N3C use an md5 hash of semicolon concatenated fields, with further work to get an integer? Dan was using sha256() and mine Sha2(x,256).
Need to agree on the fields, the delimiter, pipe was suggested, the function, some business to cut down to the first 15 characters and the int conversion. Should also consider how the source string is encoded. UTF-8?
The SQL code casts to bigint. What is the equivalent in Python?