OHDSI / ETL-CMS

Workproducts to ETL CMS datasets into OMOP Common Data Model
Apache License 2.0

Person ID as integer, CMS's synthetic data has string IDs #44

Open georges-hatem opened 7 years ago

georges-hatem commented 7 years ago

In CMS synthetic data, the Beneficiary ID, 'DESYNPUF_ID', is of type String (technically hexadecimal).

When using Spark or other big data engines, it is generally easier to reuse this existing unique ID than to compute our own incremental ID. An incremental ID would require assigning all IDs on a single node in the cluster (each ID needs the value of the previous one), which would prevent parallelizing the processing.

Is there any particular reason for making the IDs integers instead of strings?

Thanks!

ChristopheLambert commented 7 years ago

I wanted to retain the original IDs as well, but the person_id field in the CDM is an integer; see http://www.ohdsi.org/web/wiki/doku.php?id=documentation:cdm:person. We could have converted the hex to decimal, but the decision to use incremental IDs was made before we stepped in and took the project forward. There are several other IDs that are incremented sequentially, so fixing person_id may not get you everything you need for parallelization.
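
For illustration only, the hex-to-decimal conversion mentioned above could look roughly like the sketch below. This is not code from the ETL-CMS project: the function names and the sample IDs are made up, and the 64-bit check assumes the target person_id column is stored as a signed 64-bit integer, which depends on the database. Because each ID is converted independently, this kind of mapping needs no shared counter across rows.

```python
# Hypothetical sketch: derive an integer ID from a hexadecimal string ID.
# Names and sample values are illustrative, not taken from ETL-CMS or the CMS files.

SIGNED_64_MAX = 2**63 - 1  # assumption: person_id is stored as a signed 64-bit integer


def hex_id_to_int(hex_id: str) -> int:
    """Interpret a hexadecimal ID string as a non-negative integer."""
    return int(hex_id.strip(), 16)


def fits_signed_64(value: int) -> bool:
    """Return True if the value can be stored in a signed 64-bit column."""
    return 0 <= value <= SIGNED_64_MAX


# Made-up IDs; each one is converted on its own, with no dependency on other rows.
sample_ids = ["000000000000002A", "00000000000000FF", "FFFFFFFFFFFFFFFF"]
converted = {raw: hex_id_to_int(raw) for raw in sample_ids}
storable = {raw: fits_signed_64(value) for raw, value in converted.items()}
```

One caveat with this approach: a 16-digit hexadecimal value can be as large as 2^64 - 1, which exceeds the signed 64-bit maximum, so any ID whose leading hex digit is 8 or higher would need separate handling.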