Missouri-BMI / OMOP_ON_P_CDM

Apache License 2.0

Explore PCORNet CDM to OMOP Conversion #1

Closed vasanthi014 closed 2 years ago

vasanthi014 commented 2 years ago

A good place for us to start looking into and replicating the data transformation work (from PCORnet CDM to OMOP CDM) is: https://github.com/National-COVID-Cohort-Collaborative/Data-Ingestion-and-Harmonization/tree/master/ETLProcess/scripts.

By skimming over it, I think a rough process is as follows:

  1. create OMOP table shells: https://github.com/OHDSI/CommonDataModel/blob/master/inst/ddl/5.4/postgresql/OMOPCDM_postgresql_5.4_ddl.sql Note: we will need to modify this code to fit SnowSQL syntax, but that should be straightforward.

  2. load concept crosswalks: https://github.com/National-COVID-Cohort-Collaborative/Data-Ingestion-and-Harmonization/blob/master/ETLProcess/scripts/p2o_term_xwalk.sql Note: we could either run this SQL directly or upload this TSV file into Snowflake as the p2o_term_xwalk table: https://github.com/National-COVID-Cohort-Collaborative/Data-Ingestion-and-Harmonization/blob/master/ETLProcess/scripts/export/p2o_term_xwalk_export.tsv

  3. perform transformation steps in SQL: all SP_P2O_SRC....sql files on https://github.com/National-COVID-Cohort-Collaborative/Data-Ingestion-and-Harmonization/tree/master/ETLProcess/scripts Note: we will need to modify this code to fit SnowSQL syntax. Another resource is https://github.com/PEDSnet/pcornetcdm_to_pedsnetcdm, which will also need to be modified.

  4. populate the standardized vocabulary tables: https://github.com/OHDSI/Vocabulary-v5.0
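The TSV route in step 2 can be sanity-checked locally before loading; below is a minimal sketch of parsing a tab-delimited crosswalk export with Python's csv module. The sample column names are assumptions for illustration, not taken from the actual p2o_term_xwalk_export.tsv.

```python
import csv
import io

def load_xwalk_rows(tsv_text):
    """Parse a tab-delimited crosswalk export into dicts keyed by header.

    Using delimiter='\t' keeps fields that contain commas intact,
    which a naive split(',') would mangle.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)

# Hypothetical sample mirroring the general shape of a term crosswalk;
# the real export's columns may differ.
sample = "CDM_TBL\tCDM_SOURCE\tOMOP_CONCEPT_ID\nDEATH\tUN\t0\n"
rows = load_xwalk_rows(sample)
print(rows[0]["CDM_TBL"])  # → DEATH
```

Parsing locally first makes it easy to confirm the column count and header names before defining the Snowflake table.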

A few other related resources:

It appears that we should be moving to OMOP 5.4, and the PEDSnet code may be more straightforward to modify. Either way, we are thinking of maintaining staging tables that record at least the table of origin of each record, for ease of data modification going forward. For now, we will also keep all codes in a single crosswalk table, but revisit that if the scripts run too slowly.

mosaa commented 2 years ago

@kzraryan-mu @vasanthi014: Comments documented here are relevant: https://github.com/Missouri-BMI/GROUSE/issues/47#issuecomment-1055999807

sxinger commented 2 years ago

@mosaa should we split the task between Rana and Christine? I will join our daily scrum to discuss more tomorrow.

mosaa commented 2 years ago

@sxinger: Discuss with Vasanthi. Christie doing the first pass sounds great. We shall use this fork for code development. Any enhancement shall remain in a branch so that we can contribute back. https://github.com/Missouri-BMI/Data-Ingestion-and-Harmonization

sxinger commented 2 years ago

@spinkac for questions about PCORnet CDM and our source data, you can ask @kzraryan-mu

spinkac commented 2 years ago

@vasanthi014 @sxinger I have already done the first two steps above, and I am working on the third. As a heads-up: there are differences between the N3C mapping and the paper; the OMOP version in the N3C code is 5.3, while the OHDSI DDL is 5.4; and the N3C SQL scripts rely on a set of per-table crosswalk tables, not the single table linked above.

rwaitman commented 2 years ago

If Jacob Kean can't comment on which version of OMOP the VA is using, you can also contact a friend, Michael Matheny, at Vanderbilt: Matheny, Michael E (University) michael.matheny@Vanderbilt.Edu

sxinger commented 2 years ago

VA OMOP is on 5.3. We will stick with 5.3 @spinkac

From: Saoudian, Hamid (UofU) <[Hamid.Saoudian@va.gov](mailto:Hamid.Saoudian@va.gov)> 
Sent: Thursday, March 10, 2022 9:12 AM
To: Kean, Jacob T. <[Jacob.Kean@va.gov](mailto:Jacob.Kean@va.gov)>
Cc: Tucker, Jay N. <[Jay.Tucker2@va.gov](mailto:Jay.Tucker2@va.gov)>; [xsm7f@health.missouri.edu](mailto:xsm7f@health.missouri.edu); Casey Kangas <[ckangas@kennellinc.com](mailto:ckangas@kennellinc.com)>
Subject: RE: quick question about VA OMOP version

5.3.1

Hamid Saoudian
[hamid.saoudian@va.gov](mailto:hamid.saoudian@va.gov)
VA Informatics and Computing Infrastructure (VINCI)
Salt Lake City

From: Kean, Jacob T. <[Jacob.Kean@va.gov](mailto:Jacob.Kean@va.gov)> 
Sent: Thursday, March 10, 2022 8:24 AM
To: Saoudian, Hamid (UofU) <[Hamid.Saoudian@va.gov](mailto:Hamid.Saoudian@va.gov)>
Cc: Tucker, Jay N. <[Jay.Tucker2@va.gov](mailto:Jay.Tucker2@va.gov)>; [xsm7f@health.missouri.edu](mailto:xsm7f@health.missouri.edu); Casey Kangas <[ckangas@kennellinc.com](mailto:ckangas@kennellinc.com)>
Subject: FW: quick question about VA OMOP version

Hamid-

Please see below. Are we on OMOP v.5, and if so, which one?

Thanks,

Jacob

spinkac commented 2 years ago

@sxinger Sounds good. I will go back and search for older OMOP shells/reverse the updates to the newer versions. Shouldn't take long....

mosaa commented 2 years ago

@sxinger @spinkac @vasanthi014: A new repository has been created: https://github.com/Missouri-BMI/OMOP_ON_P_CDM/

spinkac commented 2 years ago

@kzraryan-mu @sxinger @mosaa @vasanthi014 This is the crosswalk table issue I referred to earlier. The crosswalk table from (2) above is a single table that contains entries for many different tables. The SQL scripts from (3) expect separate crosswalks for each type of table, and the contents of those crosswalk tables are not consistent, either across tables or with the larger table from (2). I created an OMOP death table as a starter, but I am not confident that the joins to the crosswalk tables are working correctly.
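One way to pin down the inconsistency described above is to diff the mapping sets mechanically before writing the joins. A rough sketch follows; the row shapes and the src_code / omop_concept_id column names are placeholders, not the real schema:

```python
def xwalk_diff(single_table_rows, per_table_rows,
               key=("src_code", "omop_concept_id")):
    """Report code mappings present in one crosswalk but not the other.

    single_table_rows: rows from the combined crosswalk table (step 2)
    per_table_rows:    rows from one of the per-table crosswalks (step 3)
    Each row is a dict; `key` names the columns identifying a mapping.
    """
    left = {tuple(r[k] for k in key) for r in single_table_rows}
    right = {tuple(r[k] for k in key) for r in per_table_rows}
    return {"only_in_single": left - right,
            "only_in_per_table": right - left}

# Toy data for illustration only.
single = [{"src_code": "UN", "omop_concept_id": "0"},
          {"src_code": "NI", "omop_concept_id": "46237210"}]
per_tbl = [{"src_code": "UN", "omop_concept_id": "0"}]
diff = xwalk_diff(single, per_tbl)
print(sorted(diff["only_in_single"]))  # → [('NI', '46237210')]
```

Running something like this per table would show exactly which mappings disagree, rather than discovering mismatches through failed joins.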

spinkac commented 2 years ago

@sxinger @mosaa @vasanthi014 @kzraryan-mu The SQL scripts from N3C create staging tables, which are then used to populate the actual OMOP tables. Do we want to continue this practice?

Advantages:

Disadvantages:

Thoughts?

kzraryan-mu commented 2 years ago

I thought we were populating OMOP on top of the CDM tables. For I2B2 on CDM, we first create a table containing all CDM columns plus the additional columns needed for I2B2 (since the relationship is often not 1-to-1, some of the columns were not populated). Then, to populate all the columns for I2B2, we created another (final) table from the previous one.

We can take the same approach for this one as well, right?

spinkac commented 2 years ago

@kzraryan-mu I am not familiar with what you did for I2B2. Can you please explain to me why you made tables with both types of variables as an intermediate step?

Also, since OMOP is another CDM, there are many cases where there is not a 1-to-1 mapping between tables (often it is many-to-many). How would you handle this with the method you are describing?

sxinger commented 2 years ago

@spinkac please arrange a meeting among Rana, you, and me to talk through the current issues you have run into; Rana can also show us his approach to transforming CDM to i2b2. i2b2 is just another data model, with a giant "observation_fact" table instead of breaking facts down into smaller tables like "diagnosis" and "procedures".

kzraryan-mu commented 2 years ago

Yes, a meeting will be better to demonstrate what we did for I2B2 on PCORNet CDM.

Still, here is a heads-up.

Question 1: For columns that need only simple modifications (data type casting, formatting, merging) where the relationship is 1-to-1, a single combined table is useful for generating the final table; it works like a staging table. If this doesn't make sense for OMOP, you can populate the final table directly from the source. For example, there is a column in I2B2 called instance_num which, technically speaking, groups six other primary-key columns and separates them. Instead of modifying the six columns and grouping them in a subquery, it was better to visualize them in the same table alongside the CDM columns.
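The instance_num idea above can be sketched as a running counter over a composite key. This is an illustrative reconstruction, not the actual I2B2 ETL code, and the column names are assumptions:

```python
from collections import defaultdict

def assign_instance_num(rows, key_cols):
    """Assign a 1-based instance_num that distinguishes rows sharing
    the same composite key -- the role instance_num plays in the i2b2
    observation_fact primary key."""
    counters = defaultdict(int)
    out = []
    for row in rows:
        k = tuple(row[c] for c in key_cols)
        counters[k] += 1
        out.append({**row, "instance_num": counters[k]})
    return out

# Hypothetical fact rows: two identical (patid, concept) pairs must
# get distinct instance_num values.
rows = [{"patid": "p1", "concept": "dx:A"},
        {"patid": "p1", "concept": "dx:A"},
        {"patid": "p1", "concept": "dx:B"}]
tagged = assign_instance_num(rows, ["patid", "concept"])
print([r["instance_num"] for r in tagged])  # → [1, 2, 1]
```

Materializing this alongside the CDM columns in one staging table, as described above, makes the grouping easy to inspect before the final load.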

kzraryan-mu commented 2 years ago

@sxinger Actually, we used the multi fact table approach, so we also created the granular level fact tables like diagnosis, vital ...

sxinger commented 2 years ago

@kzraryan-mu I don't think we will go that far for this OMOP transformation work, as we don't need to put an i2b2 query app on top of the OMOP CDM; we will leverage existing OHDSI tools that are compatible with the OMOP model. We will talk more during our meeting.

mosaa commented 2 years ago

@kzraryan-mu @sxinger @spinkac @vasanthi014 @shossain-mizzou:

First of all, it would be interesting to see if N3C PCORnet CDM -> OMOP CDM can be used out-of-the-box with minor modifications.

Most interesting would be to see if we can model without ETL'ing from PCORnet CDM to OMOP. Our approach could be: creating aliases for synonymous columns, adding columns for OMOP-specific mappings, and adding tables that are not in PCORnet (such as ontology tables).
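The wrapper idea could be prototyped by generating views that alias PCORnet columns to OMOP names rather than copying data. A minimal sketch; the PERSON/DEMOGRAPHIC column mapping shown is illustrative only, not a vetted PCORnet-to-OMOP mapping:

```python
def omop_view_ddl(omop_table, pcornet_table, column_map):
    """Build a CREATE VIEW statement that exposes a PCORnet table
    under OMOP column names, avoiding a physical ETL."""
    select_list = ",\n  ".join(
        f"{src} AS {dst}" for dst, src in column_map.items())
    return (f"CREATE OR REPLACE VIEW {omop_table} AS\n"
            f"SELECT\n  {select_list}\nFROM {pcornet_table};")

# Hypothetical mapping for illustration only.
ddl = omop_view_ddl(
    "OMOP.PERSON", "CDM.DEMOGRAPHIC",
    {"person_id": "PATID", "birth_datetime": "BIRTH_DATE"})
print(ddl)
```

A real prototype would also need the extra OMOP-specific columns and ontology tables mentioned above, which a plain column alias cannot supply.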

Include me in the meeting as well.

mosaa commented 2 years ago

@sxinger: I think @kzraryan-mu is not referring to establishing i2b2 on top of OMOP. He is asking whether OMOP can be a wrapper around the PCORnet CDM. MU's i2b2 is actually a wrapper around the PCORnet CDM (the architecture is interesting because no new ETL is needed).

sxinger commented 2 years ago

@sxinger Actually, we used the multi fact table approach, so we also created the granular level fact tables like diagnosis, vital ...

@mosaa I got the impression from this statement that physical mini-i2b2 fact tables were created. I think the approach N3C took was creating metadata mappings instead of a full-fledged ETL process. The mapping is where Christine got stuck, and we will discuss it on Friday. And yes, step 4 I described in the first comment was to directly ingest the OMOP standardized vocabulary tables. @spinkac, please add @mosaa to our Friday meeting; his insight will be very valuable.

spinkac commented 2 years ago

please add @mosaa to our Friday meeting, his insight will be very valuable.

@sxinger Already done. @vasanthi014, @soliman455, and @shossain-mizzou also invited via @mosaa.

sxinger commented 2 years ago

03/18/2022 meeting outcomes:


  1. Should we create staging tables?

    • Yes. We want to preserve the staging table structure, and at least identify the source table and source columns, given the n-to-n mapping between PCORnet CDM and OMOP.
  2. Based on the N3C P2O transformation scripts, it seems that they create smaller mapping tables for particular tables or columns.

  3. Code review

    • @spinkac will create a new branch for each CDM table transformation script upload, such as "cdm_to_omop_death" (the PCORnet-to-OMOP transformation for the PCORnet death table)
    • After pushing a table to this GitHub repo, @spinkac will tag @sxinger or @kzraryan-mu for code review
    • @kzraryan-mu or @vasanthi014 will show @spinkac how to use GitHub Desktop

spinkac commented 2 years ago

@sxinger @mosaa @kzraryan-mu @vasanthi014 Here is the link from the PATRONUS meeting to their work on the PCORnet-to-OMOP conversion: https://github.com/PEDSnet/pcornetcdm_to_pedsnetcdm

They are currently targeting OMOP 5.4, so that is the suggestion for us moving forward.

mosaa commented 2 years ago

PEDSNet src code seems to be more organized.

spinkac commented 2 years ago

@mosaa Are you able to view their code? Song and I are not able to see it.

sxinger commented 2 years ago

PEDSNet src code seems to be more organized.

I got a 404 from https://github.com/PEDSnet/pcornetcdm_to_pedsnetcdm

mosaa commented 2 years ago

Try navigating from here: https://github.com/PEDSnet; the repository is public.

mosaa commented 2 years ago

@spinkac @sxinger: I was able to do this earlier, but now it's gone. I think it was moved to private status after the meeting.

spinkac commented 2 years ago

@sxinger @mosaa @vasanthi014 @kzraryan-mu I have pushed a bunch of crosswalk tables and the table shells for the OMOP 5.4 tables to the Github repo. If anyone has time to look at them, great, otherwise, they are pretty straightforward.

spinkac commented 2 years ago

@sxinger @mosaa @vasanthi014 @kzraryan-mu One important mapping table that the PATRONUS transformation uses is created from a text file using Python. Is there a way to do something similar for our database?

sxinger commented 2 years ago

@spinkac Is that text file provided? Can you point me to the Python code?

spinkac commented 2 years ago

@sxinger The table is https://github.com/PEDSnet/pcornetcdm_to_pedsnetcdm/blob/main/sql_etl/data/concept_map.txt and the associated python code is https://github.com/PEDSnet/pcornetcdm_to_pedsnetcdm/blob/main/sql_etl/loading_pedsnet/process.py

sxinger commented 2 years ago

@vasanthi014 or @kzraryan-mu can either of you show Christine how to directly stage a text file onto snowflake? or help her to stage it? The text file is: https://github.com/PEDSnet/pcornetcdm_to_pedsnetcdm/blob/main/sql_etl/data/concept_map.txt

sxinger commented 2 years ago

@vasanthi014 or @kzraryan-mu can either of you show Christine how to directly stage a text file onto snowflake? or help her to stage it? The text file is: https://github.com/PEDSnet/pcornetcdm_to_pedsnetcdm/blob/main/sql_etl/data/concept_map.txt

@vasanthi014 @kzraryan-mu oh, just realized that you don't have access to that github repo. @spinkac can you download the text file and save it on our shared folder: \NextGen-BMI - Ogrp - Documents\GROUSE\CDM Enhancement\PCORnet2OMOP

spinkac commented 2 years ago

@vasanthi014 or @kzraryan-mu can either of you show Christine how to directly stage a text file onto snowflake? or help her to stage it? The text file is: https://github.com/PEDSnet/pcornetcdm_to_pedsnetcdm/blob/main/sql_etl/data/concept_map.txt

@vasanthi014 @kzraryan-mu oh, just realized that you don't have access to that github repo. @spinkac can you download the text file and save it on our shared folder: \NextGen-BMI - Ogrp - Documents\GROUSE\CDM Enhancement\PCORnet2OMOP

@sxinger @vasanthi014 @kzraryan-mu The file is now in the folder.

kzraryan-mu commented 2 years ago

I don't have access to either the GitHub or the GROUSE folder under NextGen-BMI - Ogrp - Documents.

By staging the text file, I believe you mean creating a table out of data in CSV/TSV format from a text file. The following URL will guide you on how to stage a CSV into a table. It is the same for text files; just change the file format from CSV to text.

https://github.com/Missouri-BMI/P_2_SASDB_EDC/blob/main/README.md
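For reference, the usual Snowflake flow for staging a delimited text file is a file format definition, a PUT to a stage, and a COPY INTO. The sketch below only builds the command strings; the stage, table, and format names are placeholders, and the file format shown is an assumption about the target file:

```python
def snowflake_stage_commands(local_path, table, stage="@~", delimiter="\\t"):
    """Return the SQL commands for staging a delimited text file into a
    Snowflake table: define a file format, PUT the file to a stage,
    then COPY INTO the target table."""
    return [
        ("CREATE OR REPLACE FILE FORMAT txt_fmt "
         f"TYPE = CSV FIELD_DELIMITER = '{delimiter}' SKIP_HEADER = 1;"),
        f"PUT file://{local_path} {stage};",
        f"COPY INTO {table} FROM {stage} "
        "FILE_FORMAT = (FORMAT_NAME = txt_fmt);",
    ]

cmds = snowflake_stage_commands(
    "concept_map.txt", "PCORNET_TO_OMOP.PEDSNET_PCORNET_VALUESET_MAP")
print(cmds[1])  # → PUT file://concept_map.txt @~;
```

Setting FIELD_DELIMITER to the tab character is what lets Snowflake ingest a tab-delimited file through the same CSV machinery.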

sxinger commented 2 years ago

I don't have access to either the GitHub or the GROUSE folder under NextGen-BMI - Ogrp - Documents.

By Staging the text file, I believe you are expecting to create a table out of some data in CSV/TSV format from a text file. The following URL will guide you on how to stage a CSV into a table. It is the same for text files as well. Just change it from CSV to text.

https://github.com/Missouri-BMI/P_2_SASDB_EDC/blob/main/README.md

Just granted you the access @kzraryan-mu

sxinger commented 2 years ago

@spinkac can you work with @kzraryan-mu to get the text file staged?

spinkac commented 2 years ago

@sxinger @kzraryan-mu I think that Rana is going to need to either help me or do this for me. The text file is tab delimited, and there are commas in the contents. I am not going to be able to convert it to a csv, and I don't have the expertise to change the csv format to a tab delimited without tons of research. If Rana already knows how to do this, that would be a huge help. To clarify - Rana gave a resource to do this with a csv, but I cannot easily modify that to this case without help.
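For what it's worth, no CSV conversion should be needed: tab-delimited parsing leaves embedded commas alone. A tiny check, assuming nothing about the real concept_map.txt beyond its tab delimiter:

```python
import csv
import io

# A made-up line with an embedded comma: splitting on tabs keeps each
# field intact, so the commas in the contents are not a problem.
line = "DEATH_SOURCE\tOT\tOther, not listed\t0\n"
fields = next(csv.reader(io.StringIO(line), delimiter="\t"))
print(fields[2])  # → Other, not listed
```

The same delimiter choice applies on the Snowflake side when defining the file format for the load.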

vasanthi014 commented 2 years ago

@sxinger @kzraryan-mu I think that Rana is going to need to either help me or do this for me. The text file is tab delimited, and there are commas in the contents. I am not going to be able to convert it to a csv, and I don't have the expertise to change the csv format to a tab delimited without tons of research. If Rana already knows how to do this, that would be a huge help.

I will discuss this with @kzraryan-mu and the other engineers during the sprint call on Monday (04/18/22). One of us will be able to help you stage the text file to Snowflake based on availability. @spinkac Feel free to join the call if you are available. The meeting should be on your calendar; let me know if you want me to resend it.

spinkac commented 2 years ago

@vasanthi014 Perfect - I do have it on the calendar and I will plan to join. Thank you!

spinkac commented 2 years ago

@vasanthi014 @sxinger With some help from Song, I think that I have successfully staged the text file. Here's hoping that I got the formats for all of the columns correct.

spinkac commented 2 years ago

@sxinger @vasanthi014 @mosaa @kzraryan-mu In the code provided by PATRONUS, they place the table I just staged in its own schema. Can anyone think of a reason why we would want to do that? I just placed it in the same folder as the other OMOP files that are being created.

mosaa commented 2 years ago

@spinkac: I didn't fully understand the question. Perhaps I am lacking some context. One way to discuss these issues is to join the morning scrum calls and talk them through with the team.

vasanthi014 commented 2 years ago

@spinkac Where did you stage the file? Can you share the table, schema, and database name in Snowflake? Also, what data is present in the file you are referring to? In the above comments, the URLs to the concept_map.txt file are broken.

spinkac commented 2 years ago

@mosaa @vasanthi014 I placed the file in the PCORNET_TO_OMOP schema in the ANALYTICSDB database using the table name pedsnet_pcornet_valueset_map. I think that @mosaa has access to the codes, although apparently we did not ask for access for @vasanthi014. This is a crosswalk table that is used to look up PCORNET codes for the transformation to OMOP (~6,000 lines).

sxinger commented 2 years ago

@spinkac as we briefly talked about today, here is what we decided on for next steps:

  1. test out the transformation script in SAS and decide which route would be faster to implement: SAS vs. SnowSQL
  2. we evaluated the R/Python approach, which would still be database-specific, i.e., it would be equivalent to writing the transformation code in SnowSQL
  3. continue with the approach (SAS or SnowSQL) that can be achieved with the shortest turnaround time