boxysean closed this issue 8 years ago.
Christophe, if you made a version of the dataset using your ETL, the coordinating center can host it on our Amazon instance, and we can expose it via the OHDSI website. Lee Evans can help with those logistics. Thanks for your contribution, this is great!
On Thu, Apr 14, 2016 at 7:18 PM, Christophe Lambert <notifications@github.com> wrote:
Hi Sean,
Thanks for the feedback, and glad it was helpful. Let me respond to your feedback:
- We will look into the confusing message
- The script get_synpuf_files.py used to be in python3 -- in our branch, we converted it to 2.7 for just the reason you mentioned -- consistency. Are you sure you retrieved the unm-improvements branch? The change to 2.7 is documented in the header.
- We didn't know how to get that file either, so we overhauled the program to directly read the OMOP vocabulary files as they come out of the box. Again, are you sure you retrieved the right branch? I can't even find a reference to that file in our branch.
- We did not provide instructions on how to create the OMOP CDM v5 database, as we hadn't got there yet, but I agree it would be helpful to have the full soup-to-nuts instructions.
- Great idea to have a script to run it all.
- I would like to release the results of running the ETL as a zip file as well. It will be quite large -- any suggestions where?
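A run-it-all script along those lines could be little more than a driver chaining the existing steps. A rough sketch -- the two script names are from this repo, but the argument layout shown here is an assumption and should be checked against each script's actual usage:

```python
#!/usr/bin/env python
"""Soup-to-nuts driver sketch: fetch SynPUF samples, then run the ETL.

The script names below come from this repo; the arguments each one takes
are assumed here and must be checked against each script's usage.
"""
import subprocess
import sys

def run(cmd):
    """Run one pipeline step, echoing it first; abort everything on failure."""
    print("-> " + " ".join(cmd))
    if subprocess.call(cmd) != 0:
        sys.exit("step failed: " + " ".join(cmd))

def main(download_dir, output_dir, sample_range):
    # Step 1: fetch the raw DE-SynPUF sample files.
    run(["python", "get_synpuf_files.py", download_dir, str(sample_range)])
    # Step 2: transform them into OMOP CDM v5 load files.
    run(["python", "CMS_SynPuf_ETL_CDM_v5.py", download_dir, output_dir])

if __name__ == "__main__":
    if len(sys.argv) == 4:
        main(sys.argv[1], sys.argv[2], int(sys.argv[3]))
    else:
        print("usage: run_etl.py DOWNLOAD_DIR OUTPUT_DIR SAMPLE_RANGE")
```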
Thanks!
Christophe
Hey @ChristopheLambert, well, false alarm. I was on 94540d02db59bd2ca5c0a3118702a5dcfb3990dc from master, thinking I was on unm-improvements. No wonder you were so confused, oops! :)
Looks like there's a lead as to where to put the output, excellent. I'll close the issue, as most of the rest of what I said doesn't seem to apply. Thanks!
Patrick, we will be sure to do that.
Sean, glad you reached out anyways. Let us know how it works out!
Hi @ChristopheLambert, how big is the SYNPUF CDMV5 dataset that you would like to share?
Do you have a preferred way to transfer it? An FTP server? I can set up a temporary AWS S3 bucket for you to upload the dataset if needed.
You can send me a direct message on the OHDSI forum, or connect and message me on LinkedIn to share the transfer connection details.
Thanks.
Hi @leeevans, we are not finished yet, but estimate it will be 110GB uncompressed, and about 18GB compressed. SFTP would be fine.
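For a transfer that size, splitting the compressed archive into fixed-size parts first can make an interrupted SFTP session much cheaper to resume. A minimal stdlib sketch (function names are illustrative, nothing here is from the repo):

```python
import os

def split_file(path, part_size, chunk_size=1 << 20):
    """Split `path` into numbered `<path>.partNNN` files of at most
    `part_size` bytes each, copying in `chunk_size` pieces so memory
    use stays small even for very large archives. Returns part paths."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            written = 0
            part_path = "%s.part%03d" % (path, index)
            with open(part_path, "wb") as dst:
                while written < part_size:
                    chunk = src.read(min(chunk_size, part_size - written))
                    if not chunk:
                        break
                    dst.write(chunk)
                    written += len(chunk)
            if written == 0:
                os.remove(part_path)  # no data left; drop the empty part
                break
            parts.append(part_path)
            index += 1
    return parts

def join_parts(parts, out_path, chunk_size=1 << 20):
    """Reassemble the parts, in order, into `out_path`."""
    with open(out_path, "wb") as dst:
        for part_path in parts:
            with open(part_path, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    dst.write(chunk)
```

The parts can then be uploaded one at a time over SFTP and reassembled on the receiving side with join_parts (or with cat file.part* > file on a Unix host).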
Hey folks,
First off, thank you for this resource connecting the synthetic dataset to OMOP, this is very helpful for me to evaluate how OMOP can benefit my work. This has saved me a ton of time.
Below is some unsolicited feedback after using the unm-improvements branch to generate sample patient data for my local OMOP CDM instance. I was referred to here from this discussion.
- The get_synpuf_files.py utility was confusing. The README was correct, so doing python output 4 20 worked, but the feedback from the tool was telling me otherwise: output was the INPUT_DIRECTORY, 4 was the OUTPUT_DIRECTORY, and 20 was the SAMPLE_RANGE.
- get_synpuf_files.py is written in python3, but CMS_SynPuf_ETL_CDM_v5.py is python2. It seems like you folks are thinking about which to use, but to me, consistency within a single repo is the most important trait.
- I couldn't figure out how to obtain omop_vocab_xref_0723.txt, so I ended up commenting out the section that builds the mapping xref.
I'm sure there's lots of internal discussion over on your end, but I would suggest possibly the following to make this really useful to the general public:
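That kind of usage-message mismatch is easy to prevent when the positional arguments are declared with the same names, in the same order, that the README documents, since argparse then generates the usage text from the declarations themselves. A hypothetical sketch (argument names are illustrative, not the repo's actual interface):

```python
import argparse

def build_parser():
    # Declaring positionals in the documented order means the generated
    # usage/help text can never drift out of sync with the README.
    parser = argparse.ArgumentParser(
        description="Fetch DE-SynPUF sample files (illustrative sketch).")
    parser.add_argument("output_directory",
                        help="directory the downloaded files are written to")
    parser.add_argument("sample_start", type=int,
                        help="first sample number to fetch")
    parser.add_argument("sample_end", type=int,
                        help="last sample number to fetch")
    return parser

# The invocation from this thread would then be labeled unambiguously:
args = build_parser().parse_args(["output", "4", "20"])
# args.output_directory == "output", args.sample_start == 4, args.sample_end == 20
```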
Thanks again! Super helpful.