ARCH-commons / i2p-transform

i2b2 to PCORnet Common Data Model Transformation - requires i2b2 PCORnet ontology
Other
3 stars 16 forks source link

Monolithic transform file is inconvenient for building tables individually or in parallel #32

Closed njgraham closed 7 years ago

njgraham commented 8 years ago

Sometimes (especially during development/testing) we'd like to rebuild just the PRESCRIBING table, for example. At KU, we have a Jenkins job to run the CDM transform which is good for repeatability and logging but not so good for running select parts of the monolithic transform file.

Splitting the single transform (PCORNetLoader_ora.sql in the Oracle case) would be beneficial for building tables individually and may also provide the opportunity for building certain tables in parallel.

matthewjoss commented 8 years ago

I just spoke to Jeff about this, and he agrees that it is too much work to split the loader into many different files. However he did say that we could parse the run execution part into a new file, if you think that would be helpful. We may also do that anyways.

Are you aware that all of the most relevant parts of the transform are subdivided into stored procedures? The thing is, most of the procedures rely on the execution of other earlier procedures, which makes it more difficult to run the script in a piece-wise manner. For example, the Prescribing procedure relies on the prior execution of the encounter procedure and the demographics procedure. All of this is done automatically when you run the whole script.

If you just want to test the prescribing procedure, you can run the demographics and encounters procedures first (run demographics first, then encounters), then run the prescribing procedure (presuming you aren't dropping your tables in between each test). IE:

begin PCORNetclear; PCORNetDemographic; PCORNetEncounter; PCORNetPrescribing; end; /

This way you do not have to run the entire loader during your tests. Also Jeff mentioned that you can comment out the reliance on certain foreign keys within the prescribing procedure to avoid having to run other procedures, though this is not recommended.

njgraham commented 8 years ago

I am aware that the transform is divided into procedures and that there are dependencies - I've spent at least a few minutes looking at the code :)

One can certainly comment out bits of the code and run subsections manually, but as I mentioned before we've tried to take a more repeatable/automated approach and avoid having someone just run the .sql in SQLDeveloper manually. We do have a workaround - we have implemented sed commands that Jenkins generates from checkboxes in the UI. These checkboxes select which procedures within pcornetloader are removed right before the file is executed. But, that felt a bit hackish and still isn't friendly for parallel execution.

Speaking of parallel execution, I may very well have missed something but it seems like at least some of the transform could be run in parallel? Perhaps something like: AirflowDAG

I've spent a few minutes of hacking on the split_transform branch but I haven't actually tried to run it yet so who knows what dependencies I may have missed - you may be right and it is too much work to actually get it running.

Anyway, I think you're right - just splitting out the execution part sounds like it would achieve the goal. Splitting it up further is more of a style thing I guess. I want to share as much code as possible so I'd definitely want to come up with something we can agree on!

dckc commented 7 years ago

Looks like this is overtaken by #40