FYI @wlongabaugh @G-White-ISB
I'm not quite sure what you mean by 'deprecate the content.' Do you mean to push a new commit to the repo with the old scripts deleted and the notebook added in place? That would make sense.
Do we have any protocols worked out for ETL processes in general? I haven't come across anything in our docs. I think ETL code is as critical as any other code if it generates data that may be used in production.
> Do you mean to push a new commit to the repo with the old scripts deleted and the notebook added in place? That would make sense.
Something like that. We should decide whether to use the notebook as the primary artifact, or have a standalone script.
> Do we have any protocols worked out for ETL processes in general? I haven't come across anything in our docs. I think ETL code is as critical as any other code if it generates data that may be used in production.
No, I don't think we have any protocols. Do you have an example of such a protocol? What would you want it to determine?
Our ETL will partially (mostly?) rely on GHC ETL for extracting metadata into BQ tables. The way I see it, our ETL scripts will just shuffle GHC-generated content around.
We most definitely need to keep track of the code used to do that shuffling in a repository, and ideally we should have code reviews for such critical pieces of code. Those are the basics of the protocol I would suggest.
We can/should discuss this at the Friday meeting.
ISB-CGC ETL is migrating to small task-oriented scripts to do each ETL operation. After trying to use Jupyter notebooks for ETL, we decided to switch to plain scripts driven by a YAML configuration file that can be archived.
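To make the pattern concrete, here is a minimal sketch of what such a config-driven task script could look like (illustrative only; the config keys and the use of the Python BigQuery client are assumptions, not the actual ISB-CGC implementation):

```python
# Illustrative sketch of a YAML-driven ETL task script (hypothetical config
# keys, not the actual ISB-CGC code). Example config file:
#
#   project: idc-dev-etl
#   dest_table: idc-dev-etl.idc.some_table
#   query: SELECT ... FROM ...
#
import argparse
import yaml
from google.cloud import bigquery

def run_task(cfg: dict) -> None:
    # Execute one ETL step: run a query and materialize it into a table.
    client = bigquery.Client(project=cfg["project"])
    job_config = bigquery.QueryJobConfig(
        destination=bigquery.TableReference.from_string(cfg["dest_table"]),
        write_disposition="WRITE_TRUNCATE",
    )
    client.query(cfg["query"], job_config=job_config).result()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("config", help="path to the YAML file describing this task")
    with open(parser.parse_args().config) as f:
        run_task(yaml.safe_load(f))
```

Archiving the YAML file alongside each run would then record exactly what was executed.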
I look forward to learning more details about this to understand what it means.
Along with keeping scripts in a repo, I would think we would also want to keep logs of any ETL process, i.e., these scripts were applied to this table on this date, etc. Unfortunately, Bill will be away for today's meeting.
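One possible shape for such a log, sketched below (the file path, field names, and git-commit helper are all illustrative, not an agreed-upon format):

```python
# Hypothetical sketch: append one JSON record per ETL run, noting which
# script touched which table and when, tied to the exact code revision.
import json
import subprocess
import sys
from datetime import datetime, timezone

LOG_PATH = "etl_provenance.jsonl"  # illustrative; could live in GCS or a BQ table

def get_git_commit() -> str:
    # Record the current commit hash so each run is traceable to exact code.
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def log_run(script: str, table: str) -> None:
    record = {
        "script": script,
        "commit": get_git_commit(),
        "table": table,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    log_run(sys.argv[0], "idc-dev-etl:idc.dicom_all")
```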
On a related note, can we filter out and keep in a log forever all write queries for every table we maintain?
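One way this could work, sketched below (hedged, not a confirmed mechanism): query BigQuery's INFORMATION_SCHEMA jobs view for jobs that wrote to a given table. The project/dataset/table names are from this thread; the view's retention is limited, so truly keeping these records forever would require exporting them, e.g., via a Cloud Logging sink.

```python
# Hedged sketch: list recent jobs that wrote to a given table via BigQuery's
# INFORMATION_SCHEMA.JOBS_BY_PROJECT view (region and names are placeholders;
# assumes google-cloud-bigquery and permission to list jobs in the project).
from google.cloud import bigquery

client = bigquery.Client(project="idc-dev-etl")

sql = """
SELECT creation_time, user_email, statement_type, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE destination_table.dataset_id = 'idc'
  AND destination_table.table_id = 'dicom_all'
ORDER BY creation_time DESC
"""
for row in client.query(sql).result():
    print(row.creation_time, row.user_email, row.statement_type)
```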
I will check on that before the Friday call, George
This is no longer relevant.
Note that the table populated using the output of those scripts does not have the Rows column initialized correctly. Rows needed to be quoted with backticks, since it is a reserved word in BigQuery. This was not considered, and Rows in
idc-dev-etl:idc.dicom_all
is all nulls.
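For reference, a minimal sketch of the quoting that is needed (assumes the Python BigQuery client; the table name is the one above):

```python
# ROWS is a reserved keyword in BigQuery Standard SQL, so the column must be
# backtick-quoted wherever it is referenced; unquoted, the query fails to parse.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT SeriesInstanceUID, `Rows`
FROM `idc-dev-etl.idc.dicom_all`
LIMIT 5
"""
for row in client.query(sql).result():
    print(row)
```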
I made a Colab notebook automating all of the steps in this repository, so I don't think we need to fix it or regenerate the output. I just wanted to make note of this.
You can see the notebook here: https://colab.research.google.com/drive/1PDwmjKNcqoqR0boHqiAW8vZcH9kUqQSl
The resulting table is idc-dev-etl:tcga.dicom_metadata. I wonder if we should deprecate the content of this repo and replace it with the notebook above, to avoid confusion?