Rows column populated incorrectly

fedorov commented 4 years ago

Note that the table populated using the output of those scripts does not have the Rows column initialized correctly. Rows needed to be quoted in tick-marks, since it is a protected word for BQ. This was not considered, and Rows in idc-dev-etl:idc.dicom_all is all nulls.

I made a Colab notebook automating all of the steps in this repository, so I don't think we need to fix it or regenerate the output. I just wanted to make note of this.

You can see the notebook here: https://colab.research.google.com/drive/1PDwmjKNcqoqR0boHqiAW8vZcH9kUqQSl

The resulting table is idc-dev-etl:tcga.dicom_metadata.

I wonder if we should deprecate the content of this repo and replace it with the notebook above, to avoid confusion?

fedorov commented 4 years ago

FYI @wlongabaugh @G-White-ISB

G-White-ISB commented 4 years ago

I'm not quite sure what you mean by 'deprecate the content.' Do you mean to push a new commit to the repo with the old scripts deleted and the notebook added in place? That would make sense.

Do we have any protocols worked out for ETL processes in general? I haven't come across anything in our docs. I think ETL code is as critical as any other code if it generates data that may be used in production.

fedorov commented 4 years ago

Do you mean to push a new commit to the repo with the old scripts deleted and the notebook added in place? That would make sense.

Something like that. We should decide if we should use notebook as the primary, or have a standalone script.

Do we have any protocols worked out for ETL processes in general? I haven't come across anything in our docs. I think ETL code is as critical as any other code if it generates data that may be used in production.

No, I don't think we have any protocols. Do you have an example of such protocol? What would you want to be determined by that protocol?

Our ETL will partially (mostly?) rely on GHC ETL for extracting metadata into BQ tables. The way I see it, our ETL scripts will just shuffle GHC-generated content around.

We most definitely need to keep track of the code used to do that shuffling in a repository, and ideally we should have code reviews for such critical pieces of code. That's the basics of the protocol I would suggest.

We can/should discuss at the Friday meeting.

wlongabaugh commented 4 years ago

ISB-CGC ETL is migrating to small task-oriented scripts to do each ETL operation. After trying to use Jupyter notebooks for ETL, we decided to go to straight scripts driven by a yaml configuration file that can be archived.

fedorov commented 4 years ago

I look forward to learn more details about this to understand what it means.

G-White-ISB commented 4 years ago

Along with keeping scripts in a repo I would think we would also want to keeps logs of any ETL process, ie these scripts were applied to this table on this date etc. Unfortunately Bill will be away for today's meeting.

fedorov commented 4 years ago

On a related note, can we filter out and keep in a log forever all write queries for every table we maintain?

G-White-ISB commented 4 years ago

I will check on that before the Friday call, George

On Fri, Feb 14, 2020 at 5:42 PM Andrey Fedorov notifications@github.com wrote:

On a related note, can we filter out and keep in a log forever all write queries for every table we maintain?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ImagingDataCommons/ETL/issues/2?email_source=notifications&email_token=AN54ORZK57WYKFFIXC6PDY3RC5CA5A5CNFSM4KUETSXKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEL26MVA#issuecomment-586540628, or unsubscribe https://github.com/notifications/unsubscribe-auth/AN54ORYGR7OAQ2DYAKIGMZ3RC5CA5ANCNFSM4KUETSXA .

fedorov commented 4 years ago

this is not relevant anymore at this point

ImagingDataCommons / ETL

Rows column populated incorrectly #2