d3b-center / OpenPedCan-analysis

The analysis repository for the Open Pediatric Cancer Project
https://d3b-center.github.io/OpenPedCan-analysis/

Create automated workflow for `histologies-base.tsv` #391

Closed jharenza closed 10 months ago

jharenza commented 2 years ago

What data file(s) does this issue pertain to?

histologies-base-adapt.tsv
histologies-base.tsv

What release are you using?

v11

Put your question or report your issue here.

@HuangXiaoyan0106 - can you create a workflow based on this QC code to automate generation of histologies-base.tsv from histologies-base-adapt.tsv (the histologies file from the D3b warehouse)?

If possible, can we use this repo and not use the folder structure as in d3b-codes?

Please let me know if you have any questions.

cc @aadamk

HuangXiaoyan0106 commented 2 years ago

@jharenza This week I'm working on wrapping up the clinical report CWL. Probably I'll start working on this next week.

jharenza commented 2 years ago

Thanks!

HuangXiaoyan0106 commented 2 years ago

@jharenza A few points I'd like to confirm with you:

  1. You just want to run add_v11_updates.py with the files in the input folder to generate the histology file, right? The other scripts (01-samples_to_add.R, 02-path_dx_mapping.R...) will not be run. In short: histologies-base-adapt.tsv ---python add_v11_updates.py---> histologies-base.tsv

  2. Which table does histologies-base-adapt.tsv refer to? There are prod_reporting.openpedcan_histologies and prod_reporting.pbta_histologies in DW. Or does it refer to another one?

  3. How often should this QC run? Once, daily, weekly, or monthly?

  4. Where do you want to put the histologies-base.tsv file? The D3b warehouse or an S3 bucket? The GitHub repo is not a good place.

jharenza commented 2 years ago

> @jharenza A few points I'd like to confirm with you:
>
> 1. You just want to run add_v11_updates.py with the files in the input folder to generate the histology file, right? The other scripts (01-samples_to_add.R, 02-path_dx_mapping.R...) will not be run. In short: histologies-base-adapt.tsv ---python add_v11_updates.py---> histologies-base.tsv

Actually, no. We only want to run the scripts called by the shell script. The Python code was used to generate histologies files for TCGA, TARGET, GTEx, and GMKF NBL, which were uploaded to the Data Tracker and then pulled into the warehouse, so they do not need to be regenerated with this code.

> 2. Which one does histologies-base-adapt.tsv refer to? There are prod_reporting.openpedcan_histologies and prod_reporting.pbta_histologies in DW. Or refer to another one?

prod_reporting.openpedcan_histologies

> 3. The frequency to do this QC? Once? daily? weekly? monthly?

There are two steps to this QC, which may not be immediately clear. I am pasting them below:

histologies-file-generation.pdf

We should create a weekly QC which runs the histologies-base-adapt.tsv against itself each week.

We will also want a release-based QC (in theory we can probably run this monthly for now) that generates histologies-base.tsv and compares it to the histologies.tsv of the previous release, in this case OpenPedCan v11. Once v12 is released, the comparison target would change to v12.
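
The comparison step shared by both QCs could be sketched roughly as follows. This is only an illustration, not the code in the existing QC scripts: it assumes both files are tab-separated and keyed by a unique `Kids_First_Biospecimen_ID` column, and the helpers `load_rows` and `compare_histologies`, the `cancer_group` column, and the toy rows are all hypothetical.

```python
import csv
import io


def load_rows(tsv_text, key):
    """Index the rows of a TSV (given as text) by the key column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[key]: row for row in reader}


def compare_histologies(old_text, new_text, key="Kids_First_Biospecimen_ID"):
    """Report added, removed, and changed rows between two histologies files.

    The key column name is an assumption; adjust it to the real identifier.
    """
    old, new = load_rows(old_text, key), load_rows(new_text, key)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = {}
    for sample_id in sorted(old.keys() & new.keys()):
        # Collect every column whose value differs between the two files.
        cols = sorted(
            col
            for col in old[sample_id].keys() | new[sample_id].keys()
            if old[sample_id].get(col) != new[sample_id].get(col)
        )
        if cols:
            changed[sample_id] = cols
    return {"added": added, "removed": removed, "changed": changed}


# Toy stand-ins for last week's pull (or the previous release) and the new file.
previous = "Kids_First_Biospecimen_ID\tcancer_group\nBS_1\tA\nBS_2\tB\n"
current = "Kids_First_Biospecimen_ID\tcancer_group\nBS_1\tA\nBS_2\tC\nBS_3\tD\n"

report = compare_histologies(previous, current)
print(report)
```

For the weekly QC, `previous` would be the prior pull of histologies-base-adapt.tsv. For the release-based QC, it would be the histologies.tsv of the last release, compared against the newly generated histologies-base.tsv.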

> 4. Where do you want to put the histologies-base.tsv file? D3b warehouse or s3 bucket? The github repo is not a good place.

I think until we have a process, maybe S3 for now. Can we create a specific histologies folder in 538745987955 (since this one is not a public bucket)?

HuangXiaoyan0106 commented 2 years ago

@jharenza I opened a PR: https://github.com/d3b-center/histologies-qc/pull/1. Please review it when you get a chance.

jharenza commented 10 months ago

This is now further automated in the histologies-qc repo.