Re-run focal-cn-file-preparation with subtyped V22

sjspielman commented 2 years ago

This PR re-runs the focal-cn-file-preparation module with V22 after subtyping has been performed. As discussed in https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1455#issuecomment-1163114816, this module does need to be re-run with the subtypes.

sjspielman commented 2 years ago

Noting this closes #1501

jharenza commented 2 years ago

@sjspielman I reran this on EC2. Comparing my local files from this PR:

harenzaj@38f9d38f36c9 results % md5sum *
7b2be9b3ce6d7be9dd2ec9f4b3c7372c  cnvkit_annotated_cn_autosomes.tsv.gz
ca024a823bd2816643f66437f6e7d144  cnvkit_annotated_cn_x_and_y.tsv.gz
fa0adcd26f408d840f25339d2a1ea09f  consensus_seg_annotated_cn_autosomes.tsv.gz
9589cd18d0e6c1f7c2d939126c63ec6c  consensus_seg_annotated_cn_x_and_y.tsv.gz
e8a789ea6f1c36c2ff5824a0129be1fd  consensus_seg_focal_cn_recurrent_genes.tsv
4b5cc072e3abf6f63b4b71901c302af3  consensus_seg_most_focal_cn_status.tsv.gz
d640ddf2193b2b4227d306414c849d6b  consensus_seg_recurrent_focal_cn_units.tsv
9e594dc648c9ec8d800cee86948ae8fd  consensus_seg_with_ucsc_cytoband_status.tsv.gz
6d443819228658c0a768e236aa2715ec  controlfreec_annotated_cn_autosomes.tsv.gz
70a8e9b09b5b0d4670e8146d612f8dbe  controlfreec_annotated_cn_x_and_y.tsv.gz

with files I got on EC2:

root@cef3727cf5b5:/home/rstudio/OpenPBTA-analysis/analyses/focal-cn-file-preparation/results# md5sum *
7b2be9b3ce6d7be9dd2ec9f4b3c7372c  cnvkit_annotated_cn_autosomes.tsv.gz
ca024a823bd2816643f66437f6e7d144  cnvkit_annotated_cn_x_and_y.tsv.gz
fa0adcd26f408d840f25339d2a1ea09f  consensus_seg_annotated_cn_autosomes.tsv.gz
9589cd18d0e6c1f7c2d939126c63ec6c  consensus_seg_annotated_cn_x_and_y.tsv.gz
e8a789ea6f1c36c2ff5824a0129be1fd  consensus_seg_focal_cn_recurrent_genes.tsv
4b5cc072e3abf6f63b4b71901c302af3  consensus_seg_most_focal_cn_status.tsv.gz
67a34fcc09ee1bc35c28b85b581dca33  consensus_seg_recurrent_focal_cn_units.tsv
e4968ae8add1db39735e0bbddeb34fbc  consensus_seg_with_ucsc_cytoband_status.tsv.gz
6d443819228658c0a768e236aa2715ec  controlfreec_annotated_cn_autosomes.tsv.gz
70a8e9b09b5b0d4670e8146d612f8dbe  controlfreec_annotated_cn_x_and_y.tsv.gz

It looks like there are diffs in consensus_seg_recurrent_focal_cn_units.tsv and consensus_seg_with_ucsc_cytoband_status.tsv.gz, though interestingly, only the latter was changed in your PR. I might need some more help evaluating whether these files are expected to change every run - cc @jaclyn-taroni and @jashapiro
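To make this kind of comparison less error-prone than reading two checksum listings by eye, here is a minimal sketch that reports which files differ between two saved `md5sum` listings. The function name `diff_checksums` and the listing file names are hypothetical and not part of the module; the sketch assumes each listing was saved with something like `md5sum * > local.md5` and that file names contain no spaces.

```shell
# Hypothetical helper: print the names of files whose checksums differ
# between two saved `md5sum` listings (each line: "<checksum>  <filename>").
# $1: first listing (e.g. local.md5), $2: second listing (e.g. ec2.md5).
# Files that appear only in the first listing are not reported.
diff_checksums() {
  awk '
    # First file: remember the checksum recorded for each file name.
    NR == FNR { sum[$2] = $1; next }
    # Second file: report names that are new or whose checksum changed.
    !($2 in sum) || sum[$2] != $1 { print $2 }
  ' "$1" "$2"
}
```

Applied to the two listings above, `diff_checksums local.md5 ec2.md5` should report only consensus_seg_recurrent_focal_cn_units.tsv and consensus_seg_with_ucsc_cytoband_status.tsv.gz.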

jaclyn-taroni commented 2 years ago

We might expect anything using the cytoband status to change because of: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/2f2d41186a0a922173f24fa7a8b5ce672bd2ee7f/analyses/focal-cn-file-preparation/run-bedtools.snakemake#L70

jaclyn-taroni commented 2 years ago

It would probably be helpful to understand how exactly consensus_seg_recurrent_focal_cn_units.tsv differs between the local run and the run on EC2. It's harder to comment on that with just the knowledge that the checksums are different. I'd also encourage folks to dig into the history of the results for this module: https://github.com/AlexsLemonade/OpenPBTA-analysis/commits/master/analyses/focal-cn-file-preparation/results That might give you some clues about what to expect.

jharenza commented 2 years ago

Ok, the differences are in the short_histology - metastases rows are removed in my run, one embryonal tumor row added.

> dim(focal_jo)
[1] 13671     5
> dim(focal_steph)
[1] 13684     5
> unique(focal_jo$short_histology)
 [1] "All"             "Ependymoma"      "HGAT"            "Medulloblastoma" "LGAT"            "Meningioma"      "Embryonal tumor" "ATRT"            "Schwannoma"     
[10] "GNT"             "Ganglioglioma"  
> unique(focal_steph$short_histology)
 [1] "All"             "Ependymoma"      "HGAT"            "Medulloblastoma" "LGAT"            "Metastases"      "Meningioma"      "ATRT"            "Schwannoma"     
[10] "GNT"             "Ganglioglioma"

I thought maybe I ran the module one way (I ran the default without specifying a param) and @sjspielman may have run it another way. From consensus_seg_most_focal_cn_status.tsv.gz, the input used to create consensus_seg_recurrent_focal_cn_units.tsv, I checked the short_histology of the bs_ids in base and final hist. In both files, the same bs_ids are annotated as Metastases (N=3). In base, there are 13 embryonal tumors and in final hist, 11. Additionally, the script generating this file uses pbta-histologies.tsv:

https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/5908b3c87d1714c105ceffbfdf5b36615aa71bca/analyses/focal-cn-file-preparation/06-find-recurrent-calls.Rmd#L65

Since there is a cytoband column, it's possible that if consensus_seg_with_ucsc_cytoband_status.tsv.gz changes slightly on every run (because the downloaded file is different), then this file may also change. Do you think that is reasonable, @sjspielman @jaclyn-taroni?
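For reference, the short_histology comparison above can also be done outside R. The following is a minimal sketch, assuming an uncompressed, tab-separated file with a header row; the function name `count_by_column` is hypothetical and not part of the module.

```shell
# Hypothetical helper: count rows per value of a named column in a TSV.
# $1: column name (e.g. short_histology), $2: path to an uncompressed TSV.
# Prints "<count><TAB><value>" lines, sorted by value.
count_by_column() {
  awk -F '\t' -v name="$1" '
    # Header row: find the index of the requested column.
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      if (!col) exit 1
      next
    }
    # Data rows: tally each value seen in that column.
    { n[$col]++ }
    END { for (v in n) print n[v] "\t" v }
  ' "$2" | LC_ALL=C sort -t "$(printf '\t')" -k 2
}
```

Saving the output for each run's consensus_seg_recurrent_focal_cn_units.tsv to a file and comparing the two with `diff` would show which short_histology values gained or lost rows.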

sjspielman commented 2 years ago

@jharenza I ran it without any special settings: bash analyses/focal-cn-file-preparation/run-prepare-cn.sh. I'm going to re-run it today to see if diffs occur again relative to my own previous run, and try to take a closer look. Something that might help: can you send me your result files, maybe over Google Drive or something? That way I can also compare.

jharenza commented 2 years ago

> @jharenza I ran it without any special settings: bash analyses/focal-cn-file-preparation/run-prepare-cn.sh. I'm going to re-run it again today to see if diffs occur again from my own previous run, too and try to have more a closer look. Something that might help - can you send me your result files maybe over google drive or something? That way I can also compare.

That's how I ran it, too. Sure - put the two files which differ here

sjspielman commented 2 years ago

I re-ran this module, and there are no diffs at all from my first V22 run (except a recompiled notebook with various JS differences). But I am noticing my docker image on my EC2 instance may be a bit out of date at 3 months old. I'm going to update the docker image and run again to see how that goes.

sjspielman commented 2 years ago

@jharenza I may have made some progress here! Re-pulling the docker image had no effect, but I deleted and re-downloaded the V22 data and got more diffs. I suspect something (but not everything?) may have timed out on the EC2 instance when I had previously downloaded V22. Either way, back on track now!

I compared consensus_seg_with_ucsc_cytoband_status.tsv.gz, and our rows are in different orders but results are the same. I also compared consensus_seg_recurrent_focal_cn_units.tsv and we are the same there too!
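Because row order alone changes the checksum of a file, one way to check for this case is to checksum the sorted contents rather than the file itself. This is a minimal sketch and not necessarily how the comparison above was done; the function name `sorted_md5` is hypothetical, and it assumes `gzip`, `sort`, and `md5sum` are available.

```shell
# Hypothetical helper: checksum of a gzipped text file that ignores row order.
# $1: path to a .gz file. The header line is sorted along with the data rows,
# which is fine when the only goal is an equality check.
sorted_md5() {
  gzip -dc "$1" | LC_ALL=C sort | md5sum | cut -d ' ' -f 1
}
```

If `sorted_md5` gives the same value for both copies of consensus_seg_with_ucsc_cytoband_status.tsv.gz, the files contain the same rows and differ only in order.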

jharenza commented 2 years ago

Awesome!

Re-run focal-cn-file-preparation with subtyped V22 #1479