Fix v15 breaking changes

jaclyn-taroni commented 4 years ago

To quote the release notes being added in #569, we're changing the names of well-enough-used columns in the clinical file:

Change disease_type_old to pathology_diagnosis and disease_type_new to integrated_diagnosis per request in comment.

I know this change to pbta-histologies.tsv will break a number of things. The purpose of this issue is to track what will need to be changed as a result. Not only will the column names need to be updated, but we will also need to rerun any notebooks, change documentation, etc.

Anticipated issues

Here I'll list what I know needs to change in modules that are not deprecated.

Some of the modeling steps of gene-set-enrichment-analysis use disease_type_new: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/gene-set-enrichment-analysis/02-model-gsea.Rmd#L123 Luckily the gsva_anova_tukey function is already flexible! https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/gene-set-enrichment-analysis/util/hallmark_models.R#L35
The first step of interaction-plots uses the disease_type_new column to generate lists of samples: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/interaction-plots/scripts/01-disease-specimen-lists.R#L97 Documentation associated with that option will also need to change.
We filter out ATRT and MB samples in molecular-subtyping-embryonal using disease_type_old https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/molecular-subtyping-embryonal/01-samples-to-subset.Rmd#L128 and check disease_type_new as well: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/molecular-subtyping-embryonal/01-samples-to-subset.Rmd#L136 Also in this module, we use both the disease_type columns in the subtyping and generating final tables quite a bit starting around https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/molecular-subtyping-embryonal/04-table-prep.Rmd#L325 The README for this module needs to change as well + this documentation: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/molecular-subtyping-embryonal/02-generate-subset-files.R#L6
The subset files step of molecular-subtyping-EPN uses disease_type_new https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/molecular-subtyping-EPN/00-subset-for-EPN.R#L55
In molecular-subtyping-HGG, we use disease_type_new quite a bit for classification based on defining lesions, here's just one example from the first notebook: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/molecular-subtyping-HGG/01-HGG-molecular-subtyping-defining-lesions.Rmd#L59
disease_type_new is one of the "layers" associated with all of the plotting in sample-distribution-analysis: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/sample-distribution-analysis/01-filter-across-types.R, https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/sample-distribution-analysis/02-multilayer-plots.R and is used in the tables generated in 03-tumor-descriptor-and-assay-count: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/sample-distribution-analysis/03-tumor-descriptor-and-assay-count.Rmd#L206
selection-strategy-comparison includes consideration of disease_type_new: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/selection-strategy-comparison/01-selection-strategies.rmd#L169 We may want to just deprecate this analysis at this point rather than try to maintain it?

Issues that have arisen as part of #576

molecular-subtyping-chordoma fails with the following:

Quitting from lines 159-168 (01-Subtype-chordoma.Rmd) 
    Error in .f(.x[[i]], ...) : object 'SMARCB1' not found
    Calls: <Anonymous> ... <Anonymous> -> vars_rename_eval -> map_if -> map -> .f

That's from this chunk: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/molecular-subtyping-chordoma/01-Subtype-chordoma.Rmd#L166 I suspect what is actually happening is that there are no chordoma samples in the expression data used in CI and this step https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/molecular-subtyping-chordoma/01-Subtype-chordoma.Rmd#L147 We may want to take an approach that is similar to other subtyping modules and have the first step be a script that generates files that consist only of chordoma samples that are committed to the repository.

The Add Shatterseek step of sv-analysis, which is Rscript analyses/sv-analysis/02-shatterseek.R fails with:

Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'scratch/sv-vcf/BS_K07KNTFY_withoutYandM.tsv': No such file or directory
Execution halted

analyses/sv-analysis/02-shatterseek.R uses an independent specimen file, which is included in its entirety in CI, to read in files:

https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/analyses/sv-analysis/02-shatterseek.R#L48

The step that would have generated scratch/sv-vcf/BS_K07KNTFY_withoutYandM.tsv comes prior to this one in CI

https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/286ff25022930024bb9812e3cfad5410a2cf49c8/.circleci/config.yml#L142

So it will only have access to the subsetted Manta file. See https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/449#issuecomment-576021400 and https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/449#issuecomment-580441012 for more context. The sv-analysis module should be more robust to "missing" samples.

Next steps

@cansavvy @cbethell @jashapiro I'd recommend splitting this up such that modifications to each module are in separate pull requests so you can go through any make sure you catch any documentation stuff I may not have come across.

jaclyn-taroni commented 4 years ago

Because of the extent of these breaking changes, I believe it's prudent to handle each module in a separate pull request as stated above. Unfortunately, that means that CI will fail for a bunch of these fixes until the last one goes in. So here is the general procedure I think we should follow:

First and foremost, state which module you'll be working on on this issue so we don't duplicate effort!
When you file a pull request for a particular module, you should comment out all the modules that come before it in CI to quickly demonstrate that your fix works. This should be done in a single commit.
Once you're changes have been reviewed, you've made any changes as the result of review, and approved, revert the single commit that commented out all the modules before the step your PR pertains to.
We should merge the fix without the check passing.

This procedure has a significant weakness in that there may be changes introduced in any one fix that will cause CI to fail unexpectedly once the final fix goes in and this issue is closed. Once this issue gets closed, #569 should be updated such that it is in sync with master and that will test all steps in CI (except fusion-summary but that is tracked in #578). Any additional CI fixes can go into the AlexsLemonade:update-release-docs-v15 branch provided that they are small in scope.

cansavvy commented 4 years ago

So we can keep track of progress on these, I took @jaclyn-taroni 's list above and made it into a checklist. We can claim items and then check things off as we fix them. I'll start by claiming this first item. I'll put the PR number next to it too when I get it filed.

v15 breaking changes TODO list

[x] gene-set-enrichment-analysis - @cansavvy #585
[x] interaction-plots - @jashapiro #582
[x] molecular-subtyping-embryonal - @cbethell #591
- 01-samples-to-subset.Rmd
- 04-table-prep.Rmd
- docs in 02-generate-subset-files.R
- README
[x] molecular-subtyping-EPN - @jashapiro #592
[x] molecular-subtyping-HGG - @cbethell #586
[x] sample-distribution-analysis @jashapiro #584
- 01-filter-across-types.R
- 02-multilayer-plots.R
- 03-tumor-descriptor-and-assay-count
[x] selection-strategy-comparison - deprecated in #589 @jashapiro
[x] molecular-subtyping-chordoma error - @cansavvy #590
[x] sv-analysis 02-shatterseek.R error - @cansavvy #587 (no actual changes were made this was a false alarm. See comments below).

jaclyn-taroni commented 4 years ago

I just realized that the sv-analysis failure may simply be due to the fact that I had commented out the first script in that module. So the scope of that fix may be to group those near each other in .circleci/config.yml or to add a shell script to that module.

cansavvy commented 4 years ago

Okay. Well I just started working on it now, I'll see if that's it.

cansavvy commented 4 years ago

@jaclyn-taroni you were right. It is fine if the first script is ran. #587

jaclyn-taroni commented 4 years ago

Okay 👍 - would love to see those organized such that the step that was failing was immediately after the step that it depends on (perhaps after v15 is out out). I think that would have increased the chances I noticed that immediately.

cansavvy commented 4 years ago

Okay 👍 - would love to see those organized such that the step that was failing was immediately after the step that it depends on (perhaps after v15 is out out). I think that would have increased the chances I noticed that immediately.

I was about to just make this change when I had the branch open but I didn't know if there were particular sequential orders to some of the other tests and didn't want to throw another possible wrench in our testings here. But yeah, we may even want to have a bash script that calls both and make it one CircleCI test.

cansavvy commented 4 years ago

@jaclyn-taroni In regards to selection-strategy-comparison and your comment:

We may want to just deprecate this analysis at this point rather than try to maintain it?

I don't know enough about this analysis module to make an informed decision on this. Do we want to retire it though?

jaclyn-taroni commented 4 years ago

@jashapiro - what do you think, time to retire selection-strategy-comparison ?

jashapiro commented 4 years ago

I think it can be deprecated. Will do that now.

jashapiro commented 4 years ago

We did it, everyone! Changes incorporated to master in #569

AlexsLemonade / OpenPBTA-analysis