Closed: sjspielman closed this 2 years ago
It seems like the shuffling is happening without a random seed being set. With low sample sizes, large fluctuations across shuffles are expected. In these cases, we typically recommend shuffling several times (~100) and plotting the shuffle range.
If the model is trained on the shuffled X matrix data only once (and applied to real test data y), then we typically don't see large fluctuations.
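For context, the repeated-shuffle recommendation can be sketched like this. This is a toy example with made-up data and a stand-in logistic regression model, not the actual tp53_nf1_score classifier; the point is only that with a small cohort, a single shuffled AUC is noisy, so reporting the range over ~100 shuffles is more informative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2023)   # fixed seed so the sketch is reproducible
X = rng.normal(size=(60, 20))       # toy feature matrix (small cohort)
y = rng.integers(0, 2, size=60)     # toy binary labels

# Fit once on the real data, then score against row-shuffled inputs
model = LogisticRegression(max_iter=1000).fit(X, y)

aucs = []
for _ in range(100):
    X_shuf = rng.permutation(X, axis=0)   # shuffle rows of the X matrix
    aucs.append(roc_auc_score(y, model.predict_proba(X_shuf)[:, 1]))

# Report the range rather than a single shuffled AUC
print(round(min(aucs), 2), "-", round(max(aucs), 2))
```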
Having a look at the relevant script, it appears a random seed is set for numpy
here:
https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/bfb7d798d37ab490392821f57c8bcf7b39a5a092/analyses/tp53_nf1_score/01-apply-classifier.py#L56
And a numpy method is used to perform the shuffling as well. I'm going to explore this a bit further to see if this seed is indeed properly applied.
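As a quick sanity check of the seed-then-shuffle pattern: calling `np.random.seed` does govern a subsequent `np.random.shuffle`, since both use numpy's global legacy RNG. The function name and seed value below are illustrative, not taken from `01-apply-classifier.py`:

```python
import numpy as np

def shuffled_copy(X, seed=123):
    """Return a row-shuffled copy of X, seeding numpy's global RNG first."""
    np.random.seed(seed)    # seeds the global legacy RandomState
    X = X.copy()
    np.random.shuffle(X)    # shuffles rows in place, drawing from that RNG
    return X

X = np.arange(12).reshape(4, 3)
run1 = shuffled_copy(X)
run2 = shuffled_copy(X)
# Same seed -> identical shuffle order across runs
print(np.array_equal(run1, run2))
```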
Edit: I've run the module a couple times now, and have thoughts!
I am also thinking about why I see these initial diffs. I'm now realizing it's possibly because in #1461 I did not use the flag OPENPBTA_TP53_FIGURES=1. This flag is necessary to use the data in data/ instead of data from inside other modules. I'm using this flag today, but I wasn't before.
https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/bfb7d798d37ab490392821f57c8bcf7b39a5a092/analyses/tp53_nf1_score/run_classifier.sh#L19
Therefore, it might not hurt for me to submit a fresh PR that runs the module and sets the proper flag. @jharenza what do you think? This will ensure the current version of the module was run with the right flags, and I can also add helpful documentation to the module about the flag and why/when to use it, including updating this fairly deprecated comment: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/bfb7d798d37ab490392821f57c8bcf7b39a5a092/analyses/tp53_nf1_score/run_classifier.sh#L9-L10
I don't think we'll be getting away from this 0.38 AUC without shuffling 1000x or so as @gwaybio suggested, but I'm hesitant to change the module at this point vs. contextualize the low AUC for this supplemental figure in the manuscript.
> it might not hurt for me to submit a fresh PR that runs the module and sets the proper flag
I think this sounds like a good idea. We set up that flag during the v22 release because I found that the module was using other modules' data as input; specifically, `collapse-rnaseq` was not up to date and was missing samples, so running the module with defaults was not capturing scores for all RNA samples.
@jharenza I've re-run this with the updated TP53. I have a question for you as well - what do you think of the current axis labels for the scatterplots? Specifically, I was thinking maybe they should include the +1, e.g. TERC log(FPKM+1). What do you think?
Sure, we can do that, since TERT has a lot of 0s.
@jharenza to be clearer, this calculation was already performed, but the label didn't reflect it.
Oh! yes, then we should update the label!
@jharenza Updated to log2(FPKM+1)! I went ahead and specified log2 as well, consistent with the calculations in the script.
This code does not run through CI so we can merge before checks.
Closes #1492. This PR re-runs the script figures/scripts/supp-S5-panels.R to re-generate the three S5 panels with V22. The scatterplots don't have any changes (same R and P-values), but the ROC plot shows a rather different AUC of 0.32 for the shuffled group. This is lower than one would typically expect the lower bound of an AUC to be, but may not be unreasonable for the shuffled group.