AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project
Other
99 stars 67 forks source link

Zenodo CSVs for oncoprint plots #1713

Closed sjspielman closed 1 year ago

sjspielman commented 1 year ago

Ever towards #1692

This PR adds some CSVs for oncoprint plots, in Figure 2 and S3b. There are 5 new CSVs total for each of the oncoprints.

I exported data that roughly matches what was described in https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/1692#issuecomment-1479951510, except I made it in a "long" format for more interoperability.

FGFR1 H3F3A BRAF
Sample1 KIAA1549--BRAF
Sample2 p.K28M
Sample3 Amplification

To accomplish this, I modified the function that creates the maf objects to also create an analogous data frame that can be exported. Note that I also at one point made sure to keep broad_histology_display inside of histologies_df so that column could make it into export. The figure 2 CSV files are also compressed because 1 of them was 20 MB; this is well below github's limit, but I wanted to keep overall size low here. If we want, I can just gz that particular file, or I can not gz anything.

jharenza commented 1 year ago

Hi @sjspielman! Thanks for starting this. I think that these tables are huge for a few reasons: 1) We want them to show only what is in the oncoprint, that is, the top 20 altered genes (from the goi list for the histology) and samples with those alterations. We could add all samples, including ones not shown, and just have NA or blanks in those genes' spaces for those samples. You might have to apply the get_histology_goi function, or subset by the goi for that oncoprint and these variant classifications. 2) Looks like all variant classifications are included here, but I am actually not seeing fusions or CNVs - we would want to subset to only the variant classifications which get displayed - eg there are a ton of genes with unknown symbol and which are IGR classification not being displayed. IGR shouldn't be in the output per 1 above. 3) There are duplicate rows. I think this is happening because there are multiple alterations (eg two missense) for one sample in the same gene.

I don't think the clinical info is 100% necessary for this table, so the wide format can still be accomplished, but I suppose I can be convinced of long as well - long is back to something like a MAF format without all of the columns. I suggested a wide format oncomatrix since it is a format which can be input to pedcbio or complexheatmap for oncoprint generation. There are two ways to make the oncomatrix - one is to just use general terms (more suitable for oncoprint generation) and one is to write the exact alterations per gene (would need wrangling to be useful for own oncoprint generation). Based on the review, I think we should do the latter, so long or wide probably doesn't matter (ie exact protein change for SNV, Amp/Del for CNV, fusion partners for fusion, as in the comment above). I think reviewer wants to be able to query which samples have KIAA1549--BRAF fusions or which samples have BRAF V600E?

One other thought is we can theoretically export all alterations in any histology-specific goi rather than top 20. I can go either way on this since we plan to export the entire alteration matrix (of the combined goi, or full MAF??, not sure we decided or discussed) for all samples as a separate table.

Sorry for the long explanation, but hopefully that makes sense.

sjspielman commented 1 year ago

Thanks for the feedback, @jharenza! I've started updating this. I should have more simply been using bind_rows() instead of _join() before, and I've filtered to GOIs.

I'd like to bring up a related point before going further - that while running this through I actually don't see 20 genes in the end for each GOI list Below I've just copied out of R console so you can see; I realize looking at Figure 2 though, it should have been clear to us already that there are sometimes <20!

The reason is from this line in the GOI function: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/868c3bfd6fbcb6c806935398a3e929f7f181c12a/figures/scripts/fig2_figS3-oncoprint-landscape.R#L70

Many rows of N=1 here lead to <20 genes being shown. I see two solutions (CC @jaclyn-taroni):


LGG

7 NA genes

[1] "BRAF"     "KIAA1549" "FGFR1"    "NTRK2"    "PIK3CA"   "QKI"      "KRAS"     "RAF1"    
[9] "MYB"      "ROS1"     "TP53"     "ATRX"     "FGFR2"    NA         NA         NA        
[17] NA         NA         NA         NA    

HGG

0 NA genes

[1] "TP53"   "H3F3A"  "ATRX"   "PIK3CA" "MET"    "NF1"    "EGFR"   "TERT"   "PTPN11"
[10] "PTEN"   "PPM1D"  "ACVR1"  "PDGFRA" "NTRK3"  "BCOR"   "PIK3R1" "ROS1"   "IGF1R" 
[19] "KIT"    "ATM"   

Embryonal

0 NA genes

[1] "CTNNB1"   "DDX3X"    "KDM6A"    "KMT2C"    "KMT2D"    "TP53"     "SMARCA4"  "PTCH1"   
[9] "KSR2"     "TERT"     "MYCN"     "ATM"      "CDK6"     "TCF4"     "ZIC1"     "PRKAR1A" 
[17] "MIR512-2" "APC"      "CTDNEP1"  "FOXO3"   

Other CNS

5 NA genes

[1] "CTNNB1"   "C11orf95" "RELA"     "NF2"      "EWSR1"    "FGFR1"    "TACC1"    "YAP1"    
 [9] "ERBB4"    "QKI"      "PDGFRA"   "BRAF"     "MAML2"    "MAMLD1"   "SETD2"    NA        
[17] NA         NA         NA         NA     
jharenza commented 1 year ago

Thanks @sjspielman! Totally fine with the path of least resistance- updating captions!

sjspielman commented 1 year ago

@jashapiro I've requested another look from you here, since I found a bug when going back through. I had been selecting Variant_Type but it should have been Variant_Classification. Now there are actual alterations in the output CSVs instead of all NA!

sjspielman commented 1 year ago

Code does not run through CI, so can merge before checks are complete. Also noting that from separate discussion, we are going to keep the tables in long form for more interoperability with any downstream platform.