AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Add OpenPBTA v21 (GitHub release 1) publication study to PedcBio #1185

Closed jharenza closed 2 years ago

jharenza commented 2 years ago

Description

This template is used to start a request to load or update a study onto the Kids First PedcBioPortal

Common - any new study REQUIRED

  1. If this is the first time being loaded, please fill out the cbio meta_study info, for instance:

    type_of_cancer: brain
    cancer_study_identifier: openpbta
    name: Open Pediatric Brain Tumor Atlas (OpenPBTA)
    description: The Open Pediatric Brain Tumor Atlas (OpenPBTA) Project is a global open science initiative led by <a href="https://www.ccdatalab.org/">Alex's Lemonade Stand Childhood Cancer Data Lab (CCDL)</a> and <a href="https://www.chop.edu/">Children's Hospital of Philadelphia's</a> <a href="https://d3b.center/">Center for Data-Driven Discovery</a> to comprehensively define the molecular landscape of tumors of 943 patients from the <a href="http://cbtn.org">Children's Brain Tumor Network</a> and the <a href="http://www.pnoc.us/">Pacific Pediatric Neuro-oncology Consortium</a>through real-time, <a href="https://github.com/AlexsLemonade/OpenPBTA-analysis">collaborative analyses</a> and <a href="https://github.com/AlexsLemonade/OpenPBTA-manuscript>collaborative manuscript writing</a> on GitHub. 
    
    short_name: openpbta
  2. Provide an example of a sample ID that can be used to tie together DNA and RNA (if applicable), aka a "somatic event ID":

  3. Load/access control:

    • [ ] Load in QA Only
    • [x] Load in Prod
    • [ ] DO NOT LOAD AS PUBLIC. USE GROUP NAME:

Kids First/PBTA

Publication/Collaboration

Publication is self-explanatory; a Collaboration study would be something like OpenPBTA, OpenTargets, or another custom request

Please provide the following:

  1. A link to the paper (if applicable): https://alexslemonade.github.io/OpenPBTA-manuscript/
  2. Link(s) or a description of where to find the genomic data to load. Acceptable types of data to load are

Data will be added here: s3://kf-openaccess-us-east-1-prd-pbta/data/pedcbio/ by @runjin326

  1. Patient metadata that is available s3://kf-openaccess-us-east-1-prd-pbta/data/pedcbio/pbta-histologies.tsv (v21)

  2. Sample metadata that is available

QA Review

Revisit this section once the project is loaded onto QA, as a minimum push-to-prod and/or close-ticket checklist

jharenza commented 2 years ago

@migbro will work on this once the data are in place

jharenza commented 2 years ago

Also note: please use the same columns (except remove the cohort column) and nomenclature as with the OpenTargets project, derived from the histologies file v21 rather than the data service or D3b WH, so that we stay in sync with our GitHub release. Thank you!

jharenza commented 2 years ago

@migbro, what will your timeline for this be? thank you!

migbro commented 2 years ago

I'll start work on this this week

migbro commented 2 years ago

Hmmm, working on loading this project. I am trying to use the same sample ids I used for pbta_all load for this one. For sample 7316-466 there were two RNA samples. For pbta_all I loaded BS_SHZZ99DT, which is totalRNAseq, ribo-depleted, but openPBTA used the polyA one. @jharenza I am guessing you will stick with that?

migbro commented 2 years ago

Also, the merged maf file is missing the typical #version 2.4 header line. I can easily work around that with a simple flag, but not sure if that is intentional.
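The missing header can be restored after merging with a small check. The following is a minimal sketch, not the script actually used for this load; the function name and the example rows (including the sample ID `BS_EXAMPLE01`) are hypothetical.

```python
MAF_VERSION_HEADER = "#version 2.4"


def ensure_maf_version_header(lines):
    """Return the MAF lines with a '#version' line first, adding one if absent."""
    lines = list(lines)
    # Leave the file untouched when a version line is already present.
    if lines and lines[0].strip().startswith("#version"):
        return lines
    return [MAF_VERSION_HEADER + "\n"] + lines


# Hypothetical merged MAF content: column header plus one record, no version line.
merged = ["Hugo_Symbol\tTumor_Sample_Barcode\n", "TP53\tBS_EXAMPLE01\n"]
fixed = ensure_maf_version_header(merged)
```

Calling the function on a file that already has the header returns it unchanged, so it is safe to run on both merged and unmerged files.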

runjin326 commented 2 years ago

@migbro, oh yes, I forgot to add the header line after merging the file - I can fix that. For the RNA sample that you mentioned, this is weird: when I use the histology file, it looks like this sample only has one RNA-Seq, which is poly-A and BS_0VXZCRJS. And BS_SHZZ99DT is not in the histology file or the gene expression RDS. Any idea how you got the sample loaded?

migbro commented 2 years ago

So, it's not that I had loaded the RDS. What I had done was use the pbta_all data_sample_sheet to harmonize sample naming (it's a long story) in cBio so that DNA and RNA data can be tied together 1-to-1. That polyA version was not in the pbta_all cBio study, just the stranded ribo-depleted run of that same sample, as currently we are not loading technical replicates. So for now, when there is a conflict, I will just ignore that and load what you are actually using.

runjin326 commented 2 years ago

@migbro, gotcha! Thanks! I have now uploaded the fixed SNV in the box.

migbro commented 2 years ago

Ok, loading on to QA now 🎉, will update when completed. Also:

Needed to get rid of \ in pathology free text for BS_8ZS9F31R

This issue still remains! Luckily I have a note about that from when I did the openTARGET load, so I was able to avoid this.
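One way to handle the stray backslash before loading is to strip it from the affected column of the tab-separated metadata. This is a rough sketch under stated assumptions: the column name `pathology_free_text`, the sample ID, and the free-text value are all hypothetical placeholders, and the real histologies file may need more careful parsing.

```python
def strip_backslashes(tsv_text, column):
    """Remove backslashes from one column of tab-separated text."""
    lines = tsv_text.split("\n")
    header = lines[0].split("\t")
    idx = header.index(column)  # raises ValueError if the column is missing
    cleaned = [lines[0]]
    for line in lines[1:]:
        if not line:
            # Preserve empty lines, such as the one after a trailing newline.
            cleaned.append(line)
            continue
        fields = line.split("\t")
        fields[idx] = fields[idx].replace("\\", "")
        cleaned.append("\t".join(fields))
    return "\n".join(cleaned)


# Hypothetical two-column metadata with a backslash in the free-text field.
raw = "sample_id\tpathology_free_text\nBS_EXAMPLE01\tfree text with a \\ inside\n"
cleaned = strip_backslashes(raw, "pathology_free_text")
```

Only the named column is modified, so identifiers and other fields pass through untouched.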

migbro commented 2 years ago

@runjin326 it's up! https://kf-strides-cbioportal-qa.kidsfirstdrc.org/study/summary?id=openpbta Please take a look and let me know if you see anything odd.

runjin326 commented 2 years ago

@migbro, thanks so much! Looks cool to me but @jharenza can better assess if there is anything odd ;)

migbro commented 2 years ago

I do notice fewer variant calls...that's because of the consensus method of using 3/3 + hotspots, right?

runjin326 commented 2 years ago

I would assume so - since I did confirm the number of rows being correct after merging the two MAF files.

jharenza commented 2 years ago

Thanks! I'll review tonight 😀

migbro commented 2 years ago

Although, I just realized I forgot the seg file load. I can do that tomorrow.

jharenza commented 2 years ago

> Hmmm, working on loading this project. I am trying to use the same sample ids I used for pbta_all load for this one. For sample 7316-466 there were two RNA samples. For pbta_all I loaded BS_SHZZ99DT, which is totalRNAseq, ribo-depleted, but openPBTA used the polyA one. @jharenza I am guessing you will stick with that?

hey @migbro - I had to do some digging, but this sample is actually one of those "polyA + stranded" libraries that BGI sequenced in error, as discussed in this ticket, and which were added in release v12 but subsequently taken out in v13: https://github.com/AlexsLemonade/OpenPBTA-analysis/issues/373 - So, 1) if they are still visible in the DS (should be ok), we definitely need to annotate them as poly A prep AND stranded prep, and 2) you probably should just use the polyA bs_id for the pbta_all study.

jharenza commented 2 years ago

Hi @migbro - I reviewed this. Can you add the column experimental_strategy to the front page, which would denote whether the DNA is WGS or WXS? Our oncoprints mostly match for:

but pretty off for diffuse astrocytic tumors - I think this is because I cannot select WGS only samples, which I need to do here to mimic OpenPBTA (we'll see).

Minor edit for the description: add a space between Pacific Pediatric Neuro-oncology Consortium and through.

Thank you!

migbro commented 2 years ago

> > Hmmm, working on loading this project. I am trying to use the same sample ids I used for pbta_all load for this one. For sample 7316-466 there were two RNA samples. For pbta_all I loaded BS_SHZZ99DT, which is totalRNAseq, ribo-depleted, but openPBTA used the polyA one. @jharenza I am guessing you will stick with that?

> hey @migbro - I had to do some digging, but this sample is actually one of those "polyA + stranded" libraries that BGI sequenced in error, as discussed in this ticket, and which were added in release v12 but subsequently taken out in v13: #373 - So, 1) if they are still visible in the DS (should be ok), we definitely need to annotate them as poly A prep AND stranded prep, and 2) you probably should just use the polyA bs_id for the pbta_all study.

So what you are saying is that you think the one you currently have in the histologies file (BS_0VXZCRJS) is the best representative, and that the one I used in pbta_all, BS_SHZZ99DT, ought to be replaced with it? There were actually 39 samples that I saw in openPBTA but not in pbta_all, 14 of which were because the library was listed in data service as polyA and we preferred ribo-depleted, stranded. I have attached that information: in_openpbta_not_pbta_all.csv pbta_all_used.csv
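A comparison like the one described above can be sketched as a set difference over the biospecimen IDs of the two studies. This is an illustrative sketch only, not the code used to produce the attached CSVs; the function name and the shared ID `BS_SHARED001` are hypothetical, while the other two IDs come from the discussion above.

```python
def ids_only_in_first(first_ids, second_ids):
    """Return the sorted IDs that appear in the first collection but not the second."""
    return sorted(set(first_ids) - set(second_ids))


# BS_SHARED001 is a made-up ID standing in for samples present in both studies.
openpbta_ids = ["BS_0VXZCRJS", "BS_SHARED001"]
pbta_all_ids = ["BS_SHZZ99DT", "BS_SHARED001"]

in_openpbta_not_pbta_all = ids_only_in_first(openpbta_ids, pbta_all_ids)
in_pbta_all_not_openpbta = ids_only_in_first(pbta_all_ids, openpbta_ids)
```

Running the difference in both directions shows which ID each study chose when a sample has more than one RNA library.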

migbro commented 2 years ago

@runjin326 when you have a chance, can you give me a brief description of how the consensus maf is made? For instance, for D3b we have:

"Consensus calls from strelka2, mutect2, lancet, and VarDict Java. Two or more callers required to pass, < 0.001 frequency in gnomAD, and min read depth 8 in normal sample"

Although, I realize I forgot to mention hotspots in my description

migbro commented 2 years ago

Nm, I found this: https://github.com/AlexsLemonade/OpenPBTA-analysis/tree/master/analyses/snv-callers and remembered all that hot spot work, so I came up with: "Consensus calls from strelka2, mutect2, lancet. All three callers must agree unless the variant falls in a TERT promoter or hotspot region (see https://www.cancerhotspots.org)". If that's not correct, just let me know!
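Read literally, the one-line description above amounts to the rule sketched below. This is a deliberate simplification for illustration and not the OpenPBTA implementation: the function and parameter names are hypothetical, and requiring at least one caller for the hotspot/TERT exception is an assumption. The snv-callers and hotspot modules remain the authoritative definition.

```python
REQUIRED_CALLERS = frozenset({"strelka2", "mutect2", "lancet"})


def passes_consensus(callers, in_hotspot=False, in_tert_promoter=False):
    """Apply the simplified rule: 3/3 callers, or a hotspot/TERT promoter call."""
    called_by = set(callers) & REQUIRED_CALLERS
    # All three callers agree.
    if called_by == REQUIRED_CALLERS:
        return True
    # Assumption: at least one caller must still report the variant
    # for the hotspot or TERT promoter exception to apply.
    return bool(called_by) and (in_hotspot or in_tert_promoter)


all_three = passes_consensus(["strelka2", "mutect2", "lancet"])
two_callers = passes_consensus(["strelka2", "mutect2"])
hotspot_rescue = passes_consensus(["mutect2"], in_hotspot=True)
```

Compared with a "two or more callers" rule, this stricter 3/3 requirement is consistent with the smaller number of variant calls noted earlier in the thread.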

jharenza commented 2 years ago

Hi @migbro - is this for release notes? Can you just link to the consensus and hotspot modules within OpenPBTA? You will also have to do this with CNVs for consensus and Fusions for putative oncogenic. There are a few nuances and the very brief description may not suffice.

migbro commented 2 years ago

It's more for the study meta files. For instance:

cancer_study_identifier: openpbta
stable_id: mutations
profile_name: Mutations
profile_description: Consensus calls from strelka2, mutect2, lancet.  All three callers must agree unless the variant falls in a TERT promoter or hotspot region (see https://www.cancerhotspots.org)
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
show_profile_in_analysis_tab: true
data_filename: data_mutations_extended.txt

Each data type has this corresponding file. I can't remember where one sees these descriptions...
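A meta file in this `key: value` layout can be read into a dictionary for a quick sanity check before loading. The sketch below is a minimal illustration; the parser and the list of expected keys are assumptions taken from the example above, not from the cBioPortal loader itself.

```python
EXPECTED_KEYS = [
    "cancer_study_identifier", "stable_id", "profile_name", "profile_description",
    "genetic_alteration_type", "datatype", "show_profile_in_analysis_tab",
    "data_filename",
]


def parse_meta(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # partition keeps later colons (e.g. in URLs) inside the value.
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


meta_text = """cancer_study_identifier: openpbta
stable_id: mutations
profile_name: Mutations
profile_description: Consensus calls from strelka2, mutect2, lancet.  All three callers must agree unless the variant falls in a TERT promoter or hotspot region (see https://www.cancerhotspots.org)
genetic_alteration_type: MUTATION_EXTENDED
datatype: MAF
show_profile_in_analysis_tab: true
data_filename: data_mutations_extended.txt
"""

meta = parse_meta(meta_text)
missing = [key for key in EXPECTED_KEYS if key not in meta]
```

An empty `missing` list means every key from the example is present; any description edits can then be made in one place and written back out.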

migbro commented 2 years ago

Ok, I have updated the study with experimental strategy and fixed the description typos.

migbro commented 2 years ago

@jharenza ok, I have updated the study. If there is nothing else, I shall push it to prod

runjin326 commented 2 years ago

@migbro, I am in the process of checking the CNV file loaded and there might be an update on the file being used. I will ping you once I complete that. Maybe we can wait to push to prod? Thanks!

jharenza commented 2 years ago

study is up here - thanks @migbro and @runjin326 !!