d3b-center / ticket-tracker-OPC

A repo to generate and track tickets for ped OT

Updated analysis: Addition of DGD samples to v11 histology file #251

Status: Closed (aadamk closed 2 years ago)

aadamk commented 2 years ago

What analysis module should be updated and why?

v11 histology file

What changes need to be made? Please provide enough detail for another participant to make the update.

Addition of all MAF and fusion metadata corresponding to DGD subjects.

What input data should be used? Which data were used in the version being updated?

Attached file - derived from an 11/16/2021 pull of the DGD-genomics-file-manifest table (bix-workflows schema) merged with the 11/16/2021 pull of the pbta-histologies table (prod-reporting schema), both from the D3b warehouse.

pbta-histologies-dgd.txt

Note: cohort designation varies due to consent. Certain subjects that came through DGD may have been consented under the CBTN study, and have been designated as such. Those designated as 'DGD' under cohort have been consented under a DGD-only protocol and consented to be used for research under a D3b study.

When do you expect the revised analysis will be completed?

Who will complete the updated analysis?

@ewafula @sangeetashukla

ewafula commented 2 years ago

Thank you, @aadamk! @runjin326, in the OT histologies file, both D3b study cohort designations (CBTN and PNOC) are renamed to PBTA. I am assuming that DGD will also be renamed to PBTA. I'll have this integrated into the v11 histologies by tomorrow.

jharenza commented 2 years ago

Thanks @aadamk and @ewafula for working on this!

I am assuming that DGD will also be renamed to PBTA. I'll have this integrated into the v11 histologies by tomorrow.

@ewafula we should keep DGD as the DGD cohort; this will be a pan-cancer group of samples.

@aadamk for those consented under CBTN, I think it makes sense for us to continue using DGD as the cohort, as the cohort is specific to the sequencing strategy and the data source. So for instance, one patient may be CBTN for RNA-Seq and WGS, but DGD for fusion and DNA panel. We would traditionally keep these separate (such as with GMKF and TARGET overlapping samples), and then we would update the independent specimens module to account for the redundancy. The question is, do we want to treat the analysis of cohorts separately?

Alternatively, we might think of samples existing in both CBTN and DGD as a special case because of the clinical nature of the assays. We might think about updating an existing module or creating a new one to account for clinical mutations and fusions, in which we combine the data from research and clinical assays. This could work like hotspot scavenging: we add clinical calls to the existing MAF and fusion files if they are not already present. Measuring the mutation and fusion frequencies here may be a bit tricky.
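A minimal sketch of the "add clinical calls if not already present" idea, under stated assumptions: calls are plain dictionaries using MAF-style column names, and `participant_id` is a hypothetical patient-level key shared by research and clinical rows (the thread does not say how the two would be linked). The `call_source` tag is also an invention for illustration, and the coordinates are made up.

```python
from typing import Dict, List, Tuple

Call = Dict[str, str]

# MAF-style column names, except participant_id, which is a hypothetical
# patient-level key shared by research and clinical rows.
KEY_COLUMNS = (
    "participant_id",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)


def variant_key(call: Call) -> Tuple[str, ...]:
    """Identify a variant within a patient, regardless of which assay called it."""
    return tuple(call[column] for column in KEY_COLUMNS)


def add_missing_clinical_calls(research: List[Call], clinical: List[Call]) -> List[Call]:
    """Return all research calls plus clinical calls that are not already present."""
    seen = {variant_key(call) for call in research}
    merged = [dict(call, call_source="research") for call in research]
    for call in clinical:
        key = variant_key(call)
        if key not in seen:
            seen.add(key)
            merged.append(dict(call, call_source="clinical"))
    return merged


# Made-up coordinates, for illustration only.
research_calls = [
    {"participant_id": "P1", "Chromosome": "chr7", "Start_Position": "100",
     "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"},
]
clinical_calls = [
    # Same variant as the research call: skipped.
    {"participant_id": "P1", "Chromosome": "chr7", "Start_Position": "100",
     "Reference_Allele": "A", "Tumor_Seq_Allele2": "T"},
    # Called only by the clinical assay: appended.
    {"participant_id": "P1", "Chromosome": "chr17", "Start_Position": "200",
     "Reference_Allele": "C", "Tumor_Seq_Allele2": "G"},
]
merged_calls = add_missing_clinical_calls(research_calls, clinical_calls)
```

Tagging each row with its source would let a frequency calculation decide whether clinical-only calls count toward the numerator, which is the tricky part noted above.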

While we are thinking about this, consider we also have GMKF NBL tumor-only WXS for the same tumors in which we have T/N WGS. The idea for this design was that WGS calls would validate WXS calls which have deeper coverage. We have a somewhat similar but different design for some PNOC (WGS and WXS), and in the independent specimen list, we preferentially take WXS calls for mutation frequency tables. We could preferentially take DGD panel calls as a higher confidence set than WGS, but then we lose any calls not within the panel BED regions.
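The preferential selection described here could be sketched as a ranked choice per patient. This is not the independent specimens module's actual logic; the column names (`participant_id`, `biospecimen_id`, `experimental_strategy`), the strategy labels, and the preference order are all assumptions for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Specimen = Dict[str, str]

# Hypothetical preference order: panel first, then WXS, then WGS.
PREFERENCE = ("Targeted Sequencing", "WXS", "WGS")


def pick_independent(specimens: List[Specimen],
                     preference: Sequence[str] = PREFERENCE) -> List[Specimen]:
    """Keep one specimen per participant, choosing the most preferred strategy."""
    rank = {strategy: i for i, strategy in enumerate(preference)}
    best: Dict[str, Tuple[int, Specimen]] = {}
    for specimen in specimens:
        participant = specimen["participant_id"]
        # Strategies missing from the preference list rank last.
        r = rank.get(specimen["experimental_strategy"], len(preference))
        if participant not in best or r < best[participant][0]:
            best[participant] = (r, specimen)
    return [specimen for _, specimen in best.values()]


specimens = [
    {"participant_id": "P1", "biospecimen_id": "BS_1", "experimental_strategy": "WGS"},
    {"participant_id": "P1", "biospecimen_id": "BS_2", "experimental_strategy": "WXS"},
    {"participant_id": "P2", "biospecimen_id": "BS_3", "experimental_strategy": "WGS"},
    {"participant_id": "P1", "biospecimen_id": "BS_4", "experimental_strategy": "Targeted Sequencing"},
]
chosen = pick_independent(specimens)
```

With this order, a patient with panel, WXS, and WGS data contributes only the panel specimen, which is exactly where calls outside the panel BED regions would be lost.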

Thoughts/brainstorm around this?

Cc @afarrel @runjin326 @logstar

ewafula commented 2 years ago

There is no information for broad_histology, short_histology, and cancer_group in the DGD histologies file. I can infer some, but not all, of them from the pathology_diagnosis column by referencing the v10 histologies. Where can I get these histology terms, specifically cancer_group, which is used in most of the OT analysis modules?

> df <- read_tsv("~/Downloads/pbta-histologies-dgd.txt")
> unique(df$pathology_diagnosis)
  [1] "Thyroid Gland Papillary Carcinoma"                                                      "Acute Myeloid Leukemia"                                                                
  [3] "B Lymphoblastic Leukemia/Lymphoma"                                                      "Atypical Teratoid Rhabdoid Tumor (ATRT)"                                               
  [5] "Low-grade glioma/astrocytoma (WHO grade I/II)"                                          "T Lymphoblastic Leukemia/Lymphoma"                                                     
  [7] "Metastatic secondary tumors;Sarcoma"                                                    "Ependymoma"                                                                            
  [9] "Ganglioglioma"                                                                          "Choroid plexus papilloma"                                                              
 [11] "B Acute Lymphoblastic Leukemia"                                                         "High-grade glioma/astrocytoma (WHO grade III/IV)"                                      
 [13] "Undifferentiated Round Cell Sarcoma"                                                    "Brainstem glioma- Diffuse intrinsic pontine glioma"                                    
 [15] "Other"                                                                                  "Not Reported"                                                                          
 [17] "Meningioma"                                                                             "Medulloepithelioma"                                                                    
 [19] "Rhabdomyosarcoma"                                                                       "Cavernoma;Ependymoma"                                                                  
 [21] "Craniopharyngioma"                                                                      "Embryonal Rhabdomyosarcoma"                                                            
 [23] "Choroid plexus carcinoma"                                                               "Recurrent Primary Mediastinal (Thymic) Large B-Cell Lymphoma"                          
 [25] "Medulloblastoma"                                                                        "Pilocytic Astrocytoma"                                                                 
 [27] "Neuroblastoma"                                                                          "Recurrent B Lymphoblastic Lymphoma"                                                    
 [29] "Alveolar Rhabdomyosarcoma"                                                              "Dysembryoplastic neuroepithelial tumor (DNET)"                                         
 [31] "Thrombocytopenia"                                                                       "B Lymphoblastic Leukemia/Lymphoma with t(12;21)(p13.2;q22.1); ETV6-RUNX1"              
 [33] "Langerhans Cell histiocytosis"                                                          "Wilms Tumor"                                                                           
 [35] "Hodgkin's Disease, Nodular Sclerosis, Lymphocyte Predominance"                          "Noonan Syndrome"                                                                       
 [37] "Ewings Sarcoma"                                                                         "Angiomatoid Fibrous Histiocytoma"                                                      
 [39] "Osteosarcoma"                                                                           "Neuroblastoma of the Adrenal Gland and Sympathetic Nervous System"                     
 [41] "Acute Lymphoblastic Leukemia in Remission"                                              "Glial-neuronal tumor NOS"                                                              
 [43] "T Acute Lymphoblastic Leukemia"                                                         "Mixed Germ Cell Tumor"                                                                 
 [45] "High Risk Neuroblastoma"                                                                "Sarcoma"                                                                               
 [47] "Thyroid Gland Follicular Carcinoma"                                                     "Metastatic secondary tumors"                                                           
 [49] "Malignant peripheral nerve sheath tumor (MPNST);Neurofibroma/Plexiform"                 "Ewing Sarcoma"                                                                         
 [51] "Anaplastic Large Cell Lymphoma"                                                         "B Lymphoblastic Leukemia/Lymphoma with Hyperdiploidy"                                  
 [53] "Dysgerminoma"                                                                           "Atypical Vascular Lesion"                                                              
 [55] "Pineoblastoma"                                                                          "Chronic Phase Chronic Myelogenous Leukemia, BCR-ABL1 Positive"                         
 [57] "Hamartoma"                                                                              "Dysplasia/Gliosis;Low-grade glioma/astrocytoma (WHO grade I/II)"                       
 [59] "Angiosarcoma"                                                                           "Adenocarcinoma"                                                                        
 [61] "Neutropenia"                                                                            "Sacrococcygeal Teratoma"                                                               
 [63] "Supratentorial or Spinal Cord PNET"                                                     "Neurocytoma"                                                                           
 [65] "Acute Lymphoblastic Leukemia"                                                           "Myeloid Neoplasm"                                                                      
 [67] "Undifferentiated (Embryonal) Sarcoma"                                                   "Ganglioglioma;Other"                                                                   
 [69] "Recurrent Acute Myeloid Leukemia"                                                       "Schwannoma"                                                                            
 [71] "Inflammatory Myofibroblastic Tumor"                                                     "Mixed Phenotype Acute Leukemia, B/Myeloid, Not Otherwise Specified"                    
 [73] "Mature B-Cell Non-Hodgkin Lymphoma"                                                     "Hemangioblastoma"                                                                      
 [75] "Therapy-Related Acute Myeloid Leukemia"                                                 "Solid Pseudopapillary Neoplasm of the Pancreas"                                        
 [77] "Plexiform Neurofibroma"                                                                 "High Grade Sarcoma"                                                                    
 [79] "Chordoma"                                                                               "B Lymphoblastic Lymphoma"                                                              
 [81] "Metastatic Thyroid Gland Papillary Carcinoma"                                           "Shwachman-Diamond Syndrome"                                                            
 [83] "Mixed Phenotype Acute Leukemia, T/Myeloid, Not Otherwise Specified"                     "Cyst"                                                                                  
 [85] "Recurrent Acute Lymphoblastic Leukemia"                                                 "Metastatic Melanoma"                                                                   
 [87] "Neurofibroma/Plexiform"                                                                 "Oligodendroglioma"                                                                     
 [89] "Typical Acute Promyelocytic Leukemia"                                                   "B-Lymphoblastic Leukemia/Lymphoma with Intrachromosomal Amplification of Chromosome 21"
 [91] "Acute Myeloid Leukemia in Remission"                                                    "Infantile Fibrosarcoma"                                                                
 [93] "Thrombocytosis"                                                                         "Langerhans Cell Histiocytosis"                                                         
 [95] "T Lymphoblastic Lymphoma"                                                               "Acute Megakaryoblastic Leukemia"                                                       
 [97] "Immature Ovarian Teratoma"                                                              "Teratoma"                                                                              
 [99] "Sinus Histiocytosis with Massive Lymphadenopathy"                                       "Aplastic Anemia"                                                                       
[101] "Desmoid Fibromatosis"                                                                   "Stage IV High Grade Burkitt-Like Lymphoma"                                             
[103] "Atypical Teratoid/Rhabdoid Tumor"                                                       "Subependymal Giant Cell Astrocytoma (SEGA)"                                            
[105] "Recurrent Neuroblastoma"                                                                "Hepatoblastoma"                                                                        
[107] "Non-Cancer Diagnosis"                                                                   "Juvenile Xanthogranuloma"                                                              
[109] "Bone Marrow Cellularity, CTCAE"                                                         "Burkitt Lymphoma"                                                                      
[111] "Dysembryoplastic Neuroepithelial Tumor"                                                 "Follicular Variant Thyroid Gland Papillary Carcinoma"                                  
[113] "Hepatocellular Carcinoma"                                                               "Ganglioneuroblastoma"                                                                  
[115] "Hypocellular Bone Marrow"                                                               "Adrenal Gland Neuroblastoma"                                                           
[117] "Type II Pleuropulmonary Blastoma"                                                       "Chronic Myeloid Leukemia Pathway"                                                      
[119] "Burkitt-Like Lymphoma with 11q Aberration"                                              "Malignant Peripheral Nerve Sheath Tumor"                                               
[121] "Sinonasal Adenocarcinoma"                                                               "Intrahepatic Cholangiocarcinoma"                                                       
[123] "Metastatic secondary tumors;Other"                                                      "Vascular Malformation"                                                                 
[125] "Hepatoblastoma with Combined Fetal and Embryonal Epithelial Differentiation"            "Neuroblastic Tumor"                                                                    
[127] "Germinoma;Other;Teratoma"                                                               "Adrenal Mass"                                                                          
[129] "Myelodysplastic Syndrome"                                                               "Burkitt Leukemia"                                                                      
[131] "Germinoma"                                                                              "Sclerosing Rhabdomyosarcoma"                                                           
[133] "CIC-DUX4 Sarcoma"                                                                       "Paraganglioma"                                                                         
[135] "Retinoblastoma"                                                                         "Lymphadenitis"                                                                         
[137] "Mature T-Cell and NK-Cell Non-Hodgkin Lymphoma"                                         "Glioblastoma Multiforme Pathway"                                                       
[139] "Pleuropulmonary Blastoma"                                                               "Acute Myeloid Leukemia with Gene Mutations"                                            
[141] "Germinoma;Teratoma"                                                                     "Primary Myelofibrosis"                                                                 
[143] "Diffuse Midline Glioma, H3 K27M-Mutant"                                                 "Klippel-Trenaunay-Weber Syndrome"                                                      
[145] "Nodular Fasciitis"                                                                      "Conventional Osteosarcoma"                                                             
[147] "B Lymphoblastic Leukemia/Lymphoma, Not Otherwise Specified"                             "Ovarian Sertoli-Leydig Cell Tumor"                                                     
[149] "Abdominal Inflammatory Myofibroblastic Tumor"                                           "Low Grade Glioma"                                                                      
[151] "Undifferentiated Malignant Neoplasm"                                                    "Dysplasia/Gliosis"                                                                     
[153] "NULL"                                                                                   "Spinocerebellar Ataxia Type 7"                                                         
[155] "Pre-B Acute Lymphoblastic Leukemia"                                                     "Neuroendocrine Carcinoma"                                                              
[157] "Bone Marrow Failure"                                                                    "Spindle Cell Neoplasm"                                                                 
[159] "Germ Cell Tumor"                                                                        "Benign Vascular Neoplasm"                                                              
[161] "Histiocytosis"                                                                          "Retinal Neuroblastoma"                                                                 
> 

Cc @aadamk, @jharenza, @afarrel, @runjin326, @logstar

runjin326 commented 2 years ago

@ewafula, all three fields that you mentioned should come from molecular subtyping. Basically, what we get from DW is a base file, and we add information from molecular subtyping to it. I will look into this and give you detailed instructions on how to proceed.

ewafula commented 2 years ago

Ok. Thanks, @runjin326

jharenza commented 2 years ago

@ewafula and @runjin326 - it looks like we will want to do some backend work to update the pathology_diagnosis for this cohort. @aadamk are you able to work with ADAPT/CRU on doing with DGD what we did with PNOC? That is, we moved the initial pathology_diagnosis to pathology_free_text_diagnosis and harmonized pathology_diagnosis to be current with the CBTN available selections for brain tumors. It looks like some of them are consistent; however, multiple versions of ATRT jumped out at me.

For the non-CNS diagnoses, we may have to go through and manually assign cancer_group. @runjin326 if you can pull these out into a Google Doc and want to give it a shot, I can review. Then, we can use that as a sort of "mapping file" for adding cancer_group. I anticipate this will happen with nearly every pan-cancer cohort going forward, so we should think about a way to tackle this during histology file generation. Perhaps we should create one mapping file of pathology_diagnosis or harmonized_diagnosis to cancer_group (a Google Doc/TSV that we continually add to, converted to a JSON file) and use that, rather than keeping the logic in the code? Thoughts?
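One way the mapping-file idea might look, as a sketch only: the JSON content and the cancer_group values below are placeholders, not curated assignments, and the case-insensitive match is an added assumption, motivated by diagnoses in the list above that differ only in capitalization (for example the two spellings of Langerhans cell histiocytosis).

```python
import json
from typing import Dict, List, Optional, Tuple

# Placeholder mapping content; real values would come from the reviewed mapping file.
MAPPING_JSON = """
{
  "Langerhans Cell Histiocytosis": "Langerhans cell histiocytosis",
  "Neuroblastoma": "Neuroblastoma"
}
"""


def normalize(diagnosis: str) -> str:
    """Collapse whitespace and case so near-identical spellings share one key."""
    return " ".join(diagnosis.split()).casefold()


def load_mapping(text: str) -> Dict[str, str]:
    """Parse the JSON mapping and index it by normalized diagnosis."""
    return {normalize(key): value for key, value in json.loads(text).items()}


def assign_cancer_group(rows: List[dict],
                        mapping: Dict[str, str]) -> Tuple[List[dict], List[str]]:
    """Add cancer_group to each row and report diagnoses with no mapping."""
    unmapped = set()
    assigned = []
    for row in rows:
        diagnosis = row["pathology_diagnosis"]
        group: Optional[str] = mapping.get(normalize(diagnosis))
        if group is None:
            unmapped.add(diagnosis)
        assigned.append(dict(row, cancer_group=group))
    return assigned, sorted(unmapped)


rows = [
    {"pathology_diagnosis": "Langerhans Cell histiocytosis"},
    {"pathology_diagnosis": "Langerhans Cell Histiocytosis"},
    {"pathology_diagnosis": "Cyst"},
]
assigned_rows, unmapped_diagnoses = assign_cancer_group(rows, load_mapping(MAPPING_JSON))
```

Returning the unmapped diagnoses gives the curator a worklist each time a new cohort is added, so the mapping file grows without code changes.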

aadamk commented 2 years ago

I think it makes sense for us to continue using DGD as the cohort, as the cohort is specific to the sequencing strategy and the data source.

I agree - the consent protocol can be internal knowledge that we retain in the warehouse as a source of truth.

While we are thinking about this, consider we also have GMKF NBL tumor-only WXS for the same tumors in which we have T/N WGS. The idea for this design was that WGS calls would validate WXS calls which have deeper coverage. We have a somewhat similar but different design for some PNOC (WGS and WXS), and in the independent specimen list, we preferentially take WXS calls for mutation frequency tables. We could preferentially take DGD panel calls as a higher confidence set than WGS, but then we lose any calls not within the panel BED regions.

This is a difficult one - I'm not sure what the optimal approach is, but one idea that comes to mind, if we do not want to lose WGS calls, would be to include mutations occurring in regions outside the panel's BED regions, requiring a more stringent threshold for VAF and allelic depth.
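A rough sketch of that idea, with everything numeric assumed: the cutoffs `MIN_VAF` and `MIN_ALT_DEPTH` are placeholders rather than recommended values, and the panel is represented as sorted, non-overlapping BED-style intervals per chromosome. The call fields use MAF-style names (`Start_Position`, `t_depth`, `t_alt_count`).

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Panel regions as BED-style intervals: 0-based start, exclusive end,
# sorted and non-overlapping within each chromosome.
Panel = Dict[str, List[Tuple[int, int]]]

# Hypothetical cutoffs for WGS calls outside the panel; not values from the thread.
MIN_VAF = 0.10
MIN_ALT_DEPTH = 8


def in_panel(panel: Panel, chrom: str, start_position: int) -> bool:
    """True if a 1-based MAF Start_Position falls inside a panel interval."""
    pos0 = start_position - 1  # convert to the 0-based BED coordinate
    intervals = panel.get(chrom, [])
    # Index of the last interval whose start is <= pos0, or -1 if there is none.
    i = bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


def keep_wgs_call(call: dict, panel: Panel) -> bool:
    """Keep a WGS call only if it is off-panel and passes the stricter cutoffs."""
    if in_panel(panel, call["Chromosome"], int(call["Start_Position"])):
        return False  # on-panel positions are left to the panel calls
    depth = int(call["t_depth"])
    alt = int(call["t_alt_count"])
    vaf = alt / depth if depth else 0.0
    return alt >= MIN_ALT_DEPTH and vaf >= MIN_VAF


# Made-up panel region, for illustration only.
panel = {"chr7": [(100, 200)]}
```

On-panel WGS calls are dropped here on the assumption that the panel calls replace them; keeping them for cross-validation instead would be an equally reasonable design.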

aadamk commented 2 years ago

@aadamk are you able to work with ADAPT/CRU on doing with DGD what we did with PNOC. That is, we moved the initial pathology_diagnosis to pathology_free_text_diagnosis and harmonized pathology_diagnosis to be current with CBTN available selections for brain tumors?

Yes, I should be able to work with them on the above.

runjin326 commented 2 years ago

@jharenza, I have now generated a brief mapping from pathology diagnosis to cancer group + broad histology - attached here as an Excel file.

cg_bh_path_match.xlsx

I also noticed that the tumor_descriptor field, in addition to the terms that we currently have (i.e., Initial CNS Tumor, Primary Tumor, Recurrence, Progressive, and Progressive Disease Post Mortem), has the following additional terms:

Initial Tumor/Cancer Diagnosis
Recurrence/Relapse
Second Malignancy
Progressive;Recurrence

Do we want to harmonize them to what we currently have? Or do we want to leave them as is and then modify the independent samples module to account for them?
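If the first option (harmonizing) were chosen, it might look something like the sketch below. The two pairings in `TUMOR_DESCRIPTOR_MAP` are guesses that would need curator review; the thread does not settle them. Terms with no obvious existing equivalent are passed through unchanged and reported.

```python
from typing import Dict, List, Tuple

# Terms already present in the histologies file, as listed in the comment above.
EXISTING_TERMS = {
    "Initial CNS Tumor",
    "Primary Tumor",
    "Recurrence",
    "Progressive",
    "Progressive Disease Post Mortem",
}

# Hypothetical harmonization choices; each pairing needs curator review.
TUMOR_DESCRIPTOR_MAP: Dict[str, str] = {
    "Initial Tumor/Cancer Diagnosis": "Primary Tumor",
    "Recurrence/Relapse": "Recurrence",
}


def harmonize_tumor_descriptor(values: List[str]) -> Tuple[List[str], List[str]]:
    """Map known variants to existing terms and list the terms still unresolved."""
    harmonized = [TUMOR_DESCRIPTOR_MAP.get(value, value) for value in values]
    unresolved = sorted({value for value in harmonized if value not in EXISTING_TERMS})
    return harmonized, unresolved


descriptors = [
    "Initial CNS Tumor",
    "Recurrence/Relapse",
    "Second Malignancy",
    "Progressive;Recurrence",
    "Initial Tumor/Cancer Diagnosis",
]
harmonized_descriptors, unresolved_terms = harmonize_tumor_descriptor(descriptors)
```

The unresolved list is what the independent samples module would still have to handle under either option.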

jharenza commented 2 years ago

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/pull/188