PediatricOpenTargets / OpenPedCan-api


Update TPM boxplots to show relapse tumors side-by-side with primary tumors #51

Closed logstar closed 2 years ago

logstar commented 2 years ago

Currently, /tpm/gene-disease-gtex and /tpm/gene-all-cancer boxplots only show primary tumors.

As suggested by @taylordm, the TPM boxplots could be updated to show relapse tumors side-by-side with primary tumors. Primary and relapse tumors could be plotted with different box colors, and a legend could be added to the boxplot to explain the colors.

cc @taylordm @chinwallaa @afarrel

DBHI-BiG commented 2 years ago

Alternatively, this could be offered as an expanded plot in the API, with the choice to include relapsed tumors together with primary tumors, as a second plot.

logstar commented 2 years ago

> Alternatively this could be offered an expanded plot in the API with the choice to include relapsed tumors with primary tumors together, as a second plot.

I agree. Maybe we could add an additional parameter for this purpose. For example, a required relapseTumors parameter could take one of the following values:

The parameter name and values above could be different in real development, to be more informative and flexible.

DBHI-BiG commented 2 years ago

This is a great idea!

logstar commented 2 years ago

The following are some plans for implementing this feature. I will start with the plan that keeps using the same table for adding relapse tumors, because this plan may take less effort overall to implement.

The implementation plan is open to other suggestions.

logstar commented 2 years ago

@taylordm @chinwallaa @afarrel - I encountered a technical issue while designing the API and R analysis function interfaces. The issue is that certain cancer groups currently have only primary or only relapse tumors. There are far fewer such cancer groups after requiring >= 3 samples per cancer group, but more samples and refined cancer groups will be added to OpenPedCan-analysis. Specific numbers of v10 RNA-seq independent primary and relapse samples are listed at the bottom.
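The situation can be checked directly from per-group sample counts. The following is a minimal sketch, not the project's code: the column names `n_primary` and `n_relapse` and the `available` labels are hypothetical, and the four rows are copied from the all-cohorts columns of the sample number table at the bottom of this comment.

```r
# Counts copied from the all-cohorts columns of the table in this comment.
# Column names n_primary / n_relapse are hypothetical, for illustration only.
counts <- data.frame(
  cancer_group = c("Adenoma", "Chordoma", "Germinoma;Teratoma",
                   "Diffuse intrinsic pontine glioma"),
  n_primary = c(3, 3, 0, 4),
  n_relapse = c(0, 1, 1, 6),
  stringsAsFactors = FALSE
)

# Label each cancer group by which tumor descriptors have any samples.
counts$available <- ifelse(
  counts$n_primary > 0 & counts$n_relapse > 0, "primary and relapse",
  ifelse(counts$n_primary > 0, "primary only", "relapse only"))

print(counts[, c("cancer_group", "available")])
```

Groups labelled "primary only" or "relapse only" are the ones for which a relapse-only or primary-only request, respectively, would have no data to plot.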

I will continue working with the following options. These options could be changed to any other options and are open to any suggestions. I will be doing database development for the next couple of days, so changing API and R function interface designs will not set back any progress.

The issue has a couple of implications for the API interface change, which currently is to add a relapseTumors URL query parameter as described in https://github.com/PediatricOpenTargets/OpenPedCan-api/issues/51#issuecomment-942623177.

The issue has one implication for the PedOT default boxplot. If the API returns an HTTP error code, PedOT could query primary-only and primary-and-relapse and use any available plot according to a set priority; if no plot is available, PedOT could grey out the expression boxplot widget.
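The fallback described above could be sketched as follows. This is only an illustration under stated assumptions: `first_available_plot`, `fetch_plot`, and `fake_fetch` are hypothetical names, the priority order shown is an arbitrary example, and the HTTP request is replaced by a stand-in function so the sketch runs without a server.

```r
# Try tumor-descriptor options in priority order and return the first plot
# that is available. `fetch_plot` is a hypothetical function supplied by the
# caller: it takes one option name and returns
# list(status = <HTTP status code>, body = <plot content>).
first_available_plot <- function(fetch_plot,
                                 priority = c("primary-only",
                                              "primary-and-relapse")) {
  for (opt in priority) {
    res <- fetch_plot(opt)
    if (isTRUE(res$status == 200)) {
      return(list(option = opt, body = res$body))
    }
  }
  # No option returned a plot: the caller could grey out the widget.
  NULL
}

# Stand-in for a real HTTP request (assumption: only the
# primary-and-relapse plot is available for this cancer group).
fake_fetch <- function(opt) {
  if (opt == "primary-and-relapse") {
    list(status = 200L, body = "<png bytes>")
  } else {
    list(status = 500L, body = NULL)
  }
}

res <- first_available_plot(fake_fetch)
print(res$option)
```

A `NULL` result corresponds to the case where the expression boxplot widget would be greyed out.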

According to the API interface design, the R analysis functions need to properly handle the boxes, colors, legend, and x-labels of cancer groups with only primary or only relapse tumors. This handling would require nested, interdependent if-else statements, so unit testing would be very helpful to reduce or eliminate coding errors.

With the issue and implications above, I was wondering if we should redesign the additional relapse API parameter as follows. Add an includeTumorDesc parameter, indicating that the API response needs to include certain tumor descriptors, with the following choices:

| Parameter value | Parameter value description | Legend | X-labels | Color |
| --- | --- | --- | --- | --- |
| primary-only | Only show primary tumors. | No legend. | Tumor tissue x-labels are {cancer_group} primary tumors (Dataset = {cohort}, N = {n_samples}); GTEx tissue x-labels are {GTEx_tissue_subgroup} (Dataset = GTEx, N = {n_samples}). | Colors are red and grey for cancer and GTEx tissues respectively. |
| relapse-only | Only show relapse tumors. | No legend. | Tumor tissue x-labels are {cancer_group} relapse tumors (Dataset = {cohort}, N = {n_samples}); GTEx tissue x-labels are {GTEx_tissue_subgroup} (Dataset = GTEx, N = {n_samples}). | Colors are red and grey for cancer and GTEx tissues respectively. |
| primary-and-relapse-same-box | Show primary and relapse tumors in the same box. | No legend. | Tumor tissue x-labels are {cancer_group} {unique_tumor_descriptors} tumors (Dataset = {cohort}, N = {n_samples}); GTEx tissue x-labels are {GTEx_tissue_subgroup} (Dataset = GTEx, N = {n_samples}). If there is only primary or relapse tumors, and API does not return HTTP error code, {unique_tumor_descriptors} is the unique tumor descriptors of the samples in the corresponding box, primary and/or relapse. | Colors are red and grey for cancer and GTEx tissues respectively. |
| primary-and-relapse-separate-boxes | Show primary and relapse tumors in separate boxes side-by-side. | Color legend for primary, relapse, and GTEx samples. | Tumor tissue x-labels are {cancer_group} {unique_tumor_descriptor} tumors (Dataset = {cohort}, N = {n_samples}); GTEx tissue x-labels are {GTEx_tissue_subgroup} (Dataset = GTEx, N = {n_samples}). If there is only primary or relapse tumors, and API does not return HTTP error code, {unique_tumor_descriptor} is the unique tumor descriptor of the samples in the corresponding box, primary or relapse. | Colors are red, dark red, and grey for primary tumor, relapse tumor, and GTEx tissues respectively. |
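The x-label templates and the legend rule in the design above could be sketched in R as follows. This is a minimal illustration, not the API's implementation: the function and argument names (`tumor_x_label`, `gtex_x_label`, `needs_legend`) are hypothetical, and "ExampleCohort" is a placeholder cohort name. The N of 88 is the sum of the Ependymoma all-cohorts primary (69) and relapse (19) counts from the table below.

```r
# Allowed values of the proposed includeTumorDesc parameter.
tumor_desc_choices <- c(
  "primary-only", "relapse-only",
  "primary-and-relapse-same-box", "primary-and-relapse-separate-boxes")

# Build the x-label of one tumor box. `descriptors` holds the tumor
# descriptors of the samples in that box, e.g. "primary" or
# c("primary", "relapse"). Names here are hypothetical.
tumor_x_label <- function(cancer_group, descriptors, cohort, n_samples) {
  desc <- paste(sort(unique(descriptors)), collapse = " and ")
  sprintf("%s %s tumors (Dataset = %s, N = %d)",
          cancer_group, desc, cohort, n_samples)
}

# Build the x-label of one GTEx box.
gtex_x_label <- function(gtex_tissue_subgroup, n_samples) {
  sprintf("%s (Dataset = GTEx, N = %d)", gtex_tissue_subgroup, n_samples)
}

# Only the separate-boxes option has a color legend in the design above.
# match.arg() raises an error for values outside tumor_desc_choices.
needs_legend <- function(include_tumor_desc) {
  include_tumor_desc <- match.arg(include_tumor_desc, tumor_desc_choices)
  include_tumor_desc == "primary-and-relapse-separate-boxes"
}

print(tumor_x_label("Ependymoma", c("primary", "relapse"),
                    "ExampleCohort", 88L))
print(needs_legend("primary-and-relapse-same-box"))
```

Keeping label construction in small functions like these would make the nested if-else handling easier to unit test, since each piece can be checked without drawing a plot.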

Cancer group tumor sample number table:

| cancer_group | n_independent_primary_tumors_each_cohort | n_independent_relapse_tumors_each_cohort | n_independent_primary_tumors_all_cohorts | n_independent_relapse_tumors_all_cohorts |
| --- | --- | --- | --- | --- |
| Acute Lymphoblastic Leukemia | 458 | 73 | 458 | 73 |
| Acute Myeloid Leukemia | 141 | 40 | 141 | 40 |
| Adenoma | 3 | 0 | 3 | 0 |
| Atypical Teratoid Rhabdoid Tumor | 24 | 6 | 24 | 6 |
| Cavernoma | 1 | 0 | 1 | 0 |
| Chordoma | 3 | 1 | 3 | 1 |
| Choroid plexus carcinoma | 4 | 0 | 4 | 0 |
| Choroid plexus cyst | 1 | 0 | 1 | 0 |
| Choroid plexus papilloma | 14 | 0 | 14 | 0 |
| Clear cell sarcoma of the kidney | 13 | 0 | 13 | 0 |
| CNS Embryonal tumor | 8 | 4 | 8 | 4 |
| CNS neuroblastoma | 1 | 1 | 1 | 1 |
| Craniopharyngioma | 28 | 7 | 28 | 7 |
| Diffuse intrinsic pontine glioma | 4 | 6 | 4 | 6 |
| Diffuse midline glioma | 59 | 11 | 59 | 11 |
| Dysembryoplastic neuroepithelial tumor | 19 | 5 | 19 | 5 |
| Embryonal tumor with multilayer rosettes | 4 | 3 | 4 | 3 |
| Ependymoma | 69 | 19 | 69 | 19 |
| Ewing sarcoma | 8 | 1 | 8 | 1 |
| Ganglioglioma | 36 | 9 | 36 | 8 |
| Ganglioneuroblastoma | 1 | 1 | 1 | 1 |
| Ganglioneuroma | 1 | 0 | 1 | 0 |
| Germinoma | 4 | 0 | 4 | 0 |
| Germinoma;Teratoma | 0 | 1 | 0 | 1 |
| Glial-neuronal tumor NOS | 6 | 3 | 6 | 3 |
| Hemangioblastoma | 2 | 1 | 2 | 1 |
| High-grade glioma/astrocytoma | 79 | 33 | 79 | 32 |
| Langerhans Cell histiocytosis | 4 | 0 | 4 | 0 |
| Low-grade glioma/astrocytoma | 197 | 37 | 197 | 37 |
| Malignant peripheral nerve sheath tumor | 3 | 1 | 2 | 1 |
| Medulloblastoma | 105 | 13 | 105 | 13 |
| Melanocytic tumor | 1 | 0 | 1 | 0 |
| Meningioma | 14 | 7 | 14 | 7 |
| Metastatic secondary tumors | 3 | 1 | 3 | 1 |
| Metastatic secondary tumors;Neuroblastoma | 0 | 3 | 0 | 3 |
| Myofibroblastoma | 0 | 1 | 0 | 1 |
| Myxoid spindle cell tumor | 1 | 0 | 1 | 0 |
| Neuroblastoma | 358 | 10 | 347 | 10 |
| Neurofibroma/Plexiform | 10 | 5 | 9 | 5 |
| Oligodendroglioma | 1 | 1 | 1 | 1 |
| Osteosarcoma | 87 | 0 | 87 | 0 |
| Pineoblastoma | 4 | 0 | 4 | 0 |
| Rhabdoid tumor | 64 | 0 | 64 | 0 |
| Rhabdomyosarcoma | 2 | 0 | 2 | 0 |
| Rosai-Dorfman disease | 1 | 0 | 1 | 0 |
| Sarcoma | 3 | 2 | 3 | 2 |
| Schwannoma | 14 | 2 | 13 | 2 |
| Subependymal Giant Cell Astrocytoma | 3 | 1 | 3 | 1 |
| Teratoma | 4 | 3 | 4 | 3 |
| Wilms tumor | 124 | 5 | 124 | 5 |

R script for generating the sample number table:

library(tidyverse)

hdf <- read_tsv('OpenPedCan-analysis/data/histologies.tsv', guess_max = 1e6)

tpm_df <- readRDS(
  'OpenPedCan-analysis/data/gene-expression-rsem-tpm-collapsed.rds')

em_df <- read_tsv('OpenPedCan-analysis/data/efo-mondo-map.tsv')

isdf_list <- list(
  each_cohort = list(
    primary = read_tsv(
      'OpenPedCan-analysis/data/independent-specimens.rnaseq.primary.eachcohort.tsv'),

    relapse = read_tsv(
      'OpenPedCan-analysis/data/independent-specimens.rnaseq.relapse.eachcohort.tsv')
  ),

  all_cohorts = list(
    primary = read_tsv(
      'OpenPedCan-analysis/data/independent-specimens.rnaseq.primary.tsv'),

    relapse = read_tsv(
      'OpenPedCan-analysis/data/independent-specimens.rnaseq.relapse.tsv')
  )
)

npr_list <- map(isdf_list, function(xl) {
  res_l <- imap(xl, function(xdf, xname) {
    xdf <- xdf %>%
      filter(
        Kids_First_Biospecimen_ID %in% colnames(tpm_df),
        Kids_First_Biospecimen_ID %in% hdf$Kids_First_Biospecimen_ID)

    stopifnot(identical(
      length(unique(hdf$Kids_First_Biospecimen_ID)),
      nrow(hdf)
    ))

    xdf <- xdf %>%
      left_join(select(hdf, Kids_First_Biospecimen_ID, cancer_group)) %>%
      filter(!is.na(.data$cancer_group)) %>%
      left_join(em_df) %>%
      filter(!is.na(.data$efo_code)) %>%
      mutate(type = .env$xname)

    return(xdf)
  })

  res_df <- reduce(res_l, bind_rows)

  res <- res_df %>%
    group_by(cancer_group) %>%
    summarise(
      n_independent_primary_tumors = sum(.data$type == 'primary'),
      n_independent_relapse_tumors = sum(.data$type == 'relapse'))

  return(res)
})

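# Keep only cancer groups with >= 3 independent samples (primary + relapse).
msf_npr_list <- map(npr_list, function(xdf) {
  xdf %>%
    filter(.data$n_independent_primary_tumors + .data$n_independent_relapse_tumors >= 3)
})

# Combine each-cohort and all-cohorts counts, without the sample number filter.
npr_df <- full_join(
  npr_list$each_cohort, npr_list$all_cohorts, by = 'cancer_group',
  suffix = c('_each_cohort', '_all_cohorts'))

# Same combination for the filtered counts, so that the second output file
# contains the >= 3 sample subset rather than a copy of npr_df.
msf_npr_df <- full_join(
  msf_npr_list$each_cohort, msf_npr_list$all_cohorts, by = 'cancer_group',
  suffix = c('_each_cohort', '_all_cohorts'))

write_tsv(npr_df, '../eda_scripts/primary_relapse_n_biospecs.tsv')

write_tsv(msf_npr_df, '../eda_scripts/primary_relapse_cg_n_ge3_n_biospecs.tsv')
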
logstar commented 2 years ago

@taylordm @chinwallaa @afarrel - The database is updated, and I have been developing the API code according to the design in my last comment. The API code may require one or two weeks to develop, which is estimated based on the following complexity analysis.

All explicit parameter combinations are the following Cartesian product, which has 64 combinations.

{one EFO ID, all EFO IDs} x
  {zero GTEx tissues, all GTEx tissues} x
  {standard plot width, double plot width} x
  {collapse GTEx tissues, expand GTEx tissues} x
  {primary only, relapse only,
   primary and relapse in the same box,
   primary and relapse in different boxes side by side}
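The count of 64 can be reproduced by enumerating the Cartesian product with base R's `expand.grid()`. This is a small sketch; the column names and value labels are hypothetical stand-ins for the sets listed above, not actual API parameter names.

```r
# Enumerate every explicit parameter combination listed above.
# Column names and labels are hypothetical, for illustration only.
param_grid <- expand.grid(
  efo_id = c("one EFO ID", "all EFO IDs"),
  gtex_tissues = c("zero GTEx tissues", "all GTEx tissues"),
  plot_width = c("standard", "double"),
  gtex_grouping = c("collapse", "expand"),
  tumor_desc = c("primary only", "relapse only",
                 "primary and relapse, same box",
                 "primary and relapse, separate boxes"),
  stringsAsFactors = FALSE
)

# 2 * 2 * 2 * 2 * 4 combinations, one per row.
print(nrow(param_grid))
```

A grid like this could also drive a systematic test script, with one request per row.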

API code control flows also need to implicitly handle the following behaviors, according to any explicit parameter combination.

I also suggest starting to replace the API with the following workflow, given that the API complexity will expand further after implementing #37. If we will eventually replace the API, continuing to develop it will keep doubling the effort required to develop the final PedOT platform.

logstar commented 2 years ago

@taylordm @chinwallaa @afarrel - Following are plots with primary-only, relapse-only, primary-and-relapse-in-same-box, and primary-and-relapse-in-different-boxes tumor samples. Primary and/or relapse are described in x-labels with "Specimen = {Pediatric Primary and/or Relapse Tumors}".

I will be working on developing a new systematic testing script to test all possible parameters for a couple of genes and EFO IDs. The current testing shell script, tests/curl_test_endpoints.sh, can no longer handle the number of API endpoints and parameters.

Let me know if you have questions or suggestions.

logstar commented 2 years ago

@taylordm @chinwallaa @afarrel - I have implemented the testing framework using the R package testthat. Unit tests can also be implemented using this testing framework at a later point if necessary.

I will be working on the following items to prepare a pull request for this issue:

The following is the output of the test run. There are 288 API tests in total.

$ ./tests/run_tests.sh 
API base URL: http://localhost:8082
✔ |  OK F W S | Context
✔ | 288       | tests/r_test_scripts/test_endpoint_http.R [587.7 s]

══ Results 
Duration: 587.7 s

[ FAIL 0 | WARN 0 | SKIP 0 | PASS 288 ]
Done running run_tests.sh

The response time of each endpoint is summarized in the following boxplot. Both HTTP 200 success and HTTP 500 error response codes are expected. An HTTP 200 code means the requested data is available. An HTTP 500 code means the requested data is not available, e.g. relapse tumors of Choroid plexus carcinoma. On localhost, JSON table endpoint response times are about 1-2 seconds, and PNG plot endpoint response times are about 2-4 seconds. On remote hosts, the response times will probably be 1-2 seconds slower, but I will test this to confirm.

Given that we expect a plot within 5 seconds, the performance of the R boxplot API may need to be optimized. I will submit an issue about this after testing remote hosts.

[Image: endpoint_response_time_boxplot]

DBHI-BiG commented 2 years ago

Fantastic work. Thank you for the update.

In reply to the notification email from Yuanchao Zhang dated Wednesday, November 10, 2021 at 9:31 AM.