AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

`age_at_diagnosis_days` column contained different information prior to v6 #260

Open jaclyn-taroni opened 5 years ago

jaclyn-taroni commented 5 years ago

What data file(s) does this issue pertain to?

pbta-histologies.tsv

What release are you using?

I am comparing release-v5-20190924 to the current release, release-v9-20191105.

Put a link to the relevant section of the OpenPBTA-manuscript here.

There is nothing in the data harmonization section about this specifically yet (https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#clinical-data-harmonization), but it is related to this open PR: https://github.com/AlexsLemonade/OpenPBTA-manuscript/pull/55/files#diff-1958beef0777ea964ffe394a82c98903R163

Put your question or report your issue here.

For context, the independent-samples module was developed using release-v5-20190924. @jashapiro used the age_at_diagnosis_days column to identify the earliest samples:

https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/9c68671a4cba89681046c026eb1658794165e836/analyses/independent-samples/independent-samples.R#L48

I was looking to use a similar strategy as part of the participant-level merging in the Oncoprint pipeline (#243), but the information content of this column has changed. My intuition is that it used to contain something akin to age at timepoint and now it contains, as its name suggests, the age at diagnosis.

In the v9 histologies file, there are no instances of multiple ages being associated with the same participant ID:

```r
> v9_histologies_df <- readr::read_tsv("data/release-v9-20191105/pbta-histologies.tsv")
> v9_ages <- v9_histologies_df %>% 
    group_by(Kids_First_Participant_ID) %>% 
    summarize(ages = paste(sort(unique(age_at_diagnosis_days)), collapse = ", "))
> sum(grepl(",", v9_ages$ages))
[1] 0
```

In v5, that was not the case:

```r
> v5_histologies_df <- readr::read_tsv("data/release-v5-20190924/pbta-histologies.tsv")
> v5_ages <- v5_histologies_df %>% 
    group_by(Kids_First_Participant_ID) %>% 
    summarize(ages = paste(sort(unique(age_at_diagnosis_days)), collapse = ", "))
> sum(grepl(",", v5_ages$ages))
[1] 86
```

This looks like it changed between v5 and v6:

```r
> v6_histologies_df <- readr::read_tsv("data/release-v6-20191030/pbta-histologies.tsv")
> v6_ages <- v6_histologies_df %>% 
    group_by(Kids_First_Participant_ID) %>% 
    summarize(ages = paste(sort(unique(age_at_diagnosis_days)), collapse = ", "))
> sum(grepl(",", v6_ages$ages))
[1] 0
```

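
The same check is repeated for each release, so it could be wrapped in a small helper. The following is only a sketch: `count_multi_age_participants()` is a hypothetical function name, not something in the repository, and the toy data frame is invented for illustration. It counts distinct non-missing ages per participant rather than searching for commas in a pasted string, which should give the same answer.

```r
library(dplyr)

# Hypothetical helper (not part of the repository): count participants with
# more than one distinct, non-missing age_at_diagnosis_days value.
count_multi_age_participants <- function(histologies_df) {
  histologies_df %>%
    group_by(Kids_First_Participant_ID) %>%
    summarize(n_ages = n_distinct(age_at_diagnosis_days, na.rm = TRUE)) %>%
    filter(n_ages > 1) %>%
    nrow()
}

# Toy data, invented for illustration only
toy_df <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B", "PT_B", "PT_C"),
  age_at_diagnosis_days = c(1000, 1200, 500, 500, NA)
)

# Only PT_A has two different ages
n_multi <- count_multi_age_participants(toy_df)
print(n_multi)
```

Applied to each release's `pbta-histologies.tsv` (read with `readr::read_tsv()` as above), this would be expected to reproduce the counts reported for v5, v6, and v9.
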
jaclyn-taroni commented 5 years ago

That is noted here: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/release-notes.md#release-v6-20191030

> Updated age_at_diagnosis to earliest age reported (same age used in OS calculations)

I think we do want some notion of the order in which different samples from the same patient were taken, but it doesn't need to be tied to age at all.

jharenza commented 5 years ago

Hi @jaclyn-taroni - yes, this changed with the update. I noticed some discrepancies between the two databases we used, so I defaulted to the age that was used to calculate OS. That age is stored in a database separate from kidsfirst, which is where the data was first pulled from. :(

jharenza commented 5 years ago

> That is noted here: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/release-notes.md#release-v6-20191030
>
> > Updated age_at_diagnosis to earliest age reported (same age used in OS calculations)
>
> I think we do want some notion of the order in which different samples from the same patient were taken, but it doesn't need to be tied to age at all.

Re: this, we don't have dates, and those would be PII anyway. I think the thing you can do is go by the SDG ID, which is the sample and aliquot ID. @allisonheath, do you know if these are always sequential for a patient?

jaclyn-taroni commented 5 years ago

Is it possible to have some field that indicates the order of samples instead?

jaclyn-taroni commented 5 years ago

Not looking for dates, looking for rank order (e.g., 1, 2, 3).
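
To make the request concrete, here is a rough sketch of what such a rank-order field could look like. It assumes a per-sample age is available, similar to what the v5 column appeared to hold. The column name `age_at_collection_days`, the new `sample_order` field, and the toy data are all hypothetical and invented for illustration.

```r
library(dplyr)

# Toy data, invented for illustration; age_at_collection_days is a
# hypothetical per-sample age, not a column in the histologies file
toy_df <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_A", "PT_B"),
  sample_id = c("S3", "S1", "S2", "S4"),
  age_at_collection_days = c(1500, 1000, 1000, 700)
)

# Rank samples within each participant; samples with the same age share a rank
ranked_df <- toy_df %>%
  group_by(Kids_First_Participant_ID) %>%
  mutate(sample_order = dense_rank(age_at_collection_days)) %>%
  ungroup() %>%
  arrange(Kids_First_Participant_ID, sample_order)

print(ranked_df)
```

`dense_rank()` gives tied ages the same rank, so samples taken at the same timepoint are not forced into an arbitrary order. A field like `sample_order` carries no dates or ages itself, only the ordering.
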

jharenza commented 5 years ago

@allisonheath - do we have that information? Typically this would be denoted by phase of therapy (Initial CNS Tumor, Progressive, etc.). I think we only recently talked about adding an event field to the clinical information. This is currently being brainstormed as an overhaul of roughly 6 months to 1 year(?) with @allisonheath's team.

jaclyn-taroni commented 5 years ago

I think an issue with using tumor_descriptor alone is that there are instances where a combination of Kids_First_Participant_ID and tumor_descriptor is not unique:

```r
> v9_histologies_df %>%
+     filter(sample_type == "Tumor",
+            composition == "Solid Tissue",
+            experimental_strategy != "RNA-Seq") %>% 
+     group_by(Kids_First_Participant_ID, tumor_descriptor) %>% 
+     tally() %>% 
+     arrange(desc(n))
# A tibble: 876 x 3
# Groups:   Kids_First_Participant_ID [813]
   Kids_First_Participant_ID tumor_descriptor                    n
   <chr>                     <chr>                           <int>
 1 PT_KZ56XHJT               Progressive Disease Post-Mortem     6
 2 PT_2WVW55DA               Progressive                         4
 3 PT_KBFM551M               Initial CNS Tumor                   4
 4 PT_KTRJ8TFY               Progressive Disease Post-Mortem     4
 5 PT_MNSEJCDM               Initial CNS Tumor                   4
 6 PT_1H2REHT2               Progressive                         3
 7 PT_7JQ24F35               Unavailable                         3
 8 PT_9GKVQ9QS               Initial CNS Tumor                   3
 9 PT_K8ZV7APT               Initial CNS Tumor                   3
10 PT_KTRJ8TFY               Initial CNS Tumor                   3
```

jaclyn-taroni commented 5 years ago

Okay - so sample_id can be used for the specific participant-level mapping in the Oncoprint stuff that put this issue in motion:

```r
> dup_sample_ids <- histologies_df$sample_id[which(duplicated(histologies_df$sample_id))]
> histologies_df %>% filter(sample_id %in% dup_sample_ids[1:5]) %>% arrange(sample_id)
# A tibble: 10 x 33
   Kids_First_Bios… sample_id aliquot_id Kids_First_Part… experimental_st… sample_type composition tumor_descriptor primary_site reported_gender race  ethnicity age_at_diagnosi… disease_type_old disease_type_new
   <chr>            <chr>     <chr>      <chr>            <chr>            <chr>       <chr>       <chr>            <chr>        <chr>           <chr> <chr>     <chr>            <chr>            <chr>           
 1 BS_03SSN1N2      7316-1744 711115     PT_3KK1F95W      WGS              Tumor       Solid Tiss… Initial CNS Tum… Temporal Lo… Female          White Not Hisp… 4633             Low-grade gliom… Low-grade gliom…
 2 BS_0X9EGHY2      7316-1744 717174     PT_3KK1F95W      RNA-Seq          Tumor       Solid Tiss… Initial CNS Tum… Temporal Lo… Female          White Not Hisp… 4633             Low-grade gliom… Low-grade gliom…
 3 BS_0DKPGQWD      7316-183  711426     PT_BQ8BQ01J      WGS              Tumor       Solid Tiss… Initial CNS Tum… Frontal Lobe Male            White Not Hisp… 8422             Low-grade gliom… Low-grade gliom…
 4 BS_0T17SY47      7316-183  717156     PT_BQ8BQ01J      RNA-Seq          Tumor       Solid Tiss… Initial CNS Tum… Frontal Lobe Male            White Not Hisp… 8422             Low-grade gliom… Low-grade gliom…
 5 BS_0ZR4XA69      7316-1855 711372     PT_7VSY72EK      WGS              Tumor       Solid Tiss… Initial CNS Tum… Skull;Tempo… Male            White Not Hisp… 2929             Primary CNS lym… Primary CNS lym…
 6 BS_1607397Q      7316-1855 711731     PT_7VSY72EK      RNA-Seq          Tumor       Solid Tiss… Initial CNS Tum… Skull;Tempo… Male            White Not Hisp… 2929             Primary CNS lym… Primary CNS lym…
 7 BS_05S9WJW6      7316-2659 711416     PT_23NZGSRJ      WGS              Tumor       Solid Tiss… Initial CNS Tum… Cerebellum/… Female          White Not Hisp… 1484             Medulloblastoma  Medulloblastoma 
 8 BS_1BWP5MCT      7316-2659 717140     PT_23NZGSRJ      RNA-Seq          Tumor       Solid Tiss… Initial CNS Tum… Cerebellum/… Female          White Not Hisp… 1484             Medulloblastoma  Medulloblastoma 
 9 BS_12ZB7R6A      7316-41   478601     PT_CW0BJE0Y      WGS              Tumor       Solid Tiss… Initial CNS Tum… Cerebellum/… Male            Repo… Not Repo… 883              Atypical Terato… Atypical Terato…
10 BS_15ETQ0E4      7316-41   549586     PT_CW0BJE0Y      RNA-Seq          Tumor       Solid Tiss… Initial CNS Tum… Cerebellum/… Male            Repo… Not Repo… 883              Atypical Terato… Atypical Terato…
# … with 18 more variables: short_histology <chr>, broad_histology <chr>, broad_composition <chr>, Notes <chr>, germline_sex_estimate <chr>, RNA_library <chr>, OS_days <dbl>, OS_status <chr>, cohort <chr>,
#   age_last_update_days <dbl>, source_text_tumor_descriptor <chr>, cancer_predispositions <chr>, seq_center <chr>, normal_fraction <dbl>, tumor_fraction <dbl>, glioma_brain_region <chr>, tumor_ploidy <dbl>,
#   molecular_subtype <chr>
```

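
As a rough illustration of that mapping, the sketch below collapses the WGS and RNA-Seq rows that share a `sample_id` into one row per participant and sample. The values are copied from the first four rows of the output above. Because the first column's name is truncated in that output, `biospecimen_id` is used here as a placeholder name, and `sample_level_df` is likewise just an illustrative name.

```r
library(dplyr)

# Values copied from the first four rows shown above; biospecimen_id is a
# placeholder for the first column, whose name is truncated in the output
histologies_subset <- data.frame(
  biospecimen_id = c("BS_03SSN1N2", "BS_0X9EGHY2", "BS_0DKPGQWD", "BS_0T17SY47"),
  sample_id = c("7316-1744", "7316-1744", "7316-183", "7316-183"),
  Kids_First_Participant_ID = c("PT_3KK1F95W", "PT_3KK1F95W",
                                "PT_BQ8BQ01J", "PT_BQ8BQ01J"),
  experimental_strategy = c("WGS", "RNA-Seq", "WGS", "RNA-Seq")
)

# One row per participant and sample_id, listing the assays and biospecimens
sample_level_df <- histologies_subset %>%
  group_by(Kids_First_Participant_ID, sample_id) %>%
  summarize(
    strategies = paste(sort(unique(experimental_strategy)), collapse = ", "),
    biospecimen_ids = paste(sort(biospecimen_id), collapse = ", "),
    .groups = "drop"
  )

print(sample_level_df)
```

Each resulting row ties the DNA and RNA biospecimens for one sample together, which is the pairing the participant-level merge needs.
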
jharenza commented 5 years ago

Hmm, this is tough because post-mortem samples were all taken at autopsy, which only occurs once, so we will have to just choose one.

As far as I know, sample_id is really an event ID. What I am not sure of is whether a retired event ID can be reused, and whether a numerically earlier event ID can actually correspond to a later event. Allison would know that.
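
Until that is confirmed, one way to probe the question might be to compare the numeric part of `sample_id` against a per-sample age, for example the v5 values if those really were ages at timepoint. This is only a sketch under that assumption. The toy data, the `age_days` column, and the `id_number` and `id_order_matches_age` names are invented for illustration, and the `7316-<number>` format is inferred from the sample IDs shown earlier in this thread.

```r
library(dplyr)

# Toy data, invented for illustration only
toy_df <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B", "PT_B"),
  sample_id = c("7316-100", "7316-250", "7316-900", "7316-300"),
  age_days = c(1000, 1400, 2000, 2500)
)

order_check_df <- toy_df %>%
  # Keep only the digits after the last hyphen, e.g. "7316-100" -> 100
  mutate(id_number = as.numeric(sub("^.*-", "", sample_id))) %>%
  group_by(Kids_First_Participant_ID) %>%
  summarize(
    # TRUE when ordering samples by ID number leaves ages non-decreasing
    id_order_matches_age = !is.unsorted(age_days[order(id_number)]),
    .groups = "drop"
  )

print(order_check_df)
```

Participants flagged `FALSE` would be cases where a lower ID number goes with a later sample, which would argue against relying on `sample_id` alone for ordering.
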

jharenza commented 4 years ago

@allisonheath - do you have a rank order for specimens?