Open jaclyn-taroni opened 5 years ago
Hi @jaclyn-taroni - yes, this changed with the update because I noticed some discrepancies in the two databases we used, so I defaulted to use the age which was used to calculate OS, which is in one database away from kidsfirst, from which the data was first pulled. :(
That is noted here: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/master/doc/release-notes.md#release-v6-20191030
Updated age_at_diagnosis to earliest age reported (same age used in OS calculations)
I think we do want some notion of the order in which different samples from the same patient were taken, but it doesn't need to be tied to age at all.
Re: this, we don't have dates, and that would be PII anyway, so I think maybe the thing you can do is go by the SDG ID, which is sample and aliquot ID - @allisonheath do you know if these are always sequential for patients?
Is it possible to have some field that indicates the order of samples instead?
Not looking for dates, looking for rank order (e.g., 1, 2, 3).
@allisonheath - do we have that information? Typically this would be denoted by phase of therapy (initial cns tumor, progression, etc). I think we only recently talked about adding event to the clinical information. This is currently being brainstormed as a 6-month - 1 year(?) or so overhaul with @allisonheath's team.
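To make the request concrete, here is a rough sketch of what a rank-order field could look like. It runs on toy data, and `event_order_key` is a hypothetical ordering variable (nothing like it exists in the histologies file as far as this thread establishes); `sample_rank` is likewise an illustrative column name.

```r
library(dplyr)

# Toy data only. `event_order_key` is a HYPOTHETICAL ordering variable;
# the real file has no such column.
toy_samples <- tibble(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_A", "PT_B"),
  sample_id = c("S-3", "S-1", "S-2", "S-9"),
  event_order_key = c(30, 10, 20, 5)
)

# Rank samples within each participant (1 = earliest), without exposing
# dates or ages in the resulting field.
ranked_samples <- toy_samples %>%
  group_by(Kids_First_Participant_ID) %>%
  mutate(sample_rank = dense_rank(event_order_key)) %>%
  ungroup() %>%
  arrange(Kids_First_Participant_ID, sample_rank)

print(ranked_samples)
```

Only the rank would need to be released; the variable used to derive it could stay internal.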
I think an issue with using tumor_descriptor alone is that there are instances where a combination of Kids_First_Participant_ID and tumor_descriptor is not unique:
> v9_histologies_df %>%
+ filter(sample_type == "Tumor",
+ composition == "Solid Tissue",
+ experimental_strategy != "RNA-Seq") %>%
+ group_by(Kids_First_Participant_ID, tumor_descriptor) %>%
+ tally() %>%
+ arrange(desc(n))
# A tibble: 876 x 3
# Groups: Kids_First_Participant_ID [813]
Kids_First_Participant_ID tumor_descriptor n
<chr> <chr> <int>
1 PT_KZ56XHJT Progressive Disease Post-Mortem 6
2 PT_2WVW55DA Progressive 4
3 PT_KBFM551M Initial CNS Tumor 4
4 PT_KTRJ8TFY Progressive Disease Post-Mortem 4
5 PT_MNSEJCDM Initial CNS Tumor 4
6 PT_1H2REHT2 Progressive 3
7 PT_7JQ24F35 Unavailable 3
8 PT_9GKVQ9QS Initial CNS Tumor 3
9 PT_K8ZV7APT Initial CNS Tumor 3
10 PT_KTRJ8TFY Initial CNS Tumor 3
Okay - so sample_id can be used for the specific participant-level mapping in the Oncoprint stuff that put this issue in motion:
> dup_sample_ids <- histologies_df$sample_id[which(duplicated(histologies_df$sample_id))]
> histologies_df %>% filter(sample_id %in% dup_sample_ids[1:5]) %>% arrange(sample_id)
# A tibble: 10 x 33
Kids_First_Bios… sample_id aliquot_id Kids_First_Part… experimental_st… sample_type composition tumor_descriptor primary_site reported_gender race ethnicity age_at_diagnosi… disease_type_old disease_type_new
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 BS_03SSN1N2 7316-1744 711115 PT_3KK1F95W WGS Tumor Solid Tiss… Initial CNS Tum… Temporal Lo… Female White Not Hisp… 4633 Low-grade gliom… Low-grade gliom…
2 BS_0X9EGHY2 7316-1744 717174 PT_3KK1F95W RNA-Seq Tumor Solid Tiss… Initial CNS Tum… Temporal Lo… Female White Not Hisp… 4633 Low-grade gliom… Low-grade gliom…
3 BS_0DKPGQWD 7316-183 711426 PT_BQ8BQ01J WGS Tumor Solid Tiss… Initial CNS Tum… Frontal Lobe Male White Not Hisp… 8422 Low-grade gliom… Low-grade gliom…
4 BS_0T17SY47 7316-183 717156 PT_BQ8BQ01J RNA-Seq Tumor Solid Tiss… Initial CNS Tum… Frontal Lobe Male White Not Hisp… 8422 Low-grade gliom… Low-grade gliom…
5 BS_0ZR4XA69 7316-1855 711372 PT_7VSY72EK WGS Tumor Solid Tiss… Initial CNS Tum… Skull;Tempo… Male White Not Hisp… 2929 Primary CNS lym… Primary CNS lym…
6 BS_1607397Q 7316-1855 711731 PT_7VSY72EK RNA-Seq Tumor Solid Tiss… Initial CNS Tum… Skull;Tempo… Male White Not Hisp… 2929 Primary CNS lym… Primary CNS lym…
7 BS_05S9WJW6 7316-2659 711416 PT_23NZGSRJ WGS Tumor Solid Tiss… Initial CNS Tum… Cerebellum/… Female White Not Hisp… 1484 Medulloblastoma Medulloblastoma
8 BS_1BWP5MCT 7316-2659 717140 PT_23NZGSRJ RNA-Seq Tumor Solid Tiss… Initial CNS Tum… Cerebellum/… Female White Not Hisp… 1484 Medulloblastoma Medulloblastoma
9 BS_12ZB7R6A 7316-41 478601 PT_CW0BJE0Y WGS Tumor Solid Tiss… Initial CNS Tum… Cerebellum/… Male Repo… Not Repo… 883 Atypical Terato… Atypical Terato…
10 BS_15ETQ0E4 7316-41 549586 PT_CW0BJE0Y RNA-Seq Tumor Solid Tiss… Initial CNS Tum… Cerebellum/… Male Repo… Not Repo… 883 Atypical Terato… Atypical Terato…
# … with 18 more variables: short_histology <chr>, broad_histology <chr>, broad_composition <chr>, Notes <chr>, germline_sex_estimate <chr>, RNA_library <chr>, OS_days <dbl>, OS_status <chr>, cohort <chr>,
# age_last_update_days <dbl>, source_text_tumor_descriptor <chr>, cancer_predispositions <chr>, seq_center <chr>, normal_fraction <dbl>, tumor_fraction <dbl>, glioma_brain_region <chr>, tumor_ploidy <dbl>,
# molecular_subtype <chr>
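As a minimal sketch of the mapping suggested by the output above, the rows can be collapsed to one record per participant and sample_id, listing which experimental strategies share that sample. The four rows below are copied from the printed table; the summary column names (`strategies`, `n_biospecimens`) are made up for illustration.

```r
library(dplyr)

# Four rows taken from the printed output above (only the relevant columns).
toy_histologies <- tibble(
  Kids_First_Biospecimen_ID = c("BS_03SSN1N2", "BS_0X9EGHY2",
                                "BS_0DKPGQWD", "BS_0T17SY47"),
  sample_id = c("7316-1744", "7316-1744", "7316-183", "7316-183"),
  Kids_First_Participant_ID = c("PT_3KK1F95W", "PT_3KK1F95W",
                                "PT_BQ8BQ01J", "PT_BQ8BQ01J"),
  experimental_strategy = c("WGS", "RNA-Seq", "WGS", "RNA-Seq")
)

# One row per participant + sample_id, with the strategies that share it.
sample_level <- toy_histologies %>%
  group_by(Kids_First_Participant_ID, sample_id) %>%
  summarize(
    strategies = paste(sort(experimental_strategy), collapse = ";"),
    n_biospecimens = n(),
    .groups = "drop"
  )

print(sample_level)
```

This only shows that sample_id ties the WGS and RNA-Seq biospecimens together; it says nothing about the order of samples within a participant.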
Hmm, this is tough because post-mortem samples were all taken at autopsy, which only occurs once, so we will have to just choose one.
sample_id, as far as I know, is really an event ID, but what I am not sure of is whether a retired event ID can get reused, and whether an earlier event ID can really correspond to a later event. Allison would know that.
@allisonheath - do you have a rank order for specimens?
What data file(s) does this issue pertain to?
pbta-histologies.tsv
What release are you using?
I am comparing release-v5-20190924 to the current release, release-v9-20191105.
Put a link to the relevant section of the OpenPBTA-manuscript here.
There is nothing in the data harmonization section about this specifically yet (https://github.com/AlexsLemonade/OpenPBTA-manuscript/blob/master/content/03.methods.md#clinical-data-harmonization), but it is related to this open PR: https://github.com/AlexsLemonade/OpenPBTA-manuscript/pull/55/files#diff-1958beef0777ea964ffe394a82c98903R163
Put your question or report your issue here.
For context, the independent-samples module was developed using release-v5-20190924. @jashapiro used the age_at_diagnosis_days column to identify the earliest samples: https://github.com/AlexsLemonade/OpenPBTA-analysis/blob/9c68671a4cba89681046c026eb1658794165e836/analyses/independent-samples/independent-samples.R#L48
I was looking to use a similar strategy as part of the participant-level merging in the Oncoprint pipeline (#243), but the information content of this column has changed. My intuition is that it used to contain something akin to age at timepoint and now it contains, as its name suggests, the age at diagnosis.
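To illustrate why the change matters, here is a small sketch on toy data. It is not the code from independent-samples.R; `earliest_by_age` is a hypothetical helper, and the data frames only imitate the v5-like (age per sample) and v9-like (one age per participant) behavior described above.

```r
library(dplyr)

# Toy v5-like data: the age column varies between samples of a participant.
v5_like <- tibble(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B"),
  sample_id = c("S-1", "S-2", "S-3"),
  age_at_diagnosis_days = c(1000, 1400, 2000)
)

# Toy v9-like data: every sample carries the participant's earliest age.
v9_like <- v5_like %>%
  group_by(Kids_First_Participant_ID) %>%
  mutate(age_at_diagnosis_days = min(age_at_diagnosis_days)) %>%
  ungroup()

# HYPOTHETICAL helper: keep the row(s) with the smallest age per participant.
earliest_by_age <- function(df) {
  df %>%
    group_by(Kids_First_Participant_ID) %>%
    filter(age_at_diagnosis_days == min(age_at_diagnosis_days)) %>%
    ungroup()
}

earliest_v5 <- earliest_by_age(v5_like)  # one row per participant
earliest_v9 <- earliest_by_age(v9_like)  # ties: both PT_A samples are kept

print(nrow(earliest_v5))
print(nrow(earliest_v9))
```

With a single age per participant, every sample ties for "earliest", so the age column alone can no longer pick one sample.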
In the v9 histologies file, there are no instances of multiple ages being associated with the same participant ID:
In v5, that was not the case:
This looks like it changed between v5 and v6:
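A check along these lines could be sketched as follows. The helper name `count_multi_age_participants` is made up, the data are toy values, and the column name age_at_diagnosis_days is assumed to be the same in each release being compared.

```r
library(dplyr)

# HYPOTHETICAL helper: number of participants with more than one distinct
# age value across their rows.
count_multi_age_participants <- function(histologies_df) {
  histologies_df %>%
    group_by(Kids_First_Participant_ID) %>%
    summarize(n_ages = n_distinct(age_at_diagnosis_days), .groups = "drop") %>%
    filter(n_ages > 1) %>%
    nrow()
}

# Toy stand-ins for two releases (not real data).
toy_v5 <- tibble(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B"),
  age_at_diagnosis_days = c(1000, 1400, 2000)
)
toy_v9 <- tibble(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_B"),
  age_at_diagnosis_days = c(1000, 1000, 2000)
)

n_multi_v5 <- count_multi_age_participants(toy_v5)
n_multi_v9 <- count_multi_age_participants(toy_v9)
print(c(v5 = n_multi_v5, v9 = n_multi_v9))
```

Running the same function on the histologies file from each release would show in which release the count drops to zero.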