AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Metadata mismatches (age, tumor descriptor, diagnosis) between download and CAVATICA #1614

Closed jaclyn-taroni closed 1 year ago

jaclyn-taroni commented 2 years ago

Reported by @GeoffLyle via email, quoting relevant parts here:

[We] found an inconsistency in our records regarding patient age (PT_CSZ9QA1N / C15498). We then looked further into the metadata reported in the pbta-histologies.tsv file (from commit: e03222f99311e0231c993fbc36fa344bdcb6b75a) and found some mismatches between what was reported there and what is on CAVATICA.

We believe these are the same samples in both databases as they match on sample id, Kids First Biospecimen ID, Kids First Participant ID as well as on demographics like gender, race and ethnicity. We found some mismatches on diagnosis, tumor descriptor and age. We would like to use this metadata in our analyses and were hoping you could clear up which data we should use.

I've attached a .xlsx file containing 15 RNA-Seq samples highlighting where the metadata does not match (first mismatch is in cell J2).

Here is the xlsx file: Cavatica_OpenPBTAhistologies_metadata_comparison_Oct2022.xlsx

Geoff notes that only these 15 samples were reviewed, so this may not capture all potential mismatches.

jharenza commented 2 years ago

Hi Geoff,

I checked our histologies files on CAVATICA (v22), in the GitHub repository (the commit mentioned above), and in the data download (v22), and all md5sums match. If the data you are comparing against the current histologies file is derived from an earlier iteration of a histologies file, then you should use the current (v22) release. Alternatively, if you are interested in continual updates to the histologies information, you can head over to OpenPedCan (https://github.com/PediatricOpenTargets/OpenPedCan-analysis), in which we are releasing additional PBTA and pan-cancer data in the same fashion. The histologies file will be updated with every release, and the differences you mention can be expected, as the input from ~32 CBTN sites does change from time to time. This can occur when mistakes are corrected, survival data are updated, or data are added or dropped.

Please let me know if you have any additional clinical data related questions!
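For anyone who wants to repeat the md5sum check on their own copies of the histologies file, here is a minimal Python sketch. It is not the procedure used above, just one way to do the same comparison with the standard library; the function names `md5sum` and `all_identical` are made up for illustration.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths):
    """True if every file in `paths` has the same MD5 digest."""
    return len({md5sum(path) for path in paths}) == 1
```

Calling `all_identical` on the local paths of the CAVATICA copy, the GitHub copy, and the data download copy (wherever you saved them) returns `True` only when the three files are byte-for-byte identical.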

GeoffLyle commented 1 year ago

Hello @jharenza,

Thank you for your response! I looked at the histologies.tsv file from OpenPedCan, and the metadata matches what is found in the pbta-histologies.tsv file from OpenPBTA. However, what I am seeing in CAVATICA in terms of metadata associated with the .bam file does not match.

Ex. For BS_6DCSD5Y6 it is reported that the 'age at diagnosis' is 2724 in pbta-histologies.tsv and histologies.tsv, however on CAVATICA it is shown as 2955 (see attached image).

[Attached image: Screen Shot 2022-10-20 at 5 26 00 PM]
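To put the two values in perspective, here is a small sketch of the arithmetic. It assumes both numbers are ages in days, which the thread does not state explicitly, and uses 365.25 days per year as a rough conversion.

```python
DAYS_PER_YEAR = 365.25  # rough conversion; assumes the ages are given in days


def days_to_years(days):
    """Convert an age in days to approximate years."""
    return days / DAYS_PER_YEAR


histologies_age = 2724  # value reported in pbta-histologies.tsv and histologies.tsv
cavatica_age = 2955     # value shown on CAVATICA for the same biospecimen

gap_days = cavatica_age - histologies_age
print(f"gap: {gap_days} days (~{days_to_years(gap_days):.2f} years)")
```

Under that assumption the two records differ by 231 days, which is too large to be a rounding difference.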

jharenza commented 1 year ago

Ahhh - BAM files! I am not sure at what interval that clinical data is patched, or whether it is patched regularly at all, but I can loop in @zhangb1 here, who does that. I imagine if he does a patch tomorrow, for example, it will change. Was this a project we had set up for you, is this within the CBTN project on CAVATICA, or was this a situation in which you got files from Kids First and then that portal's data was ported over? Just trying to figure out the source of the data/timing, whether it gets updated at all, and whether we can have a strategy for updates in CAVATICA/Kids First. Thanks!

GeoffLyle commented 1 year ago

Thanks @jharenza. I talked with our data coordinator and these BAM files are from within the CBTN project on CAVATICA. Is it possible there are two different age metadata fields and CAVATICA and OpenPBTA are both reporting them as "age_at_diagnosis"?

jharenza commented 1 year ago

Hi! Yes, that is what is happening - there are a few different fields and they are named differently across the multiple systems, so I am working with @baileyckelly to get the available fields we can pull into CAVATICA from Kids First to match the correct field to age_at_diagnosis. The age_at_diagnosis in the histologies file is the current/correct age and what is in CAVATICA looks to be the age at latest clinical update. Thanks for your patience.

GeoffLyle commented 1 year ago

Thank you for looking into this! Do you think the mismatch for the primary site of BS_9CA93S6D, which has the primary_site value of "Not Reported" in CAVATICA and "Brain Stem- Pons" in OpenPBTA, is because this field has not been updated in CAVATICA yet? Also, the tumor_descriptor for BS_SD8YBRBR is "Initial CNS Tumor" in OpenPBTA and "Progressive" in CAVATICA (another tumor_descriptor mismatch is seen in BS_W4DB5RP1). I'm not sure whether this is an entry error or whether there are multiple samples for the same patient and metadata is being copied over the fields.
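For reference, a field-by-field comparison like the one described in this thread can be sketched in Python as follows. This is an illustrative sketch, not the method used to build the attached spreadsheet. The helper names `load_table` and `find_mismatches` are hypothetical, the field names are the ones used in this thread (the actual column headers in either export may differ), and the example records contain only the values quoted above.

```python
import csv


def load_table(path, key):
    """Read a TSV file into {key value: row dict}. `key` is the ID column name."""
    with open(path, newline="") as handle:
        return {row[key]: row for row in csv.DictReader(handle, delimiter="\t")}


def find_mismatches(left, right):
    """List (id, field, left value, right value) for every shared field that differs."""
    mismatches = []
    # Only compare records and fields that are present in both sources.
    for sample_id in sorted(left.keys() & right.keys()):
        shared_fields = sorted(left[sample_id].keys() & right[sample_id].keys())
        for field in shared_fields:
            a = left[sample_id][field].strip()
            b = right[sample_id][field].strip()
            if a != b:
                mismatches.append((sample_id, field, a, b))
    return mismatches


# Values quoted in this thread, keyed by Kids First Biospecimen ID.
openpbta = {
    "BS_6DCSD5Y6": {"age_at_diagnosis": "2724"},
    "BS_9CA93S6D": {"primary_site": "Brain Stem- Pons"},
    "BS_SD8YBRBR": {"tumor_descriptor": "Initial CNS Tumor"},
}
cavatica = {
    "BS_6DCSD5Y6": {"age_at_diagnosis": "2955"},
    "BS_9CA93S6D": {"primary_site": "Not Reported"},
    "BS_SD8YBRBR": {"tumor_descriptor": "Progressive"},
}

for row in find_mismatches(openpbta, cavatica):
    print(*row, sep=" | ")
```

With real files, `load_table` would be called once per source with that file's ID column name, and the two resulting dictionaries passed to `find_mismatches`. Columns that are named differently in the two systems would need to be renamed first, since only identically named fields are compared.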

jharenza commented 1 year ago

Hi @GeoffLyle - you should use what is in OpenPBTA; it is more up to date. I am not sure when CAVATICA was last updated, but it is out of date relative to our current databases. I can talk to our team about having it updated at some interval, but so far it has been more of a one-time clinical patch.

GeoffLyle commented 1 year ago

@jharenza Great! My main concern was finding out which data source was most up-to-date. I will use the metadata from OpenPBTA for our analyses. That answers all the questions on my end!

jaclyn-taroni commented 1 year ago

I'm going to go ahead and close this issue in that case. Thanks for your help @jharenza! @GeoffLyle please post any additional questions here and I'll re-open as needed.