Closed jaclyn-taroni closed 1 year ago
Hi Geoff,
I checked our histologies files on CAVATICA (v22), in the GitHub repository (current commit mentioned above), and in the data download (v22), and all md5sums match. If the data you are comparing against the current histologies file is derived from an earlier iteration of a histologies file, then you should use the current (v22) release. Alternatively, if you are interested in continual updates to the histologies information, you can head over to OpenPedCan (https://github.com/PediatricOpenTargets/OpenPedCan-analysis), where we are releasing additional PBTA and pan-cancer data in the same fashion. The histologies file will be updated with every release, and the differences you mention can be expected, as the input from ~32 CBTN sites does change from time to time. This can happen when mistakes are corrected, survival data is updated, or data is added or dropped.
Please let me know if you have any additional clinical data related questions!
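For anyone repeating the checksum comparison described above, here is a minimal Python sketch. The function name `md5sum` and the file paths in the usage note are illustrative assumptions, not part of any OpenPBTA tooling.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two copies of a histologies file can then be compared with something like `md5sum("download/pbta-histologies.tsv") == md5sum("cavatica/pbta-histologies.tsv")`, where both paths are hypothetical local locations.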
Hello @jharenza,
Thank you for your response! I looked at the histologies.tsv file from OpenPedCan, and the metadata matches what is found in the pbta-histologies.tsv file from OpenPBTA. However, what I am seeing in CAVATICA in terms of metadata associated with the .bam file does not match.
Ex. For BS_6DCSD5Y6 it is reported that the 'age at diagnosis' is 2724 in pbta-histologies.tsv and histologies.tsv, however on CAVATICA it is shown as 2955 (see attached image).
Ahhh - BAM files! I am not sure at what interval, if any, that clinical data is patched, but I can loop in @zhangb1 here, who does that - I imagine if he does a patch tomorrow, for example, it will change. Was this a project we had set up for you, or is this within the CBTN project on CAVATICA, or was this a situation in which you got files from Kids First and then that portal's data was ported over? Just trying to figure out the source of the data, the timing, whether or not it gets updated at all, and whether we can have a strategy for updates in CAVATICA/Kids First. Thanks!
Thanks @jharenza. I talked with our data coordinator and these BAM files are from within the CBTN project on CAVATICA. Is it possible there are two different age metadata fields and CAVATICA and OpenPBTA are both reporting them as "age_at_diagnosis"?
Hi! Yes, that is what is happening - there are a few different fields and they are named differently across the multiple systems, so I am working with @baileyckelly to get the available fields we can pull into CAVATICA from Kids First to match the correct field to age_at_diagnosis. The age_at_diagnosis in the histologies file is the current/correct age and what is in CAVATICA looks to be the age at latest clinical update. Thanks for your patience.
Thank you for looking into this! Do you think the mismatch for the primary site of BS_9CA93S6D, which has the primary_site value of "Not Reported" in CAVATICA and "Brain Stem- Pons" in OpenPBTA, is because this field has not been updated in CAVATICA yet?
Also the tumor_descriptor for BS_SD8YBRBR is "Initial CNS Tumor" in OpenPBTA and "Progressive" in CAVATICA (another mismatch in tumor_descriptor seen in BS_W4DB5RP1). I'm not sure whether this is an entry error or whether there are multiple samples for the same patient and metadata is being copied over the fields.
Hi @GeoffLyle - you should use what is in OpenPBTA, it is more up-to-date. I am not sure when CAVATICA was last updated, but it is out of date with our current databases. I can talk to our team to have that updated at some interval, but as of now, it was more of a one-time clinical patch.
@jharenza Great! My main concern was finding out which data source was most up-to-date. I will use the metadata from OpenPBTA for our analyses. That answers all the questions on my end!
I'm going to go ahead and close this issue in that case. Thanks for your help @jharenza! @GeoffLyle please post any additional questions here and I'll re-open as needed.
Reported by @GeoffLyle via email, quoting relevant parts here:
Here is the xlsx file: Cavatica_OpenPBTAhistologies_metadata_comparison_Oct2022.xlsx
Geoff notes that only these 15 samples were reviewed, so this may not capture all potential mismatches.
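Since only 15 samples were reviewed by hand, a scripted comparison could extend the check to every shared sample. The following is a minimal Python sketch, not the method used to build the spreadsheet. It assumes both sources have been exported as tab-separated tables that share an ID column and field names; the column name `sample_id`, the helper names, and the inline toy tables are hypothetical, and the real files may name these columns differently. The single example row uses the age values quoted in this thread.

```python
import csv
import io


def load_by_id(handle, id_column):
    """Read a tab-separated table into a dict keyed by the ID column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_column]: row for row in reader}


def find_mismatches(reference, other, fields):
    """Return (sample_id, field, reference_value, other_value) tuples
    for every listed field that differs between the two tables."""
    mismatches = []
    # Only samples present in both tables can be compared.
    for sample_id in sorted(reference.keys() & other.keys()):
        for field in fields:
            ref_value = reference[sample_id].get(field, "")
            other_value = other[sample_id].get(field, "")
            if ref_value != other_value:
                mismatches.append((sample_id, field, ref_value, other_value))
    return mismatches


# Toy stand-ins for the two exports (hypothetical layout, assumed column names).
histologies_tsv = "sample_id\tage_at_diagnosis\nBS_6DCSD5Y6\t2724\n"
cavatica_tsv = "sample_id\tage_at_diagnosis\nBS_6DCSD5Y6\t2955\n"

histologies = load_by_id(io.StringIO(histologies_tsv), "sample_id")
cavatica = load_by_id(io.StringIO(cavatica_tsv), "sample_id")

mismatches = find_mismatches(histologies, cavatica, ["age_at_diagnosis"])
for sample_id, field, ref_value, other_value in mismatches:
    print(f"{sample_id}\t{field}\thistologies={ref_value}\tCAVATICA={other_value}")
```

With real exports, the `io.StringIO` objects would be replaced by open file handles, and the field list extended to include columns such as primary_site and tumor_descriptor.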