cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Current Xena PANCAN_mutation dataset is missing some samples and variables from a previous release #16

Open gwaybio opened 7 years ago

gwaybio commented 7 years ago

I have had this issue in the past (see Zenodo file), and it looks like the current PANCAN_mutation file from Xena has fewer samples and fewer columns than a previous version.

One of the missing columns is the specific nucleotide mutation, and its absence is preventing us from completing #15.

It may be good to ask a direct question to the UCSC Xena Google Group. They have been helpful in the past (see #14)

gwaybio commented 7 years ago

@stephenshank

jingchunzhu commented 7 years ago

Agreed. If you have any questions regarding data from UCSC Xena, the Google Group is the most effective way to get the message to us.

Jing UCSC Xena http://xena.ucsc.edu

gwaybio commented 7 years ago

Thanks @jingchunzhu - your group continues to be very helpful!

For the cognoma community, here is a link to the UCSC Xena google group discussion about this issue

dhimmel commented 7 years ago

@gwaygenomics, with @jingchunzhu's latest reply (quoted below), what's the status of this issue? Basically, should it be closed or is the issue ongoing (and what's the next step to progress forward)?


I have noticed that different versions of PanCancer mutation data have been deposited in the browser here. The current version is not a complete representation of previous versions. It lacks several columns and several samples that were present in a previous version the last time I accessed the data on 12 June 2015.

Yes. These are not from the same release. However I am surprised by "lacking several columns", could you say how the columns are different?

Sample number change is not surprising because we periodically update TCGA data. This particular dataset is compiled by the Xena team at UCSC. In almost all cases, TCGA has multiple versions of mutation calls from several sequencing and analysis groups (Broad, WashU, BCM, and UCSC); in addition, there are curated and automated calls, and different sequencing platforms. So we made an internal decision on which dataset to include, and the exact selection has changed over time, not drastically, but there are changes. These changes will affect sample numbers.

Is there a place where data is versioned besides in the JSON metadata (see discussion here)?

Starting in 2016, we store our release data on AWS S3, which means that all versions of data from 2016 onward will be on S3. We plan to continue doing so as long as there are resources to sustain it. The .json files are part of the data releases and store the version information. Our previous data releases are not on S3. Do you need the previous version that you retrieved on June 12? We can send it to you directly.

Jing

gwaybio commented 7 years ago

@dhimmel - I responded to @jingchunzhu on the google groups but the message was not posted. Not sure what happened here.

My post listed the different columns between the two versions. There were many more columns in the older version. Perhaps @jingchunzhu is looking into it before passing my comment through the moderators?
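For anyone in the cognoma community who wants to reproduce that comparison, here is a minimal sketch of diffing two releases of a tab-separated mutation file. The `sample` column name, the helper names, and the toy rows are assumptions for illustration only, not the actual Xena schema or data.

```python
import csv
import io


def read_table(handle, sample_column="sample"):
    """Return (column names, set of sample IDs) for a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    columns = list(reader.fieldnames or [])
    samples = {row[sample_column] for row in reader}
    return columns, samples


def compare_releases(old_handle, new_handle, sample_column="sample"):
    """Report columns and samples present in the old release but not the new."""
    old_cols, old_samples = read_table(old_handle, sample_column)
    new_cols, new_samples = read_table(new_handle, sample_column)
    return {
        "missing_columns": [c for c in old_cols if c not in new_cols],
        "missing_samples": sorted(old_samples - new_samples),
    }


# Toy data standing in for the two releases (hypothetical values).
old_release = io.StringIO(
    "sample\tgene\teffect\tHGVSc\tHGVSp\n"
    "TCGA-01\tTP53\tmissense\tc.524G>A\tp.R175H\n"
    "TCGA-02\tKRAS\tmissense\tc.35G>A\tp.G12D\n"
)
new_release = io.StringIO(
    "sample\tgene\teffect\n"
    "TCGA-01\tTP53\tmissense\n"
)

diff = compare_releases(old_release, new_release)
print(diff)
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline="")` on the two downloaded releases.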

dhimmel commented 7 years ago

I responded to @jingchunzhu on the google groups but the message was not posted. Not sure what happened here.

@gwaygenomics good to know. Give it time -- there is a delay between posting and the message appearing (perhaps an approval stage with a poor user experience). I actually posted a suggestion to move the Google Group to GitHub issues to avoid these blocks, although this post is also currently hidden.

jingchunzhu commented 7 years ago

I don't see either of the two messages. Not sure what's going on. Sorry.

> Perhaps @jingchunzhu is looking into it before passing my comment through the moderators?

Greg, I don't know if the message will show up at all. Can you email me with your post that did not get through?

dhimmel commented 7 years ago

@jingchunzhu, every time I post to the Google Group there is a substantial delay till it appears. I'm pretty sure the messages will show up if we wait.

dhimmel commented 7 years ago

@jingchunzhu I'm starting to think that @gwaygenomics's posts and mine may actually be permanently missing this time. Is the Google Group moderated, and if so, can you confirm that our posts are not waiting on approval?

jingchunzhu commented 7 years ago

Yes. It is moderated. I think the delay is because Mary is off on vacation till next Monday, so all incoming posts are in the to-be-approved queue. I will ask her to give me approval permission after she comes back, or perhaps see if there is a feature in Google Groups that lets some people or accounts bypass moderation.

dhimmel commented 7 years ago

So it seems like one of the reasons for missing samples could be the upgrade to hg19. I'm content with the fluctuation in sample number -- we'll work with whatever the latest release from Xena contains.

@gwaygenomics, it seems that there is still one outstanding question before we can close this issue. You mention that a previous release of PANCAN_mutation contained an extra column that could help us interpret the amino acid effect of a variant. What exactly was this variable and its name?

gwaybio commented 7 years ago

@dhimmel @jingchunzhu sorry for the late response. The exact variables are HGVSc and HGVSp - they are the actual mutation calls at the DNA level and at the protein level.

Thanks!
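To illustrate what those two variables carry, here is a rough sketch that parses simple substitution strings in HGVS-style notation. It assumes values look like `c.524G>A` (DNA level) and `p.R175H` (protein level, one-letter amino acid codes); the exact formatting in the older Xena file is not confirmed here, and `parse_substitution` is a hypothetical helper that ignores indels, splice variants, and three-letter codes.

```python
import re

# Simple substitutions only; full HGVS nomenclature covers many more cases.
HGVSC_SUBSTITUTION = re.compile(
    r"^c\.(?P<position>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)
HGVSP_SUBSTITUTION = re.compile(
    r"^p\.(?P<ref>[A-Z])(?P<position>\d+)(?P<alt>[A-Z*])$"
)


def parse_substitution(value):
    """Parse a simple HGVSc or HGVSp substitution; return None if unsupported."""
    patterns = (("DNA", HGVSC_SUBSTITUTION), ("protein", HGVSP_SUBSTITUTION))
    for level, pattern in patterns:
        match = pattern.match(value)
        if match:
            return {
                "level": level,
                "position": int(match["position"]),
                "ref": match["ref"],
                "alt": match["alt"],
            }
    return None


print(parse_substitution("c.524G>A"))
print(parse_substitution("p.R175H"))
```

Anything the two patterns do not recognize returns `None`, so unsupported variant types can be counted rather than silently misread.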