Closed ianfore closed 6 years ago
The table on this TCGA data portal page gives an excellent overview of the sets of data in TCGA. The TARGET project is in there as well as TCGA. https://gdc-portal.nci.nih.gov/projects/t Each row is a disease. Six of the columns indicate the data types (data categories) available.
I would suggest that this landing page represents an appropriate granularity: https://gdc-portal.nci.nih.gov/projects/TCGA-BRCA It's the link from the first column in the table above and corresponds to what I called 1.a) above.
The GEO dataset indexed by Datamed also has the granularity problem.
Can you be more specific, Xiaoling?
From: Xiaoling [notifications@github.com] Sent: Friday, January 13, 2017 10:19 AM To: biocaddie/prototype_issues Cc: Subscribed Subject: Re: [biocaddie/prototype_issues] Granularity for TCGA is too fine (#162)
For example, when we search for "GDS998" in Datamed, we get one result from GEO. But searching for it on the GEO website returns 12 results. Besides the dataset that Datamed returns, GEO also returns records at the sample level. GDS998 contains 9 samples, and GEO returns each sample as a separate result. In Datamed, we only index at the dataset level.
We may need to have two GEO indices, one at each level of granularity. The question is: how will we handle this in the interface?
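To make the two-index idea concrete, here is a minimal sketch in Python. It is not how Datamed is actually implemented; the `Record` type, the `build_indices` and `search` helpers, and the placeholder sample accessions are all hypothetical. Only the GDS998 accession and its count of 9 samples come from the example above. A `level` parameter is one possible way for the interface to let the user choose the granularity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Record:
    accession: str
    level: str        # "dataset" or "sample"
    parent: str = ""  # accession of the enclosing dataset; empty for datasets


def build_indices(records: List[Record]) -> Dict[str, Dict[str, List[Record]]]:
    """Build one index per granularity level, both keyed by dataset accession."""
    indices: Dict[str, Dict[str, List[Record]]] = {"dataset": {}, "sample": {}}
    for rec in records:
        key = rec.parent or rec.accession
        indices[rec.level].setdefault(key, []).append(rec)
    return indices


def search(indices, accession: str, level: str = "dataset") -> List[Record]:
    """Look up a dataset accession at the requested granularity."""
    return indices[level].get(accession, [])


# Placeholder sample accessions (hypothetical, not real GEO identifiers).
records = [Record("GDS998", "dataset")] + [
    Record(f"SAMPLE_{i}", "sample", parent="GDS998") for i in range(1, 10)
]
indices = build_indices(records)

dataset_hits = search(indices, "GDS998", level="dataset")
sample_hits = search(indices, "GDS998", level="sample")
```

With a default of `level="dataset"`, the interface would keep today's behaviour, and a toggle or facet could switch to the sample-level index when a user wants it.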
The metadata of GDC (TCGA + TARGET) has been updated.
https://datamed.org/search-repository.php?query=%20&searchtype=data&repository=0026
Looks good
Given the level at which Datamed works, the appropriate granularity for representing TCGA is more likely the level of the 32 cancers that were studied. https://cancergenome.nih.gov/cancersselected
As it stands, Datamed indexes every file in TCGA as a separate dataset. This seems undesirable for at least a couple of reasons.
However, b misses the value of the "set" as a collection with multiple data types from the same subjects.
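As a rough illustration of the coarser granularity, the sketch below rolls file-level records up into one entry per project, keeping the set of data categories so that the "collection with multiple data types" aspect is preserved. It is a sketch under assumptions: the field names (`file_id`, `project_id`, `data_category`), the `roll_up` helper, and the sample records are hypothetical and are not taken from the GDC or Datamed schemas.

```python
from collections import defaultdict


def roll_up(file_records):
    """Collapse file-level records into one dataset entry per project."""
    projects = defaultdict(lambda: {"file_count": 0, "data_categories": set()})
    for rec in file_records:
        entry = projects[rec["project_id"]]
        entry["file_count"] += 1
        entry["data_categories"].add(rec["data_category"])
    return dict(projects)


# Illustrative records only; IDs and categories are made up for the example.
files = [
    {"file_id": "f1", "project_id": "TCGA-BRCA", "data_category": "Clinical"},
    {"file_id": "f2", "project_id": "TCGA-BRCA", "data_category": "Copy Number Variation"},
    {"file_id": "f3", "project_id": "TCGA-BRCA", "data_category": "Clinical"},
    {"file_id": "f4", "project_id": "TCGA-LUAD", "data_category": "Clinical"},
]

datasets = roll_up(files)
```

Here four file records become two project-level entries, each of which still reports which data categories it contains.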