Granularity for TCGA is too fine

ianfore commented 7 years ago

For the level at which Datamed works, the appropriate granularity for representing TCGA is more likely the 32 cancers that were studied. https://cancergenome.nih.gov/cancersselected

As it stands, Datamed indexes every file in TCGA as a separate dataset. This seems undesirable for at least a couple of reasons.

1. An individual file cannot be considered a "set"; taken together, the files are a set. For the granularity of metadata that Datamed uses, the set might be considered to be either a) the collection of data for one of the 32 diseases, or b) the collection of data of a particular type for one of the 32 diseases.

However, b) misses that the value of the "set" lies in its being a collection with multiple data types from the same subjects.

2. Without the capability in Datamed to search by individual attributes (metadata) of files, serving the results up as long lists of files is not manageable and gives the user little useful functionality. Filtering to produce sets would be useful to users, but it is best left to the dedicated repositories and portals that present TCGA data, which currently allow the creation of subsets based on metadata. That capability is beyond what is currently appropriate for Datamed.
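As a rough illustration of the two candidate set definitions, a) and b), here is a minimal Python sketch. The record fields (`file_id`, `project`, `data_category`) and the sample records are hypothetical and made up for the example; they are not the actual Datamed or GDC schema.

```python
from collections import defaultdict

# Hypothetical per-file records. Field names and values are illustrative only,
# not the real Datamed or GDC metadata.
files = [
    {"file_id": "f1", "project": "TCGA-BRCA", "data_category": "Clinical"},
    {"file_id": "f2", "project": "TCGA-BRCA", "data_category": "Raw sequencing data"},
    {"file_id": "f3", "project": "TCGA-BRCA", "data_category": "Clinical"},
    {"file_id": "f4", "project": "TCGA-LUAD", "data_category": "Clinical"},
]


def group_files(records, keys):
    """Group file IDs into sets keyed by the given metadata fields."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec["file_id"])
    return dict(groups)


# Option a): one dataset per disease (project).
by_project = group_files(files, ["project"])

# Option b): one dataset per disease and data type.
by_project_and_type = group_files(files, ["project", "data_category"])
```

With grouping a), all data types for the same project stay in one entry, which is the property that grouping b) gives up.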

ianfore commented 7 years ago

The table on this TCGA data portal page gives an excellent overview of the sets of data in TCGA. The TARGET project is in there as well as TCGA. https://gdc-portal.nci.nih.gov/projects/t Each row is a disease. Six of the columns indicate the data types (data categories) available.

ianfore commented 7 years ago

This landing page is what I would suggest as an appropriate granularity: https://gdc-portal.nci.nih.gov/projects/TCGA-BRCA It is the link from the first column in the table above and corresponds to what I called 1.a) above.

naturalbeau commented 7 years ago

The GEO dataset indexed by Datamed also has the granularity problem.

yul129 commented 7 years ago

Can you be more specific, Xiaoling?


naturalbeau commented 7 years ago

For example, when we search for "GDS998" in Datamed, we get one result from GEO. But the same search on the GEO website returns 12 results. Besides the dataset returned by Datamed, GEO also returns records at the sample level. GDS998 contains 9 samples, and GEO returns each sample as a separate result. In Datamed, we index only at the dataset level.

jgrethe commented 7 years ago

We may need to have two GEO indices - at two levels of granularity. Question is - how will we handle this in the interface?
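One way to picture two indices at two levels of granularity is the Python sketch below. It is only an assumption about how this could look, not how Datamed is implemented: the function names, the `level` parameter, and the sample IDs are all hypothetical placeholders. Only the dataset ID GDS998 and its count of 9 samples come from the comment above.

```python
# Hypothetical dataset record. The sample IDs are placeholders, not real GEO accessions.
dataset = {
    "id": "GDS998",
    "samples": [f"sample-{i}" for i in range(1, 10)],  # 9 samples, as noted above
}


def build_indices(datasets):
    """Build one index keyed by dataset ID and one keyed by sample ID."""
    dataset_index = {}
    sample_index = {}
    for ds in datasets:
        dataset_index[ds["id"]] = ds
        for sample_id in ds["samples"]:
            sample_index[sample_id] = {"id": sample_id, "dataset": ds["id"]}
    return dataset_index, sample_index


def search(query, dataset_index, sample_index, level="dataset"):
    """Return matching IDs at the requested granularity ("dataset" or "sample")."""
    if level == "dataset":
        return [ds_id for ds_id in dataset_index if ds_id == query]
    return [s_id for s_id, rec in sample_index.items() if rec["dataset"] == query]


dataset_index, sample_index = build_indices([dataset])
dataset_hits = search("GDS998", dataset_index, sample_index)
sample_hits = search("GDS998", dataset_index, sample_index, level="sample")
```

In an interface, the `level` argument could map to a toggle or facet, so the default view stays at the dataset level and sample-level results appear only on request.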

RuilingLiu commented 7 years ago

The metadata of GDC (TCGA + TARGET) has been updated.

https://datamed.org/search-repository.php?query=%20&searchtype=data&repository=0026

ianfore commented 6 years ago

Looks good

biocaddie / prototype_issues

Granularity for TCGA is too fine #162