Open brisk022 opened 7 years ago
These results can be quite challenging to deal with, for sure. Taking a quick look at the results of:
library(GenomicDataCommons)
library(magrittr)  # provides the %>% pipe
res = files() %>%
    GenomicDataCommons::select(NULL) %>%
    GenomicDataCommons::expand("cases.samples") %>%
    results()
Relying on the "print" method for complex R data structures can be misleading. I use the str function quite regularly (with switches like list.len to limit output sizes). str(res, list.len=5) shows:
List of 3
$ cases :List of 10
..$ c5c4b4a3-3224-4a72-a883-c99c7747e47b:'data.frame': 1 obs. of 1 variable:
.. ..$ samples:List of 1
.. .. ..$ :'data.frame': 1 obs. of 26 variables:
.. .. .. ..$ sample_type_id : chr "03"
.. .. .. ..$ updated_datetime : chr "2017-03-04T16:37:25.946840-06:00"
.. .. .. ..$ time_between_excision_and_freezing: logi NA
.. .. .. ..$ oct_embedded : logi NA
.. .. .. ..$ tumor_code_id : logi NA
.. .. .. .. [list output truncated]
..$ dd029237-d470-4b58-9cf0-11753fa60972:'data.frame': 1 obs. of 1 variable:
.. ..$ samples:List of 1
.. .. ..$ :'data.frame': 1 obs. of 26 variables:
.. .. .. ..$ sample_type_id : chr "10"
.. .. .. ..$ updated_datetime : chr "2017-03-04T16:37:25.946840-06:00"
.. .. .. ..$ time_between_excision_and_freezing: logi NA
.. .. .. ..$ oct_embedded : chr "false"
.. .. .. ..$ tumor_code_id : logi NA
.. .. .. .. [list output truncated]
..$ 3fe677f6-8329-447c-b999-5e70582624aa:'data.frame': 1 obs. of 1 variable:
.. ..$ samples:List of 1
.. .. ..$ :'data.frame': 1 obs. of 26 variables:
.. .. .. ..$ sample_type_id : chr "01"
.. .. .. ..$ updated_datetime : chr "2017-03-04T16:37:25.946840-06:00"
.. .. .. ..$ time_between_excision_and_freezing: logi NA
.. .. .. ..$ oct_embedded : chr "true"
.. .. .. ..$ tumor_code_id : logi NA
.. .. .. .. [list output truncated]
..$ 619c9069-53a8-4581-92a3-be1896fe7f66:'data.frame': 1 obs. of 1 variable:
.. ..$ samples:List of 1
.. .. ..$ :'data.frame': 1 obs. of 26 variables:
.. .. .. ..$ sample_type_id : chr "01"
.. .. .. ..$ updated_datetime : chr "2017-03-04T16:37:25.946840-06:00"
.. .. .. ..$ time_between_excision_and_freezing: logi NA
.. .. .. ..$ oct_embedded : chr "true"
.. .. .. ..$ tumor_code_id : logi NA
.. .. .. .. [list output truncated]
..$ 30d8e7a6-675a-4999-a120-62add06bff3c:'data.frame': 1 obs. of 1 variable:
.. ..$ samples:List of 1
.. .. ..$ :'data.frame': 1 obs. of 26 variables:
.. .. .. ..$ sample_type_id : chr "01"
.. .. .. ..$ updated_datetime : chr "2017-03-04T16:37:25.946840-06:00"
.. .. .. ..$ time_between_excision_and_freezing: logi NA
.. .. .. ..$ oct_embedded : chr "false"
.. .. .. ..$ tumor_code_id : logi NA
.. .. .. .. [list output truncated]
.. [list output truncated]
$ file_id: chr [1:10] "c5c4b4a3-3224-4a72-a883-c99c7747e47b" "dd029237-d470-4b58-9cf0-11753fa60972" "3fe677f6-8329-447c-b999-5e70582624aa" "619c9069-53a8-4581-92a3-be1896fe7f66" ...
$ id : chr [1:10] "c5c4b4a3-3224-4a72-a883-c99c7747e47b" "dd029237-d470-4b58-9cf0-11753fa60972" "3fe677f6-8329-447c-b999-5e70582624aa" "619c9069-53a8-4581-92a3-be1896fe7f66" ...
- attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
- attr(*, "class")= chr [1:3] "GDCfilesResults" "GDCResults" "list"
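As a small aside on inspecting output like this: str also accepts max.level, which caps how deep the display goes. The sketch below uses a made-up nested list (the names and values are hypothetical, not real GDC data) just to show the two switches side by side.

```r
# Hypothetical nested structure, loosely shaped like the result above.
nested <- list(
  cases = list(
    "file-a" = list(samples = list(data.frame(sample_type_id = "03"))),
    "file-b" = list(samples = list(data.frame(sample_type_id = "10")))
  ),
  id = c("file-a", "file-b")
)

# list.len limits how many elements are shown per list;
# max.level limits how many levels of nesting are expanded.
str(nested, max.level = 2, list.len = 5)

# Capture both views to compare how much each one prints.
top_level <- capture.output(str(nested, max.level = 1))
full <- capture.output(str(nested))
```

Starting with a low max.level and raising it one step at a time is often enough to find where the data frames are hiding.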
Note that each element of $cases has a "samples" data.frame embedded in it. One possible approach (of several) is to use the purrr package to do some further manipulation.
purrr::flatten(res$cases) %>%
    purrr::set_names(res$id) %>%
    purrr::flatten_df(.id = "file_id")
This flattens the cases list, sets the names of the resulting flattened list back to the file_id so that we don't lose track of which sample goes with which file, and then flattens the "rows" of the samples into one big data frame, assigning the names of the list (the file_ids) to the column specified by the .id argument. The results are shown here:
file_id sample_type_id
1 c5c4b4a3-3224-4a72-a883-c99c7747e47b 03
2 dd029237-d470-4b58-9cf0-11753fa60972 10
3 3fe677f6-8329-447c-b999-5e70582624aa 01
4 619c9069-53a8-4581-92a3-be1896fe7f66 01
5 30d8e7a6-675a-4999-a120-62add06bff3c 01
6 17bac11f-78c2-4921-bb3a-03c5c1afbd37 01
7 17bac11f-78c2-4921-bb3a-03c5c1afbd37 10
8 725f5ede-f22b-4422-a9e0-66646538121d 01
9 c85a6f34-7b6b-4677-beac-44f06bcc5c32 01
10 acd76d89-a4d7-47ea-a1c2-480a3d200634 01
11 9c51ff3a-6c88-4f17-9c33-c630a6d10ea3 01
updated_datetime time_between_excision_and_freezing
1 2017-03-04T16:37:25.946840-06:00 NA
2 2017-03-04T16:37:25.946840-06:00 NA
3 2017-03-04T16:37:25.946840-06:00 NA
4 2017-03-04T16:37:25.946840-06:00 NA
5 2017-03-04T16:37:25.946840-06:00 NA
6 2017-03-04T16:37:25.946840-06:00 NA
7 2017-03-04T16:37:25.946840-06:00 NA
8 2017-03-04T16:37:25.946840-06:00 NA
9 2017-03-04T16:37:25.946840-06:00 NA
10 2017-03-04T16:37:25.946840-06:00 NA
11 2017-03-04T16:37:25.946840-06:00 NA
oct_embedded tumor_code_id submitter_id intermediate_dimension
1 <NA> NA TCGA-AB-2904-03A <NA>
2 false NA TCGA-EL-A3H5-10A <NA>
3 true NA TCGA-IA-A83W-01A <NA>
4 true NA TCGA-ZG-A9ND-01A <NA>
5 false NA TCGA-QH-A870-01A <NA>
6 <NA> NA TCGA-77-6842-01A 0.8
7 <NA> NA TCGA-77-6842-10A <NA>
8 false NA TCGA-A8-A06Z-01A <NA>
9 true NA TCGA-AN-A0XW-01A <NA>
10 <NA> NA TCGA-DU-7014-01A 1
11 true NA TCGA-DD-A4NR-01A <NA>
sample_id is_ffpe
1 44992adb-cabf-4c2f-9f3b-45cf97531319 FALSE
2 fc5aa545-07cc-4ad8-9eba-5b5d4ea186fb FALSE
3 2e4dfa77-839a-445d-beef-60b6396adf0c FALSE
4 949f85dd-0d5d-4b3f-a7a9-a2ddc5becf3b FALSE
5 c1c87e01-efc3-433f-ba79-4a79b292870b FALSE
6 c6d0652b-d41a-4706-a5b2-5d86b10fae96 FALSE
7 1e94ef05-bdba-4811-b1da-d2380d0d5fbe FALSE
8 993d2cba-b4f8-4a46-994b-b97bb9f10d34 FALSE
9 38bf35cd-b2f7-4532-9bff-d95cfe2cafd5 FALSE
10 050f26d9-b105-412a-9e5b-36840a1843e3 FALSE
11 71bc4fd0-374f-423e-a8b4-ae1bceceda83 FALSE
pathology_report_uuid created_datetime tumor_descriptor
1 <NA> NA NA
2 <NA> NA NA
3 10CCB12F-77E0-4100-A87A-0D36E5AF7F8B NA NA
4 89122E71-1246-44A5-9D44-4F95284EB02E NA NA
5 C84F6D0C-3879-4D63-8CE8-10D03D3C71A3 NA NA
6 4f5574a0-e1d7-427c-8f03-d08cb0b264a4 NA NA
7 <NA> NA NA
8 956F45E5-A8C6-4A4A-9D1F-D31912180584 NA NA
9 5CBC6417-4E3D-4E9C-AE93-A56B777EF2F4 NA NA
10 a848a22c-c92d-42cc-8e0b-8e260b1f5622 NA NA
11 93133C58-B1BF-4A5C-8EFC-AE17EC6A0B64 NA NA
sample_type state current_weight
1 Primary Blood Derived Cancer - Peripheral Blood live NA
2 Blood Derived Normal live NA
3 Primary Tumor live NA
4 Primary Tumor live NA
5 Primary Tumor live NA
6 Primary Tumor live NA
7 Blood Derived Normal live NA
8 Primary Tumor live NA
9 Primary Tumor live NA
10 Primary Tumor live NA
11 Primary Tumor live NA
composition time_between_clamping_and_freezing shortest_dimension tumor_code
1 NA NA <NA> NA
2 NA NA <NA> NA
3 NA NA <NA> NA
4 NA NA <NA> NA
5 NA NA <NA> NA
6 NA NA 0.5 NA
7 NA NA <NA> NA
8 NA NA <NA> NA
9 NA NA <NA> NA
10 NA NA 0.6 NA
11 NA NA <NA> NA
tissue_type days_to_sample_procurement freezing_method preservation_method
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
7 NA NA NA NA
8 NA NA NA NA
9 NA NA NA NA
10 NA NA NA NA
11 NA NA NA NA
days_to_collection initial_weight longest_dimension
1 NA NA <NA>
2 1508 NA <NA>
3 3607 140 <NA>
4 283 80 <NA>
5 1 450 <NA>
6 NA NA 1.2
7 NA NA <NA>
8 1003 330 <NA>
9 121 270 <NA>
10 NA NA 1
11 2652 240 <NA>
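For anyone who prefers not to depend on purrr, the same idea (keep the file id next to each sample row, then stack the rows) can be sketched in base R. This is only an illustration on made-up data: mock_cases and its values are hypothetical, and I am assuming each case holds its sample data frames in a list called samples, as in the str output above.

```r
# Hypothetical stand-in for res$cases, named by file id (not real GDC data).
sample_a <- data.frame(sample_type_id = "03",
                       submitter_id = "S-A-03A",
                       stringsAsFactors = FALSE)
sample_b <- data.frame(sample_type_id = c("01", "10"),
                       submitter_id = c("S-B-01A", "S-B-10A"),
                       stringsAsFactors = FALSE)
mock_cases <- list(
  "file-a" = list(samples = list(sample_a)),
  "file-b" = list(samples = list(sample_b))
)

# For each file, stack its sample data frames and prepend the file id.
per_file <- Map(function(file_id, case) {
  samples <- do.call(rbind, case$samples)
  data.frame(file_id = file_id, samples, stringsAsFactors = FALSE)
}, names(mock_cases), mock_cases)

# Stack all files into one data frame with plain integer row names.
flat <- do.call(rbind, per_file)
rownames(flat) <- NULL
print(flat)
```

Note that rbind requires every sample data frame to have the same columns, so real results with varying fields may need extra handling.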
Thanks for the explanation! Could you add a similar example to the vignette? I think it is general enough and it would be very useful because a lot of data is nested quite deeply.
The package vignette provides an example for expanding first-level fields to obtain a data frame. However, the approach does not work for more deeply nested fields: the result is a list with all children of the samples field concatenated into a comma-separated string without field names. This is of limited utility as the order of the fields cannot be trusted, so the values cannot be reliably mapped back to field names. The only work-around I could find was to provide a custom response handler to prevent jsonlite from simplifying the vectors (and consequently other structures). However, it would be nice if such expansion happened automatically when results() is called.
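To illustrate the work-around described above, here is a minimal sketch of parsing a response without simplification. It only shows the jsonlite side: the JSON text is a hypothetical payload shaped loosely like a files response (not real GDC data), and how a handler like this would be plugged into GenomicDataCommons is not shown here.

```r
library(jsonlite)

# Hypothetical payload; ids and fields are made up for illustration.
txt <- '{"data": {"hits": [
  {"id": "file-1", "cases": [{"samples": [
      {"sample_type_id": "01", "is_ffpe": false}]}]},
  {"id": "file-2", "cases": [{"samples": [
      {"sample_type_id": "10", "is_ffpe": false},
      {"sample_type_id": "01", "is_ffpe": false}]}]}
]}}'

# simplifyVector = FALSE keeps everything as nested lists, so field names
# stay attached to their values instead of being collapsed.
raw <- fromJSON(txt, simplifyVector = FALSE)

# Walk hits -> cases -> samples and build one row per sample.
# Assumes no field is null in the payload; nulls would need extra handling.
rows <- do.call(rbind, lapply(raw$data$hits, function(hit) {
  do.call(rbind, lapply(hit$cases, function(case) {
    do.call(rbind, lapply(case$samples, function(s) {
      data.frame(file_id = hit$id,
                 sample_type_id = s$sample_type_id,
                 is_ffpe = s$is_ffpe,
                 stringsAsFactors = FALSE)
    }))
  }))
}))
print(rows)
```

The cost of this approach is that all the reshaping has to be done by hand, which is why automatic expansion in the package would be preferable.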