Bioconductor / GenomicDataCommons

Provide R access to the NCI Genomic Data Commons portal.
http://bioconductor.github.io/GenomicDataCommons/

expand for deeply-nested fields #47

Open brisk022 opened 7 years ago

brisk022 commented 7 years ago

The package vignette provides an example of expanding first-level fields to obtain a data frame. However, the approach does not work for more deeply nested fields. For example,

files() %>% 
   GenomicDataCommons::select(NULL) %>%
   GenomicDataCommons::expand("cases.samples") %>%
   results()

produces a list with all children of the samples field concatenated into a comma-separated string without field names, e.g.

$cases
$cases$`3fe677f6-8329-447c-b999-5e70582624aa`
samples
1 01, 2017-03-04T16:37:25.946840-06:00, NA, true, NA, TCGA-IA-A83W-01A, NA, 2e4dfa77-839a-445d-beef-60b6396adf0c, FALSE, 10CCB12F-77E0-4100-A87A-0D36E5AF7F8B, NA, NA, Primary Tumor, live, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3607, 140, NA

This is of limited utility because the order of the fields cannot be trusted, so the values cannot be reliably mapped back to field names. The only workaround I could find was to provide a custom response handler that prevents jsonlite from simplifying the vectors (and consequently other structures).

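library(GenomicDataCommons)
library(magrittr)  # provides %>% and %$%
library(dplyr)     # provides bind_rows()

# Parse the raw JSON without simplification so nested fields keep their names.
respHandler <- function(txt, ...) { jsonlite::fromJSON(txt, simplifyVector = FALSE) }
files() %>%
   GenomicDataCommons::select(NULL) %>%
   GenomicDataCommons::expand("cases.samples") %>%
   response(response_handler = respHandler) %$%
   # Flatten each hit into a named vector, then bind the hits as rows.
   lapply(results, unlist, recursive = TRUE) %>%
   lapply(as.list) %>%
   bind_rows()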

However, it would be nice if such expansion happened automatically when results are called.
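
To illustrate why turning off simplification keeps the field names, here is a minimal offline sketch. It assumes only that jsonlite is installed; the JSON string is an invented toy that imitates the nesting of a files hit, not an actual GDC response.

```r
library(jsonlite)

# Toy JSON imitating two "files" hits with nested cases.samples.
# All ids and values are made up for illustration.
txt <- '[
  {"id": "f1", "cases": [{"samples": [{"sample_type_id": "01", "is_ffpe": false}]}]},
  {"id": "f2", "cases": [{"samples": [{"sample_type_id": "10", "is_ffpe": false}]}]}
]'

# Keep everything as nested lists instead of letting jsonlite simplify.
hits <- fromJSON(txt, simplifyVector = FALSE)

# unlist() builds dotted names from the nesting path, so every value stays
# labelled. Note that all values are coerced to character.
flat <- lapply(hits, function(h) unlist(h, recursive = TRUE))
names(flat[[1]])

# One row per hit, using base R only.
rows <- lapply(flat, function(x) as.data.frame(as.list(x), stringsAsFactors = FALSE))
df <- do.call(rbind, rows)
df
```

In this toy every hit has exactly one case with one sample. A hit with several samples would produce repeated names after unlist(), so real data probably needs extra handling.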

seandavi commented 7 years ago

These results can be quite challenging to deal with, for sure. Take a quick look at the results of:

res = files() %>% 
   GenomicDataCommons::select(NULL) %>%
   GenomicDataCommons::expand("cases.samples") %>%
   results()

Relying on the "print" method for complex R data structures can be misleading. I use the str function quite regularly (with switches like list.len to limit output sizes). str(res, list.len=5) shows:

List of 3
 $ cases  :List of 10
  ..$ c5c4b4a3-3224-4a72-a883-c99c7747e47b:'data.frame':    1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':  1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "03"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : logi NA
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  ..$ dd029237-d470-4b58-9cf0-11753fa60972:'data.frame':    1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':  1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "10"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : chr "false"
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  ..$ 3fe677f6-8329-447c-b999-5e70582624aa:'data.frame':    1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':  1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "01"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : chr "true"
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  ..$ 619c9069-53a8-4581-92a3-be1896fe7f66:'data.frame':    1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':  1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "01"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : chr "true"
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  ..$ 30d8e7a6-675a-4999-a120-62add06bff3c:'data.frame':    1 obs. of  1 variable:
  .. ..$ samples:List of 1
  .. .. ..$ :'data.frame':  1 obs. of  26 variables:
  .. .. .. ..$ sample_type_id                    : chr "01"
  .. .. .. ..$ updated_datetime                  : chr "2017-03-04T16:37:25.946840-06:00"
  .. .. .. ..$ time_between_excision_and_freezing: logi NA
  .. .. .. ..$ oct_embedded                      : chr "false"
  .. .. .. ..$ tumor_code_id                     : logi NA
  .. .. .. .. [list output truncated]
  .. [list output truncated]
 $ file_id: chr [1:10] "c5c4b4a3-3224-4a72-a883-c99c7747e47b" "dd029237-d470-4b58-9cf0-11753fa60972" "3fe677f6-8329-447c-b999-5e70582624aa" "619c9069-53a8-4581-92a3-be1896fe7f66" ...
 $ id     : chr [1:10] "c5c4b4a3-3224-4a72-a883-c99c7747e47b" "dd029237-d470-4b58-9cf0-11753fa60972" "3fe677f6-8329-447c-b999-5e70582624aa" "619c9069-53a8-4581-92a3-be1896fe7f66" ...
 - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
 - attr(*, "class")= chr [1:3] "GDCfilesResults" "GDCResults" "list"
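
The same kind of inspection can be tried offline. The following is a minimal sketch on an invented toy list (the names and values are assumptions, not GDC output) showing how max.level and list.len keep str output manageable.

```r
# Toy nested list; plain lists stand in for the real result object.
toy <- list(
  cases = list(
    "file-a" = list(samples = list(data.frame(sample_type_id = "01", is_ffpe = FALSE))),
    "file-b" = list(samples = list(data.frame(sample_type_id = "10", is_ffpe = FALSE)))
  ),
  id = c("file-a", "file-b")
)

# max.level limits how deep str() descends; list.len limits how many
# elements of each list are shown.
str(toy, max.level = 2, list.len = 5)

# Drilling down one element at a time reveals the embedded data frame.
str(toy$cases[["file-a"]]$samples[[1]])
```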

Note that each element of $cases has a "samples" data.frame embedded in it. One possible approach (of several) is to use the purrr package to do some further manipulation.

purrr::flatten(res$cases) %>% purrr::set_names(res$id) %>% purrr::flatten_df(.id = "file_id")

This flattens the cases list, sets the names of the resulting flattened list back to the file_id so that we don't lose track of which sample goes with which file, and then flattens the "rows" of the samples into one big data frame, assigning the names of the list (the file_ids) to the column specified by the .id argument. The results are:

                                file_id sample_type_id
1  c5c4b4a3-3224-4a72-a883-c99c7747e47b             03
2  dd029237-d470-4b58-9cf0-11753fa60972             10
3  3fe677f6-8329-447c-b999-5e70582624aa             01
4  619c9069-53a8-4581-92a3-be1896fe7f66             01
5  30d8e7a6-675a-4999-a120-62add06bff3c             01
6  17bac11f-78c2-4921-bb3a-03c5c1afbd37             01
7  17bac11f-78c2-4921-bb3a-03c5c1afbd37             10
8  725f5ede-f22b-4422-a9e0-66646538121d             01
9  c85a6f34-7b6b-4677-beac-44f06bcc5c32             01
10 acd76d89-a4d7-47ea-a1c2-480a3d200634             01
11 9c51ff3a-6c88-4f17-9c33-c630a6d10ea3             01
                   updated_datetime time_between_excision_and_freezing
1  2017-03-04T16:37:25.946840-06:00                                 NA
2  2017-03-04T16:37:25.946840-06:00                                 NA
3  2017-03-04T16:37:25.946840-06:00                                 NA
4  2017-03-04T16:37:25.946840-06:00                                 NA
5  2017-03-04T16:37:25.946840-06:00                                 NA
6  2017-03-04T16:37:25.946840-06:00                                 NA
7  2017-03-04T16:37:25.946840-06:00                                 NA
8  2017-03-04T16:37:25.946840-06:00                                 NA
9  2017-03-04T16:37:25.946840-06:00                                 NA
10 2017-03-04T16:37:25.946840-06:00                                 NA
11 2017-03-04T16:37:25.946840-06:00                                 NA
   oct_embedded tumor_code_id     submitter_id intermediate_dimension
1          <NA>            NA TCGA-AB-2904-03A                   <NA>
2         false            NA TCGA-EL-A3H5-10A                   <NA>
3          true            NA TCGA-IA-A83W-01A                   <NA>
4          true            NA TCGA-ZG-A9ND-01A                   <NA>
5         false            NA TCGA-QH-A870-01A                   <NA>
6          <NA>            NA TCGA-77-6842-01A                    0.8
7          <NA>            NA TCGA-77-6842-10A                   <NA>
8         false            NA TCGA-A8-A06Z-01A                   <NA>
9          true            NA TCGA-AN-A0XW-01A                   <NA>
10         <NA>            NA TCGA-DU-7014-01A                      1
11         true            NA TCGA-DD-A4NR-01A                   <NA>
                              sample_id is_ffpe
1  44992adb-cabf-4c2f-9f3b-45cf97531319   FALSE
2  fc5aa545-07cc-4ad8-9eba-5b5d4ea186fb   FALSE
3  2e4dfa77-839a-445d-beef-60b6396adf0c   FALSE
4  949f85dd-0d5d-4b3f-a7a9-a2ddc5becf3b   FALSE
5  c1c87e01-efc3-433f-ba79-4a79b292870b   FALSE
6  c6d0652b-d41a-4706-a5b2-5d86b10fae96   FALSE
7  1e94ef05-bdba-4811-b1da-d2380d0d5fbe   FALSE
8  993d2cba-b4f8-4a46-994b-b97bb9f10d34   FALSE
9  38bf35cd-b2f7-4532-9bff-d95cfe2cafd5   FALSE
10 050f26d9-b105-412a-9e5b-36840a1843e3   FALSE
11 71bc4fd0-374f-423e-a8b4-ae1bceceda83   FALSE
                  pathology_report_uuid created_datetime tumor_descriptor
1                                  <NA>               NA               NA
2                                  <NA>               NA               NA
3  10CCB12F-77E0-4100-A87A-0D36E5AF7F8B               NA               NA
4  89122E71-1246-44A5-9D44-4F95284EB02E               NA               NA
5  C84F6D0C-3879-4D63-8CE8-10D03D3C71A3               NA               NA
6  4f5574a0-e1d7-427c-8f03-d08cb0b264a4               NA               NA
7                                  <NA>               NA               NA
8  956F45E5-A8C6-4A4A-9D1F-D31912180584               NA               NA
9  5CBC6417-4E3D-4E9C-AE93-A56B777EF2F4               NA               NA
10 a848a22c-c92d-42cc-8e0b-8e260b1f5622               NA               NA
11 93133C58-B1BF-4A5C-8EFC-AE17EC6A0B64               NA               NA
                                       sample_type state current_weight
1  Primary Blood Derived Cancer - Peripheral Blood  live             NA
2                             Blood Derived Normal  live             NA
3                                    Primary Tumor  live             NA
4                                    Primary Tumor  live             NA
5                                    Primary Tumor  live             NA
6                                    Primary Tumor  live             NA
7                             Blood Derived Normal  live             NA
8                                    Primary Tumor  live             NA
9                                    Primary Tumor  live             NA
10                                   Primary Tumor  live             NA
11                                   Primary Tumor  live             NA
   composition time_between_clamping_and_freezing shortest_dimension tumor_code
1           NA                                 NA               <NA>         NA
2           NA                                 NA               <NA>         NA
3           NA                                 NA               <NA>         NA
4           NA                                 NA               <NA>         NA
5           NA                                 NA               <NA>         NA
6           NA                                 NA                0.5         NA
7           NA                                 NA               <NA>         NA
8           NA                                 NA               <NA>         NA
9           NA                                 NA               <NA>         NA
10          NA                                 NA                0.6         NA
11          NA                                 NA               <NA>         NA
   tissue_type days_to_sample_procurement freezing_method preservation_method
1           NA                         NA              NA                  NA
2           NA                         NA              NA                  NA
3           NA                         NA              NA                  NA
4           NA                         NA              NA                  NA
5           NA                         NA              NA                  NA
6           NA                         NA              NA                  NA
7           NA                         NA              NA                  NA
8           NA                         NA              NA                  NA
9           NA                         NA              NA                  NA
10          NA                         NA              NA                  NA
11          NA                         NA              NA                  NA
   days_to_collection initial_weight longest_dimension
1                  NA             NA              <NA>
2                1508             NA              <NA>
3                3607            140              <NA>
4                 283             80              <NA>
5                   1            450              <NA>
6                  NA             NA               1.2
7                  NA             NA              <NA>
8                1003            330              <NA>
9                 121            270              <NA>
10                 NA             NA                 1
11               2652            240              <NA>
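
For readers who want to try the idea without querying the GDC, here is a minimal sketch of the same flattening in base R rather than purrr. The toy `cases` list and all ids in it are invented; plain lists stand in for the one-row data frames seen in the str output.

```r
# Toy stand-in for res$cases: each file id maps to a "samples" list that
# holds one data frame of samples. The second file has two samples.
cases <- list(
  "file-a" = list(samples = list(data.frame(
    sample_type_id = "01", submitter_id = "S-01A",
    stringsAsFactors = FALSE))),
  "file-b" = list(samples = list(data.frame(
    sample_type_id = c("01", "10"), submitter_id = c("S-02A", "S-02B"),
    stringsAsFactors = FALSE)))
)

# For each file, stack its sample data frames and tag every row with the
# file id so the link between sample and file is not lost.
per_file <- lapply(names(cases), function(id) {
  samples <- do.call(rbind, cases[[id]]$samples)
  data.frame(file_id = id, samples, stringsAsFactors = FALSE)
})

# One data frame with one row per sample.
flat <- do.call(rbind, per_file)
flat
```

As in the output above, a file with more than one sample contributes more than one row, each carrying the same file_id.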

brisk022 commented 7 years ago

Thanks for the explanation! Could you add a similar example to the vignette? I think it is general enough and it would be very useful because a lot of data is nested quite deeply.