TIBHannover / BacDiveR

Unofficial R client for the DSMZ's Bacterial Diversity Metadatabase (former contact: @katrinleinweber). https://api.bacdive.dsmz.de/client_examples seems to be the official alternative.
https://TIBHannover.GitHub.io/BacDiveR/
MIT License

Converting retrieve_data() results to a data frame (tibble) #100

Open jfy133 opened 5 years ago

jfy133 commented 5 years ago

First I want to say thank you for this package, I'm working on some metagenomic data with lots of 'unusual' taxa, and trying to find a good (accessible) database to get a quick summary of characteristics of these has been surprisingly difficult.

This package saved me a lot of headaches trying to 'manually' parse the API search results myself.

I have neither a bug nor feature request, rather just some info which might be useful for others.

You can use a sequence of tidyverse tools to convert the results from the BacDiveR::retrieve_data() function to a clean(ish) table format using the following code:

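## load the packages used below (dplyr: %>%, bind_rows, distinct; tidyr: gather, separate)
library(dplyr)
library(tidyr)

## get some search results
data_bacdive_raw <- BacDiveR::retrieve_data("Fusobacterium", searchType = "taxon")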

## convert list of lists to tibble
data_bacdive_tib <- data_bacdive_raw %>% 
  unlist() %>% 
  bind_rows() %>% 
  gather(grouped_category, value, 1:ncol(.)) %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field", "key")) %>%
  distinct()

#>Warning message:
#>Expected 4 pieces. Missing pieces filled with `NA` in 144 rows [72, 73, 74, 75, 76, 77, 156, 157, 158, 159, 160, 161, 250, 251, 252, 253, 254, 255, 461, 462, ...]. 

## print final table
data_bacdive_tib

#># A tibble: 18,555 x 6
#>   bacdive_id section       subsection      field              key   value           
#>   <chr>      <chr>         <chr>           <chr>              <chr> <chr>           
#> 1 2654       taxonomy_name strains_tax_PNU species_epithet    NA    mortiferum      
#> 2 2654       taxonomy_name strains_tax_PNU subspecies_epithet NA    NA              
#> 3 2654       taxonomy_name strains_tax_PNU is_type_strain     NA    FALSE           
#> 4 2654       taxonomy_name strains_tax_PNU domain             NA    Bacteria        
#> 5 2654       taxonomy_name strains_tax_PNU phylum             NA    Fusobacteria    
#> 6 2654       taxonomy_name strains_tax_PNU class              NA    Fusobacteriia   
#> 7 2654       taxonomy_name strains_tax_PNU ordo               NA    NA              
#> 8 2654       taxonomy_name strains_tax_PNU family             NA    Fusobacteriaceae
#> 9 2654       taxonomy_name strains_tax_PNU status_fam         NA    NA              
#>10 2654       taxonomy_name strains_tax_PNU genus              NA    Fusobacterium   

As far as I can see from the table above, the only issue is that the references field is not split correctly: its field names end up in the subsection column rather than the field column (hence the 'NA' warning), because in the original results references is a data frame rather than a nested list.

This worked for me using BacDiveR_0.7.0

katrinleinweber commented 5 years ago

Cool, thank you for this hint! Having thought about the output format in #28 & #31 a bit already, I'm happy to collect more voices / votes on this, or review a PR to make this output the default.

katrinleinweber commented 5 years ago

If that NA problem in the reference dataframe (and possibly others) can be solved, that is.

Is your grouping into = c("bacdive_id", "section", "subsection", "field", "key") very specific to your application or data analysis? Or do you consider it general?

jfy133 commented 5 years ago

I think the reference metadata can be fixed when converting to a table (based on a condition of the object in the cell before un-nesting), but I personally don't need that information at the moment so I didn't invest time in solving it.

The grouping was selected based on the names as defined in the description of various example search outputs (e.g. https://bacdive.dsmz.de/api/bacdive/bacdive_id/1/) that I checked. I also tried providing extra columns for separate to spread over, but I never needed more than 4 metadata columns (after the bacdive_id).

I have only done fuzzy taxon name searches though (e.g. search term "Fusobacterium"), I'm not familiar with the rest of the database so I don't know if any other metadata can appear.

But in terms of votes, I personally always prefer easily accessible 'tidy' data ;).

Edit: the only issue with converting to a tibble using the above code is that it can sometimes take a while if you have many bacdive IDs. I don't know whether speed optimisation is important for this package, but if so, one would maybe have to switch away from tidyverse functions (and convert to a tibble only after unnesting and separating).
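
A minimal base R sketch of what "switching away from tidyverse functions" could look like, for illustration only. The helper name flatten_to_df and the toy input are assumptions (neither is part of BacDiveR), the values are made up, and nothing here has been benchmarked against the pipe version:

```r
# Hypothetical helper (not part of BacDiveR): flatten a nested list into a
# long data.frame using base R only.
flatten_to_df <- function(x, cols = c("bacdive_id", "section", "subsection", "field")) {
  flat <- unlist(x)                                  # nested names are joined with "."
  parts <- strsplit(names(flat), ".", fixed = TRUE)  # split each name into its pieces
  # Pad short names with NA on the right (extra pieces, if any, are dropped)
  padded <- lapply(parts, function(p) {
    length(p) <- length(cols)
    p
  })
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- cols
  out$value <- unname(flat)
  out
}

# Toy input with made-up values, standing in for retrieve_data() output
toy <- list(
  "1" = list(
    morphology = list(colony = list(colour = "white")),
    references = list(ID_reference = 5L)
  )
)

toy_df <- flatten_to_df(toy)
toy_df
```

The result is a plain data.frame, so it could still be handed to tibble::as_tibble() at the very end if a tibble is wanted.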

katrinleinweber commented 5 years ago

Thanks for the additional info :-) Speed is indeed a consideration, but in all my measurements so far, BacDive's server was the bottleneck. Until they speed it up, I wouldn't be worried about something like your above %>%-line example ;-)

Looking into these NAs, I find that for example the ID_reference field appears in several nesting "depths":

> str(data_bacdive_raw[["2654"]][["strain_availability"]][["strain_history"]])
'data.frame':   1 obs. of  2 variables:
 $ history     : chr "<- ATCC <- L.DS. Smith, VPI 2488 <- H. Beerens, PCL"
 $ ID_reference: int 626

> str(data_bacdive_raw[["2654"]][["references"]])
'data.frame':   3 obs. of  2 variables:
 $ ID_reference: int  626 20215 20218
 $ reference   : chr  "Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295" "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria "| __truncated__ "Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for"| __truncated__

[Screenshot 1: screen shot 2018-11-30 at 21 15 14]

This causes a "left-/up-ward shift/creep" of the NAs in the tibble:

[Screenshot 2: screen shot 2018-11-30 at 21 13 21]

Do you mean this with "converting to a table (based on a condition of the object in the cell before un-nesting)"?
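
The shift can be reproduced without the server on a toy list. This is only a sketch: toy_entry mimics the two str() outputs above with shortened, partly made-up values, and it assumes the real results nest in the same way. It counts how many "."-separated pieces each name gets after unlist():

```r
# Toy stand-in for one entry; structure copied from the str() output above,
# text values shortened or made up
toy_entry <- list(
  "2654" = list(
    strain_availability = list(
      strain_history = data.frame(
        history = "<- ATCC",
        ID_reference = 626L,
        stringsAsFactors = FALSE
      )
    ),
    references = data.frame(
      ID_reference = c(626L, 20215L, 20218L),
      reference = c("ref A", "ref B", "ref C"),
      stringsAsFactors = FALSE
    )
  )
)

flat_names <- names(unlist(toy_entry))

# Number of "."-separated pieces per flattened name
n_pieces <- lengths(strsplit(flat_names, ".", fixed = TRUE))

data.frame(name = flat_names, pieces = n_pieces)
```

Names under strain_availability have one more piece than names under references, because the references data frame sits one level higher in the list. separate() therefore runs out of pieces for the references rows and fills the last column with NA.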

jfy133 commented 5 years ago

Indeed - the server is for an average search still the slowest thing, taking longer than the 'table-isation' itself.

Yes, screenshot 2 is exactly what I mean.

I realise now I shouldn't have used the term 'unnesting' as that isn't what I actually meant. I actually meant that the

separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field", "key")) %>%

could be conditional e.g. if the second field of the unlisted string grouped_category matches "references" (like in lines 15-17), this could be separated across just c("bacdive_id", "section", "field").

This would at least match the description here: https://bacdive.dsmz.de/api/bacdive/bacdive_id/2654/.

jfy133 commented 5 years ago

I just realised the 'key' column is left over from testing (before I renamed the columns to the bacdive categories). Only lines 15-17 are the issue. Thus this should have the correct columns and also include the condition for correcting the references lines:

## get some search results
data_bacdive_raw <- BacDiveR::retrieve_data("Fusobacterium", searchType = "taxon")

## original pipe for converting list of lists to tibble
data_bacdive_tib <- data_bacdive_raw %>% 
  unlist() %>% 
  bind_rows() %>% 
  gather(grouped_category, value, 1:ncol(.)) %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field"))

## shows faulty reference column incorrectly putting field in subsection
data_bacdive_tib %>% filter(is.na(field))

#># A tibble: 144 x 5
#>   bacdive_id section    subsection   field value                                                                                                                           
#>   <chr>      <chr>      <chr>        <chr> <chr>                                                                                                                           
#> 1 2654       references ID_referenc… NA    626                                                                                                                             
#> 2 2654       references ID_referenc… NA    20215                                                                                                                           
#> 3 2654       references ID_referenc… NA    20218                                                                                                                           
#> 4 2654       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295               
#> 5 2654       references reference2   NA    "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea, va…
#> 6 2654       references reference3   NA    Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganisms.…
#> 7 5758       references ID_referenc… NA    9019                                                                                                                            
#> 8 5758       references ID_referenc… NA    20215                                                                                                                           
#> 9 5758       references ID_referenc… NA    20218                                                                                                                           
#>10 5758       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699   
#> # ... with 134 more rows

## now fix the references field
data_bacdive_tib_fixed <- data_bacdive_tib %>% 
  mutate(field = if_else(section == "references", subsection, field),
         subsection = if_else(section == "references", NA_character_, subsection))

## to show ID_references now correctly not in subsection

data_bacdive_tib_fixed %>% filter(is.na(field))

#> # A tibble: 0 x 5
#> # ... with 5 variables: bacdive_id <chr>, section <chr>, subsection <chr>, field <chr>, value <chr>

data_bacdive_tib_fixed %>% filter(is.na(subsection))

## shows ID_references now correctly in field
#># A tibble: 144 x 5
#>   bacdive_id section    subsection field      value                                                                                                                        
#>   <chr>      <chr>      <chr>      <chr>      <chr>                                                                                                                        
#> 1 2654       references NA         ID_refere… 626                                                                                                                          
#> 2 2654       references NA         ID_refere… 20215                                                                                                                        
#> 3 2654       references NA         ID_refere… 20218                                                                                                                        
#> 4 2654       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295            
#> 5 2654       references NA         reference2 "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea,…
#> 6 2654       references NA         reference3 Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganis…
#> 7 5758       references NA         ID_refere… 9019                                                                                                                         
#> 8 5758       references NA         ID_refere… 20215                                                                                                                        
#> 9 5758       references NA         ID_refere… 20218                                                                                                                        
#>10 5758       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699           
#># ... with 134 more rows

Apologies for the confusion. I should've put in my original message the caveat: written after dealing with a teething baby all day, may not make 100% sense.

katrinleinweber commented 5 years ago

Note to self: https://github.com/ropensci/roadoi#whats-returned may be a useful example to check, also their list-column use.
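
For comparison, a rough sketch of what a list-column could look like for the references, assuming one row per bacdive ID. It uses a base R data.frame to stay dependency-free (tibbles support list-columns as well). The object names are hypothetical, and only the ID_reference values shown earlier in this thread are used:

```r
# Per-entry reference tables (ID_reference values taken from the output above)
refs_2654 <- data.frame(ID_reference = c(626L, 20215L, 20218L), stringsAsFactors = FALSE)
refs_5758 <- data.frame(ID_reference = c(9019L, 20215L, 20218L), stringsAsFactors = FALSE)

# One row per entry; the references stay intact as a data frame inside a list-column
entries <- data.frame(bacdive_id = c("2654", "5758"), stringsAsFactors = FALSE)
entries$references <- list(refs_2654, refs_5758)

# Each cell of the list-column is still a full data frame
entries$references[[2]]$ID_reference

# References shared by both entries
shared_refs <- intersect(entries$references[[1]]$ID_reference,
                         entries$references[[2]]$ID_reference)
shared_refs
```

With this layout the references never pass through unlist(), so the NA shift discussed above would not arise for them.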