Open jfy133 opened 5 years ago
Cool, thank you for this hint! Having thought about the output format in #28 & #31 a bit already, I'm happy to collect more voices / votes on this, or review a PR to make this output the default.
If that `NA` problem in the `reference` dataframe (and possibly others) can be solved, that is.
Is your grouping `into = c("bacdive_id", "section", "subsection", "field", "key")` very specific to your application or data analysis? Or do you consider it general?
I think the `reference` metadata can be fixed when converting to a table (based on a condition of the object in the cell before un-nesting), but I personally don't need that information at the moment, so I didn't invest time in solving it.
The grouping was selected based on the names as defined in the description of various example search outputs (e.g. https://bacdive.dsmz.de/api/bacdive/bacdive_id/1/) that I checked. I also tried providing extra columns for `separate` to spread over, but I never needed more than 4 metadata columns (after the `bacdive_id`).
I have only done fuzzy taxon name searches though (e.g. search term "Fusobacterium"). I'm not familiar with the rest of the database, so I don't know if any other metadata can appear.
But in terms of votes, I personally always prefer easily accessible 'tidy' data ;).
Edit: the only issue with converting to a tibble using the above code is that it can sometimes take a while if you have many bacdive IDs. I don't know whether speed optimisation is important for this package, but if so, one would maybe have to switch away from `tidyverse` functions (and convert to a tibble after unnesting and separating).
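For anyone curious what "switching away from tidyverse functions" might look like, here is a minimal base R sketch. It runs on a hypothetical toy list (`raw`) rather than real `BacDiveR::retrieve_data()` output, the reference strings are placeholders, and the helper `to_row()` is made up for illustration. Whether this is actually faster than the pipe is an assumption that would need measuring.

```r
# Hypothetical stand-in for one retrieved entry (NOT real API output)
raw <- list("2654" = list(
  strain_availability = list(
    strain_history = data.frame(history = "<- ATCC", ID_reference = 626L,
                                stringsAsFactors = FALSE)),
  references = data.frame(ID_reference = c(626L, 20215L),
                          reference = c("placeholder A", "placeholder B"),
                          stringsAsFactors = FALSE)))

# Flatten; the nested names are joined with "." by unlist()
flat <- unlist(raw)
parts <- strsplit(names(flat), ".", fixed = TRUE)

# Hypothetical helper: three-piece names (e.g. references) get NA as subsection
to_row <- function(p) {
  if (length(p) == 3) c(p[1:2], NA_character_, p[3]) else p[1:4]
}
mat <- do.call(rbind, lapply(parts, to_row))
colnames(mat) <- c("bacdive_id", "section", "subsection", "field")

# Build the table in base R and convert to a tibble only at the very end
tab <- data.frame(mat, value = unname(flat), stringsAsFactors = FALSE)
tab_tib <- tibble::as_tibble(tab)
```

The only tidyverse dependency left is the final `tibble::as_tibble()` call, which could be dropped if a plain data frame is enough.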
Thanks for the additional info :-) Speed is indeed a consideration, but in all my measurements so far, BacDive's server was the bottleneck. Until they speed it up, I wouldn't be worried about something like your above `%>%`-line example ;-)
Looking into these `NA`s, I find that for example the `ID_reference` field appears in several nesting "depths":
> str(data_bacdive_raw[["2654"]][["strain_availability"]][["strain_history"]])
'data.frame': 1 obs. of 2 variables:
 $ history     : chr "<- ATCC <- L.DS. Smith, VPI 2488 <- H. Beerens, PCL"
 $ ID_reference: int 626
> str(data_bacdive_raw[["2654"]][["references"]])
'data.frame': 3 obs. of 2 variables:
 $ ID_reference: int 626 20215 20218
 $ reference   : chr "Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295" "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria "| __truncated__ "Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for"| __truncated__
This causes a "left-/up-ward shift/creep" of the `NA`s in the tibble:
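A toy reproduction of that shift, for anyone without API access. This is only a sketch: `raw` is a hypothetical stand-in shaped like the `str()` output above, and the reference strings are placeholders, not real database content.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for one entry; same nesting as the str() output above
raw <- list(
  "2654" = list(
    strain_availability = list(
      strain_history = data.frame(
        history = "<- ATCC <- L.DS. Smith, VPI 2488 <- H. Beerens, PCL",
        ID_reference = 626L,
        stringsAsFactors = FALSE
      )
    ),
    references = data.frame(
      ID_reference = c(626L, 20215L),
      reference = c("placeholder reference A", "placeholder reference B"),
      stringsAsFactors = FALSE
    )
  )
)

# unlist() joins the nested names with ".", so the two ID_reference fields
# end up with a different number of name pieces
flat <- unlist(raw)
names(flat)

# Splitting into four columns leaves the three-piece names one column short;
# fill = "right" pads with NA on the right, which is where `field` sits
toy_tib <- tibble(grouped_category = names(flat), value = unname(flat)) %>%
  separate(grouped_category, sep = "\\.",
           into = c("bacdive_id", "section", "subsection", "field"),
           fill = "right")

toy_tib %>% filter(is.na(field))
```

In this toy case only the `references` rows should come back from the final filter, with the field name sitting in `subsection`.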
Do you mean this with "converting to a table (based on a condition of the object in the cell before un-nesting)"?
Indeed, for an average search the server is still the slowest part, taking longer than the 'table-isation' itself.
Yes, screenshot 2 is exactly what I mean.
I realise now I shouldn't have used the term 'unnesting' as that isn't what I actually meant. I actually meant that the `separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field", "key")) %>%` step could be conditional, e.g. if the second field of the unlisted string `grouped_category` matches "references" (like in lines 15-17), this could be separated across just `c("bacdive_id", "section", "field")`.
This would at least match the description here: https://bacdive.dsmz.de/api/bacdive/bacdive_id/2654/.
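One way the conditional `separate()` idea could look, as a minimal sketch on toy data. The name strings follow the pattern `unlist()` produces, the values are the reference IDs quoted earlier in the thread, and the helper `is_reference()` is hypothetical, not part of BacDiveR or the tidyverse.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the gathered table (one deep field, two reference fields)
toy <- tibble(
  grouped_category = c(
    "2654.strain_availability.strain_history.ID_reference",
    "2654.references.ID_reference1",
    "2654.references.ID_reference2"
  ),
  value = c("626", "626", "20215")
)

# Hypothetical helper: TRUE when the second name piece is "references"
is_reference <- function(x) grepl("^[^.]+\\.references\\.", x)

# Reference rows only have three name pieces, so split them into three columns
refs <- toy %>%
  filter(is_reference(grouped_category)) %>%
  separate(grouped_category, sep = "\\.",
           into = c("bacdive_id", "section", "field"))

# Everything else keeps the four-column split
rest <- toy %>%
  filter(!is_reference(grouped_category)) %>%
  separate(grouped_category, sep = "\\.",
           into = c("bacdive_id", "section", "subsection", "field"))

# bind_rows() fills the missing subsection column of `refs` with NA;
# note the row order changes (non-reference rows come first)
toy_tidy <- bind_rows(rest, refs)
```

Splitting the table in two keeps each `separate()` call simple, at the cost of losing the original row order.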
I just realised the 'key' column is leftover from testing (before I renamed the columns to the bacdive categories). Only lines 15-17 are the issue. Thus this should have the correct columns and also have the condition for correcting the references lines:
## load the tidyverse packages the pipe relies on
library(dplyr)
library(tidyr)
## get some search results
data_bacdive_raw <- BacDiveR::retrieve_data("Fusobacterium", searchType = "taxon")
## original pipe for converting list of lists to tibble
data_bacdive_tib <- data_bacdive_raw %>%
  unlist() %>%
  bind_rows() %>%
  gather(grouped_category, value, 1:ncol(.)) %>%
  separate(grouped_category, sep = "\\.", into = c("bacdive_id", "section", "subsection", "field"))
## shows faulty reference column incorrectly putting field in subsection
data_bacdive_tib %>% filter(is.na(field))
#> # A tibble: 144 x 5
#>    bacdive_id section    subsection   field value
#>    <chr>      <chr>      <chr>        <chr> <chr>
#>  1 2654       references ID_referenc… NA    626
#>  2 2654       references ID_referenc… NA    20215
#>  3 2654       references ID_referenc… NA    20218
#>  4 2654       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295
#>  5 2654       references reference2   NA    "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea, va…
#>  6 2654       references reference3   NA    Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganisms.…
#>  7 5758       references ID_referenc… NA    9019
#>  8 5758       references ID_referenc… NA    20215
#>  9 5758       references ID_referenc… NA    20218
#> 10 5758       references reference1   NA    Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699
#> # ... with 134 more rows
## now fix the references field
data_bacdive_tib_fixed <- data_bacdive_tib %>%
  mutate(field = if_else(section == "references", subsection, field),
         subsection = if_else(section == "references", NA_character_, subsection))
## to show ID_references now correctly not in subsection
data_bacdive_tib_fixed %>% filter(is.na(field))
#> # A tibble: 0 x 5
#> # ... with 5 variables: bacdive_id <chr>, section <chr>, subsection <chr>, field <chr>, value <chr>
## shows ID_references now correctly in field
data_bacdive_tib_fixed %>% filter(is.na(subsection))
#> # A tibble: 144 x 5
#>    bacdive_id section    subsection field      value
#>    <chr>      <chr>      <chr>      <chr>      <chr>
#>  1 2654       references NA         ID_refere… 626
#>  2 2654       references NA         ID_refere… 20215
#>  3 2654       references NA         ID_refere… 20218
#>  4 2654       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 1295
#>  5 2654       references NA         reference2 "D.Gleim, M.Kracht, N.Weiss et. al.: Prokaryotic Nomenclature Up-to-date - compilation of all names of Bacteria and Archaea,…
#>  6 2654       references NA         reference3 Verslyppe, B., De Smet, W., De Baets, B., De Vos, P., Dawyndt P. StrainInfo introduces electronic passports for microorganis…
#>  7 5758       references NA         ID_refere… 9019
#>  8 5758       references NA         ID_refere… 20215
#>  9 5758       references NA         ID_refere… 20218
#> 10 5758       references NA         reference1 Leibniz Institut DSMZ-Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH; Curators of the DSMZ; DSM 20699
#> # ... with 134 more rows
Apologies for the confusion. I should've put in my original message the caveat: written after dealing with a teething baby all day, may not make 100% sense.
Note to self: https://github.com/ropensci/roadoi#whats-returned may be a useful example to check, also their list-column use.
First I want to say thank you for this package. I'm working on some metagenomic data with lots of 'unusual' taxa, and trying to find a good (accessible) database to get a quick summary of their characteristics has been surprisingly difficult.
This package saved me a lot of headaches trying to 'manually' parse the API search results myself.
I have neither a bug nor feature request, rather just some info which might be useful for others.
You can use a sequence of tidyverse tools to convert the results from the `BacDiveR::retrieve_data()` function to a clean(ish) table format using the following code:

As far as I can see with the table from the search above, the only issue is that the references field is not correctly formatted (it is placed in the subsection rather than the field column, hence the 'NA' messages), because in the original results it is a dataframe rather than a list itself.
This worked for me using BacDiveR_0.7.0