brunobrr / bdc

Check out the vignettes with detailed documentation on each module of the bdc package
https://brunobrr.github.io/bdc
GNU General Public License v3.0
23 stars 7 forks source link

Improve the runtime of `bdc_query_names_taxadb` #252

Closed black-snow closed 5 months ago

black-snow commented 1 year ago

Hey it's me again ;)

We use bdc a lot and atm I'm investigating why our tests take so long. I found that bdc_query_names_taxadb takes 15-16 seconds on my (decent) machine when it yields no result. Successful queries are way faster.

Example:

bdc::bdc_query_names_taxadb(sci_name='Triops granitica', rank_name="Animalia", rank="kingdom")

I didn't drill down into what actually happens there (yet). I guess that an exhaustive search is just way slower and that the code exits early on hit? Is there some way to speed things up in the former case?

I also noticed that there's an unused parameter parallel defaulting to FALSE (I was hoping I could just flick it and reap crazy speed-ups).

It might be a good idea to inject a mock in my tests anyway. But as I'm retrofitting the tests an (obvious) way to speed things up would come in handy.

kguidonimartins commented 1 year ago

We are glad that the bdc is useful for you.

Have you tried changing the default arguments to parallel=TRUE and ncores=parallel::detectCores()? This can speed up the search, but it doesn't make sense for just one species. Also, 15 seconds isn't that bad, IMHO. We're searching the entire GBIF database (when db="gbif"), so this might take a while.

start_time_single <- Sys.time()
bdc::bdc_query_names_taxadb(
  sci_name = "Triops granitica",
  rank_name = "Animalia",
  rank = "kingdom"
)
#> 
#> Querying using gbif database version 22.12
#> 
#>  A total of 0 NA was/were found in sci_name.
#> 
#>  1 names queried in 0.3 minutes
#> # A tibble: 1 × 21
#>   origin…¹ sugge…² dista…³ notes taxonID scien…⁴ taxon…⁵ taxon…⁶ accep…⁷ kingdom
#>   <chr>    <lgl>   <lgl>   <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 Triops … NA      NA      notF… <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
#> # … with 11 more variables: phylum <chr>, class <chr>, order <chr>,
#> #   family <chr>, genus <chr>, specificEpithet <chr>,
#> #   infraspecificEpithet <chr>, parentNameUsageID <chr>,
#> #   originalNameUsageID <chr>, scientificNameAuthorship <chr>,
#> #   vernacularName <chr>, and abbreviated variable names ¹​original_search,
#> #   ²​suggested_name, ³​distance, ⁴​scientificName, ⁵​taxonRank, ⁶​taxonomicStatus,
#> #   ⁷​acceptedNameUsageID
end_time_single <- Sys.time()
time_result_single <- difftime(time1 = end_time_single, time2 = start_time_single, units = "mins")
time_result_single
#> Time difference of 0.4094008 mins

start_time_parallel <- Sys.time()
bdc::bdc_query_names_taxadb(
  sci_name = "Triops granitica",
  rank_name = "Animalia",
  rank = "kingdom",
  parallel = TRUE,
  ncores = parallel::detectCores()
)
#> 
#> Querying using gbif database version 22.12
#> 
#>  A total of 0 NA was/were found in sci_name.
#> 
#>  1 names queried in 0.4 minutes
#> # A tibble: 1 × 21
#>   origin…¹ sugge…² dista…³ notes taxonID scien…⁴ taxon…⁵ taxon…⁶ accep…⁷ kingdom
#>   <chr>    <lgl>   <lgl>   <chr> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
#> 1 Triops … NA      NA      notF… <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
#> # … with 11 more variables: phylum <chr>, class <chr>, order <chr>,
#> #   family <chr>, genus <chr>, specificEpithet <chr>,
#> #   infraspecificEpithet <chr>, parentNameUsageID <chr>,
#> #   originalNameUsageID <chr>, scientificNameAuthorship <chr>,
#> #   vernacularName <chr>, and abbreviated variable names ¹​original_search,
#> #   ²​suggested_name, ³​distance, ⁴​scientificName, ⁵​taxonRank, ⁶​taxonomicStatus,
#> #   ⁷​acceptedNameUsageID
end_time_parallel <- Sys.time()
time_result_parallel <- difftime(time1 = end_time_parallel, time2 = start_time_parallel, units = "mins")
time_result_parallel
#> Time difference of 0.3824489 mins

Created on 2023-03-30 with reprex v2.0.2

black-snow commented 1 year ago

@kguidonimartins thanks for the quick reply and the test! No, for parallel is unused in the method body and ncores is only used when suggest names is used.

I didn't dig into what bdc does here behind the curtains but shouldn't a lookup without any fuzzy search be in O(1)? Or at least in O(log(n)) with, e.g., a skip list?

kguidonimartins commented 1 year ago

Hi @black-snow ,

I apologize, but I don't have much experience in computer science, especially with algorithms' complexity. I'm not sure if the other maintainers can help you with your question either.

However, we built bdc using what we believe are the best tools available (tidyverse and similar software). So, we're currently using the algorithms that are integrated with those tools.

I personally use bdc every day for my job, and it works really well for me. While we know that performance can be an issue, for now, I'm not too bothered by the 15-second wait time.

But we're always open to suggestions and improvements! If you have any ideas or recommendations, please feel free to submit a pull request. We would be thrilled if your expertise could help us improve the performance of bdc functions.

black-snow commented 1 year ago

Alright, gotta do some digging then - can we leave this open?

I expected an index over the names, for fast look-ups.
I quickly glanced over the code and it looks like you're loading complete data frames from CSV files. I don't know what use cases there are yet but if exact name searches are one of them it would make sense to keep an index. We could create it at build time or at install time. Something built-in as an RDS (?) file or sqlite or even just a sorted columnar skip list (or binary seach). Or a bloom (or cuckoo) filter for my exact case.

kguidonimartins commented 1 year ago

Alright, gotta do some digging then - can we leave this open?

For sure.

kguidonimartins commented 1 year ago

I quickly glanced over the code and it looks like you're loading complete data frames from CSV files.

There is no CSV reading. We are using the taxadb database to query species names, which uses duckdb under the hood.

kguidonimartins commented 5 months ago

hi @black-snow, any update on this matter?

black-snow commented 5 months ago

Hey @kguidonimartins nope, sorry. I don't see me having time to dig into this in the next couple of months, sadly. Please feel free to just close the issue for now.

kguidonimartins commented 5 months ago

thanks for the effort!