bmaitner / RBIEN

Tools for accessing the Botanical Information and Ecology Network (BIEN) database
http://bien.nceas.ucsb.edu/bien/

BIEN_trait_mean performance #16

Open achmurzy opened 4 years ago

achmurzy commented 4 years ago

I'm trying to pull as many trait means as possible for the following list of species: names.txt

Using vectorized versions of BIEN_trait_mean(vector_of_species_names, vector_of_traits) usually crashes my R console. I'm not sure if it's on the backend, but returning the list of trait IDs by default could be part of the issue. Maybe we could add a flag that makes returning the list of trait IDs optional? It greatly increases the size of the data frame that gets returned.

So what I'm doing now is querying means one by one:

    traits <- NULL
    for (species in species_list) {
      for (trait in trait_list) {
        new_trait <- BIEN_trait_mean(species, trait)
        traits <- rbind(traits, new_trait)
      }
    }

This isn't the 'R' way of doing it, but it works quickly, whereas vectorizing a list of 20 species crashes my console.
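A slightly more idiomatic form of the loop above collects each result in a list and binds once at the end, which avoids growing a data frame inside the loop. This is only a sketch: `collect_trait_means` and its `fetch` argument are hypothetical names, not part of RBIEN, and `fake_fetch` (along with the example species and trait names) is a made-up stand-in so the snippet runs without a database connection.

```r
# Hypothetical helper (not part of RBIEN): query one species/trait pair at a
# time and bind the results once at the end.
collect_trait_means <- function(species_list, trait_list, fetch) {
  pairs <- expand.grid(species = species_list, trait = trait_list,
                       stringsAsFactors = FALSE)
  results <- lapply(seq_len(nrow(pairs)), function(i) {
    # A failed query returns NULL so one bad pair does not abort the run.
    tryCatch(fetch(pairs$species[i], pairs$trait[i]),
             error = function(e) NULL)
  })
  # rbind skips NULL entries; the result is NULL if every query failed.
  do.call(rbind, results)
}

# Made-up stand-in for BIEN_trait_mean so the sketch runs offline.
fake_fetch <- function(species, trait) {
  data.frame(species = species, trait = trait, mean_value = NA_real_,
             stringsAsFactors = FALSE)
}

traits <- collect_trait_means(c("Pentaclethra macrophylla", "Acer rubrum"),
                              c("fruit type", "flower color"),
                              fetch = fake_fetch)
```

In real use one would presumably pass `BIEN_trait_mean` as `fetch`, assuming it keeps the one-species, one-trait call shape used in the loop.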

achmurzy commented 4 years ago

Okay, playing with this further I was able to determine that:

-The trait IDs aren't the problem, at least I don't think so.

-Rather, I didn't realize that BIEN_trait_mean is only intended to return one trait at a time. I had been inputting a vector of traits like so, to pull everything:

    trait_list <- BIEN_trait_list()
    BIEN_trait_mean(species, trait_list)

This returns the warning:

    In if (!trait %in% traits_available$trait_name) { :
      the condition has length > 1 and only the first element will be used

Then the returned traits all have the same value:

    1 Pentaclethra macrophylla 15.7878787878788 flower color
    2 Pentaclethra macrophylla 15.7878787878788 flower pollination syndrome cm
    3 Pentaclethra macrophylla 15.7878787878788 fruit type
    4 Pentaclethra macrophylla 15.7878787878788 inflorescence length cm
      level_used sample_size
    1     Family         533
    2     Family         533
    3     Family         533
    4     Family         533

I think it will be common for people to want to pull every trait and to call the function as I did above. Right now you have to write a for-loop to do it one at a time (which works great and is pretty fast). However, it might be better either to reject multiple traits passed to BIEN_trait_mean with an explicit error, or to make sure it supports vectorized trait lists.
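For context, the warning comes from `trait %in% traits_available$trait_name` producing one logical value per trait while `if` expects a single value; from R 4.2.0 onward the same call is an error rather than a warning. Either option suggested here could look roughly like the sketch below, which validates all traits up front and then calls a single-trait function once per trait. `trait_mean_each`, `fetch`, `fake_fetch`, and the `available` vector are hypothetical illustrations, not RBIEN code.

```r
# Hypothetical wrapper (not part of RBIEN): validate the traits up front, then
# call a single-trait function once per trait.
trait_mean_each <- function(species, traits, available, fetch) {
  known <- traits %in% available          # one logical per requested trait
  if (!all(known)) {                      # collapse to a single TRUE/FALSE
    stop("Unknown trait(s): ", paste(traits[!known], collapse = ", "))
  }
  do.call(rbind, lapply(traits, function(tr) fetch(species, tr)))
}

# Made-up stand-in for a single-trait query such as BIEN_trait_mean.
fake_fetch <- function(species, trait) {
  data.frame(species = species, trait = trait, stringsAsFactors = FALSE)
}

# Made-up list of valid trait names, standing in for the real trait list.
available <- c("fruit type", "flower color", "inflorescence length")
out <- trait_mean_each("Pentaclethra macrophylla",
                       c("fruit type", "flower color"),
                       available, fake_fetch)
```

Wrapping the membership test in `all()` is what removes the length > 1 condition; whether to loop or to stop on multi-trait input is a design choice for the maintainers.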

-Finally, querying DBH also tends to be extremely slow, as you suggested, and I think you're right that this is what crashes the console. In particular, calculating mean DBH at the Family level could be drawing many thousands of records without being very informative. The trait 'whole plant height' seems to behave the same way. The R process gets 'Killed', probably because the SQL query returns far too many records. Maybe DBH data should only be available through the stem.R module? These traits take more than 15 minutes to query and then eventually crash the console, so maybe higher-density measurements need some special treatment. The other traits return values in less than 30 seconds.
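Until the slow traits get special treatment, one possible client-side workaround is to cap how long each query may run and skip the ones that exceed the cap. The sketch below uses base R's `setTimeLimit`; `with_time_limit` is a hypothetical helper, and `Sys.sleep` stands in for a slow query. Two caveats: R only checks the limit at points where a user interrupt could occur, so a call blocked inside compiled database code may not be interrupted, and a time limit does nothing about memory use, which may be what actually gets the process 'Killed'.

```r
# Hypothetical helper: evaluate `expr`, giving up after `seconds` of elapsed
# time. Note that any error, not only a timeout, is turned into NULL.
with_time_limit <- function(expr, seconds) {
  setTimeLimit(elapsed = seconds, transient = TRUE)
  # Always lift the limit again, whether or not the expression finished.
  on.exit(setTimeLimit(elapsed = Inf, transient = FALSE))
  tryCatch(expr, error = function(e) NULL)
}

quick <- with_time_limit(1 + 1, seconds = 5)             # finishes in time
slow  <- with_time_limit({ Sys.sleep(3); "done" }, 0.2)  # stand-in for a slow query
```

A call such as `with_time_limit(BIEN_trait_mean(species, trait), 60)` would then return `NULL` for traits that take too long, assuming the underlying query can be interrupted at all.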