BiologicalRecordsCentre / sparta

Species Presence/Absence R Trends Analyses

http://biologicalrecordscentre.github.io/sparta/index.html

MIT License

Memory optimization for large datasets #241

Open JHHatfield opened 2 years ago

JHHatfield commented 2 years ago

Some of the functions have relatively high memory requirements (e.g. formatOccData and occDetFunc). From what I can see, these are mostly caused by the cast and merge steps. I have replaced some of the reshape2 functions in formatOccData with their data.table equivalents. This seems to reduce the memory requirement and works as a small fix, but doing the same comprehensively across the package looks complex.
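For anyone wanting to check which steps dominate, here is a rough sketch of how peak memory could be compared between implementations using only base R. The helper name peak_mb is hypothetical and not part of sparta, and the figure it returns is approximate because R only updates its "max used" counters during garbage collection.

```r
# Hypothetical helper (not part of sparta): approximate peak memory, in Mb,
# allocated while evaluating an expression.
peak_mb <- function(expr) {
  before <- gc(reset = TRUE)          # collect, and reset the "max used" counters
  baseline <- sum(before[, 2])        # Mb currently in use (Ncells + Vcells)
  force(expr)                         # evaluate the lazily passed expression
  after <- gc()
  # The last column of gc() output is "max used" in Mb
  sum(after[, ncol(after)]) - baseline
}

# Example: memory needed to allocate a vector of one million doubles
x <- peak_mb(numeric(1e6))
```

Wrapping the reshape2 and data.table versions of a cast step in the same helper would give a like-for-like, if coarse, comparison.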

AugustT commented 2 years ago

@JHHatfield Nice to see you are still swimming these waters. It would be good to keep a record of where else you think these changes are needed, to help any future overhaul that moves the package over to data.table functions. Also, would you like to make a pull request with the changes you have already made?

JHHatfield commented 2 years ago

I have submitted my quick fix for formatOccData, which deals with the size limit hit by the reshape2 version of dcast. The issue is that data.table functions work on data tables rather than data frames, and the syntax differences mean a lot of changes would be needed for a full overhaul. I got around it here by using setDT and then setDF to go from data frame to data table and back. I suppose the question is whether the memory usage is a big enough problem to warrant such changes.
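A minimal sketch of the setDT/setDF round trip described above, for illustration only. The toy temp data frame is a hypothetical stand-in for the visit-by-species table built inside formatOccData, and this is not the code from the pull request.

```r
library(data.table)

# Hypothetical toy data: one row per species recorded on a visit
temp <- data.frame(
  visit = c("v1", "v1", "v2", "v3"),
  species_name = c("a", "b", "a", "c"),
  pres = TRUE,
  stringsAsFactors = FALSE
)

setDT(temp)  # convert to a data.table by reference, without copying

# data.table's dcast: one row per visit, one column per species
spp_vis <- dcast(temp, visit ~ species_name, value.var = "pres", fill = FALSE)

# Convert both objects back to plain data frames, again by reference,
# so the rest of the function can carry on using data frame syntax
setDF(temp)
setDF(spp_vis)
```

Because both conversions happen by reference, the round trip itself should add very little memory overhead on top of the cast.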

03rcooke commented 1 year ago

The alternative option would be to use tidyverse functions (e.g. dplyr and tidyr) to replace reshape2::dcast. This would likely be less memory intensive than reshape2::dcast, but more memory intensive than data.table. However, it would be much easier to implement, as it works directly with data frames.

Something like:

spp_vis <- dplyr::arrange(temp, species_name) %>% 
    tidyr::pivot_wider(names_from = species_name, values_from = pres, values_fill = FALSE) %>% 
    dplyr::arrange(visit)

rather than spp_vis <- reshape2::dcast(temp, formula = visit ~ species_name, value.var = "pres", fill = FALSE, fun.aggregate = unique)

I'm not sure the arrange calls are strictly necessary, but this way the outputs are identical.
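To make the tidyverse suggestion easy to try, here is a self-contained sketch run on a hypothetical toy temp data frame (the real one is built inside formatOccData). It assumes each visit and species pair appears at most once, and it does not run the reshape2 version for comparison.

```r
library(dplyr)  # also provides the %>% pipe
library(tidyr)

# Hypothetical toy data: one row per species recorded on a visit
temp <- data.frame(
  visit = c("v1", "v1", "v2", "v3"),
  species_name = c("a", "b", "a", "c"),
  pres = TRUE,
  stringsAsFactors = FALSE
)

spp_vis <- dplyr::arrange(temp, species_name) %>%   # fixes the column order
  tidyr::pivot_wider(names_from = species_name,
                     values_from = pres,
                     values_fill = FALSE) %>%       # absent species become FALSE
  dplyr::arrange(visit)                             # fixes the row order

# pivot_wider returns a tibble; convert if a plain data frame is needed downstream
spp_vis <- as.data.frame(spp_vis)
```

Sorting by species_name first is what puts the species columns in alphabetical order, since pivot_wider creates columns in order of first appearance.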

JHHatfield commented 1 year ago

Sounds good, I will have a look. The quick fix for formatOccData works pretty well, but when I started looking at occDetFunc it became clear the same approach is not really going to work there. I will look at how memory usage changes when switching occDetFunc over to tidyverse functions. Although, what is the plan for the function going forward if you are bringing NIMBLE in?

03rcooke commented 1 year ago

I think most of the code in occDetFunc will stay the same; there will just be an option to run the model in NIMBLE rather than JAGS. So I think it's worth thinking about how we could reduce the memory use and increase the speed of all the old reshape2 bits of code. There's also the tidyfast package (https://github.com/TysonStanley/tidyfast), which has the dt_pivot_wider() function. I think it basically runs data.table::dcast, but fits more neatly into a pipeline that uses data frames.