Blunde1 / agtboost

Adaptive and automatic gradient boosting computations.
MIT License
67 stars 11 forks

Memory leaks in gbt.load() #55

Open sondreus opened 2 years ago

sondreus commented 2 years ago

I think I have detected a memory leak in gbt.load. It could be that I'm failing to understand how R handles memory (and the imperfections of that process), but maybe adding a manual garbage collection in the C++ subroutine would be the way to solve this?

Minimal replication:

library(agtboost)

x <- runif(50000, 0, 10)
y <- rnorm(50000, x, 1)

mod <- gbt.train(y, as.matrix(x))
gbt.save(mod, 'gbt_model.gbt')

Memory usage (by the R process, not the objects in the R workspace) increases on my system by roughly 1 MB every time the line below is run:

mod <- gbt.load('gbt_model.gbt')
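
A quick way to watch this (a sketch, assuming a Unix-like system where ps is available) is to poll the R process's resident set size while re-loading in a loop:

# Hypothetical helper: resident memory of this R process in MB (via ps)
rss_mb <- function()
    as.numeric(system(paste("ps -o rss= -p", Sys.getpid()), intern = TRUE)) / 1024

for (i in 1:50) {
    mod <- gbt.load('gbt_model.gbt')
    if (i %% 10 == 0) cat("iteration", i, "- RSS:", rss_mb(), "MB\n")
}
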
eddelbuettel commented 2 years ago

Howdy. @sondreus had reached out to me as we have had a few earlier DMs about his most excellent model and its use of agtboost.

The package is in good standing on CRAN, and Prof Ripley keeps us all honest by running checks with the (very powerful) valgrind tool on his instances. I have been at the receiving end of such "valgrind detected a leak" notes myself.

And agtboost comes out clean as a whistle.

I can replicate that with a quick R -d valgrind -f issue55.R where the R file reproduces the demo above. (Even with the load() part in a loop for extra effort.) No leak ... in the sense of memory being allocated and allocated and allocated and never freed.

The fact that @sondreus sees increased memory usage may mean that R assigns new objects it keeps track of. It is tricky to disambiguate that. I'll think some more, but at a glance the package seems not to be doing anything wrong. Happy to help some more and chat, but no smoking gun as far as I can tell.

PS I always find it instructive to have a baseline. So putting, say, these two lines in a file and running them as above with valgrind does generate the output below (among much more output)

Rcpp::cppFunction("double leakMe() { double *p = new double[1000]; return p[0]; }")
leakMe()

yields

==4133893== LEAK SUMMARY:                                                                                
==4133893==    definitely lost: 8,000 bytes in 1 blocks                
==4133893==    indirectly lost: 0 bytes in 0 blocks                                                      
==4133893==      possibly lost: 0 bytes in 0 blocks                                                      
==4133893==    still reachable: 56,622,975 bytes in 11,188 blocks                                        
==4133893==                       of which reachable via heuristic:                                      
==4133893==                         newarray           : 4,264 bytes in 1 blocks
==4133893==         suppressed: 0 bytes in 0 blocks                                                      
==4133893== Reachable blocks (those to which a pointer was found) are not shown.
==4133893== To see them, rerun with: --leak-check=full --show-leak-kinds=all

We have none of that in agtboost. Which is good!

sondreus commented 2 years ago

Thanks @eddelbuettel!

Looking forward to hearing from @Blunde1, who might look into this too. My guess is that my language is imprecise, so there is no memory leak in the proper sense; it is just that when run many times, gbt.load takes up ever more memory (I got a colleague to replicate this to be sure), even though the previously loaded objects are no longer reachable from within R.

All the best, Sondre

eddelbuettel commented 2 years ago

Yes. R uses copy-on-write internally, so a changed object always becomes a new object (whereas a plain assignment A <- B does not yet create a new object for B but keeps a reference).
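
A minimal illustration, using base R's tracemem():

b <- runif(10)
tracemem(b)   # report whenever the object bound to 'b' gets duplicated
a <- b        # plain assignment: no copy yet, both names share one object
a[1] <- 0     # first modification duplicates the object; tracemem() prints it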

I had massive headaches with things like this too in the past (though maybe mostly in the distant past, on Windows, with smaller-memory systems), so I don't have a great set of recommendations. A simple rm(A); gc() always looks promising but in practice may not work.

In theory, "maybe" we could bring the loading of the model to the C++ side (where we have finer control) and use accessors for the data. In practice that may be harder because ... these model structures are larger. But I guess they are regular?

We could possibly try alternate serialization from the C++ side too. I'd be game for helping a little but I too would love to hear what @Blunde1 thinks as he of course knows his model inside out...

eddelbuettel commented 2 years ago

As the cliché goes, I just had some more thoughts about this while on the morning run (and in the shower, cliché of clichés).

To clarify: @sondreus, you have this issue when bootstrapping / looping over N model samples and fits, correct? I may have an idea or two (which do not require @Blunde1 to change anything), so maybe you and I can (and should) take this to a scrap repo to noodle around?

sondreus commented 2 years ago

@eddelbuettel Yes indeed. I've "solved" the issue by dividing this portion of the update process into two batches, moving the offending portion to a separate script, and running that in a new R process which closes upon completion (so that the rest of the update process won't run out of memory).

https://github.com/TheEconomist/covid-19-the-economist-global-excess-deaths-model/blob/main/scripts/0_excess_deaths_global_estimates_autoupdater_B.R

eddelbuettel commented 2 years ago

That is pretty much what I was about to suggest. I happen to use external scripts a lot and really like batching for R -- I wrote littler to give us r (at least on Linux; on macOS one has to rename it to lr or alike, as it thinks r == R thanks to the geniuses in Cupertino). And if one does not like that / is on a different OS, then Gábor et al. also wrote callr, which should do the same. The key is to run each i-th R job in its own process to really get the memory back.
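
A sketch with callr, for instance (hypothetical file and variable names); all memory goes back to the OS when the child process exits:

preds <- callr::r(function(model_file, newdata) {
    library(agtboost)
    mod <- gbt.load(model_file)   # loading happens in a fresh child R process
    predict(mod, newdata)
}, args = list("gbt_model.gbt", as.matrix(x)))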

Blunde1 commented 2 years ago

@sondreus Thanks, I am looking into this. I observe the same as you, and the memory also seems to increase when running mod <- gbt.train(y, as.matrix(x)) multiple times. I will "noodle around" a bit myself to try to find the cause (the only thing I can think of right now is perhaps improper use of the new keyword on the C++ side, or something similar...?) and a proper fix. For the time being, it seems you and @eddelbuettel have a workaround in place (nice to be aware of littler!).

@eddelbuettel thanks for the checks and running valgrind (and also Rcpp and its extensions)!

> In theory, "maybe" we could bring the loading of the model to the C++ side (where we have finer control) and use accessors for the data. In practice that may be harder because ... these model structures are larger. But I guess they are regular?
>
> We could possibly try alternate serialization from the C++ side too. I'd be game for helping a little but I too would love to hear what @Blunde1 thinks as he of course knows his model inside out...

I believe the (de-)serialization can be improved upon, and would love to hear your thoughts on it. The models are regular in the sense that they are deterministic and in essence consist of nodes that are easily traversable (a directed graph), so the same logic can be applied throughout. Do you know of any code/packages from which inspiration could/should be drawn? In particular, are there any "using accessors for the data" examples you could share?

eddelbuettel commented 2 years ago

Yes, it's a weird issue and it would be good to get a handle on it. We may still be doing something wrong, but apparently it is not something obvious.

Serialization is a fantastic (and complicated !!) topic. Per se you are doing nothing wrong, but of course writing "in text" is typically seen as the most expensive way (formatting on the way out, parsing on the way in). It is also human-readable and portable, so there is that on the plus side. If you are only serializing 'for yourself' here (i.e. agtboost writes and reads) then saveRDS / readRDS will be faster at the expense of human readability (and portability). I personally like fst an awful lot: it is very fast and backed by a library one could use elsewhere (directly from C++, to interface Python, ...). And then there are msgpack and feather and writing to a database (SQLite is good for this too) and more. You can go on and on.

But you don't have to. Simple is good too. For now, I think, we need to understand why/how the memory use grows. Maybe try another baseline where we just repeatedly assign a matrix in which we change one or two values (to trigger R's copy-on-write)? I would think that by assigning a new one in loop 'i' it would free / recover the memory from iteration 'i-1'. Why this does not happen here is odd. (And I think I am more puzzled about R here than about agtboost, but we should double-check.)
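
Something like this sketch, say:

m <- matrix(runif(1e7), ncol = 100)   # baseline object of approx. 80 MB
for (i in 1:50) {
    m2 <- m          # plain assignment: shared, no copy yet
    m2[1, 1] <- i    # copy-on-write: m2 becomes a new approx. 80 MB object
    rm(m2); gc()     # the copy from this iteration should be recovered here
}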

Blunde1 commented 2 years ago

@eddelbuettel Thanks! For the memory issues I wanted to see whether this was something that happened only with agtboost, or whether I would observe the same behaviour of increasing memory with a simpler C++ class exposed via Rcpp modules. I thus stole the yada example from the vignette:

// yada.cpp
#include <Rcpp.h>
class World {
public:
    World() : msg("hello") {}
    void set(std::string msg) { this->msg = msg; }
    std::string greet() { return msg; }
private:
    std::string msg;
};
RCPP_MODULE(yada){
    using namespace Rcpp;
    class_<World>("World")
        .constructor()
        .method("greet", &World::greet)
        .method("set", &World::set)
    ;
}

and then called the set() function from R, passing a very large string (approx. 1 GB) to the C++ object:

# yada.R
Rcpp::sourceCpp("yada.cpp")
n <- 2e8
long_string <- strrep("hello", n) # approx. 1 GB
# Running the next two lines multiple times, memory seems to be incremented
world <- new(World)
world$set(long_string)

From running the last two lines repeatedly I seem to experience the same memory issue. This is seen from visual inspection in RStudio, and from inspecting the rsession process in the Activity Monitor (macOS). I will see if this is replicated on other systems. Please also tell me if I am doing something wrong here; it would be interesting to know if you observe this behaviour as well.

If you indeed observe the above behaviour as well, I would assume it is not a problem with agtboost but rather some weird (because I do not understand it 😄) behaviour on the language's memory-handling side. Which seems to be aligned with your comments above.

eddelbuettel commented 2 years ago

That's exactly what's needed, and maybe also a possible fix I had not thought about.

The long story is that many years ago (but after I had started using R) the Core team changed the string representation and made it much better. (It was fairly awful before, and one of the reasons for the old bias towards matrices over data.frames or anything indexed by rownames, as those used to be expensive.) Strings got a better, hashed representation, but that representation is still different from numbers (int, double, ...), which sit consecutively in memory for a vector; CharacterVectors are essentially pointers to individual 'words' (the individual string entries).

Moreover, these are not shared between C++'s std::string and R's, whereas we do share int or double vectors! So I think when we invoke new(World) twice, R cannot and does not know it already has the vector, and creates a new one.

This may be different if we used Rcpp::CharacterVector. That is more of a promise and hunch from me than a firm guarantee.
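
One could probe that difference with something like this sketch (take_string / take_charvec are made-up names):

# std::string argument: the R string's payload is copied into C++-owned memory
Rcpp::cppFunction("int take_string(std::string s) { return s.size(); }")
# Rcpp::CharacterVector argument: a thin proxy onto the R object, no payload copy
Rcpp::cppFunction("int take_charvec(Rcpp::CharacterVector s) { return s.size(); }")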

But it gets us back to @sondreus's issue, because your serialization is ... vectors of strings, right? We might avoid it all if we went to storing what are really just numbers as numbers. That may be more work, and more fragile, so we should make sure it is worth it.
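
Purely as a hypothetical sketch of that idea: each tree node becomes a row of a numeric matrix (feature index, split value, child indices, leaf weight), and the whole model round-trips through binary I/O with no text formatting or parsing:

# One row per node; NA fields are unused for that node type (made-up layout)
tree <- matrix(c(
     1, 4.2,  2,  3,   NA,   # root: split on feature 1 at 4.2
    NA,  NA, NA, NA, -0.3,   # left leaf, row 2
    NA,  NA, NA, NA,  0.7    # right leaf, row 3
), ncol = 5, byrow = TRUE,
   dimnames = list(NULL, c("feat", "split", "left", "right", "weight")))
saveRDS(tree, "tree.rds")   # purely numerical payload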

But great existence proof above. This may lead us down a road to improve the issue.

Blunde1 commented 2 years ago

@eddelbuettel thanks, the long story and its insights are appreciated! I think there is no way I would have understood this on my own.

I changed the attribute of World (in the example above) to an Rcpp::NumericVector x, changed the World::set() function correspondingly, and from the R side passed a very large numeric vector. And now garbage collection (through gc()) indeed seems to work perfectly.
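
Roughly, the change looks like this (a sketch of the modification described above, inlined for convenience):

Rcpp::sourceCpp(code = '
#include <Rcpp.h>
class World {
public:
    World() {}
    void set(Rcpp::NumericVector x_) { this->x = x_; }
private:
    Rcpp::NumericVector x;  // a proxy onto the R object, not a C++-owned copy
};
RCPP_MODULE(yada){
    Rcpp::class_<World>("World")
        .constructor()
        .method("set", &World::set);
}
')
x_big <- runif(2e8)   # approx. 1.6 GB numeric vector
world <- new(World)
world$set(x_big)      # shared with R, so rm(world, x_big); gc() recovers it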

For agtboost, a model object holds the loss function as an std::string object, but this could (and probably should) just be an enum. All other attributes are numerical. I will take a look and perhaps do some refactoring so that the model objects are purely numerical.

eddelbuettel commented 2 years ago

Really nice work @Blunde1! I think the 'how strings are represented' is a bit of an insider story that I happened to have picked up on.

I'd be up for continuing to discuss and prototype (and help, once this vacation is over ...) a 'more efficient' serialization approach. This is essentially undocumented stuff, so if we can put up a better reference implementation it may help others.

eddelbuettel commented 2 years ago

Sorry for the silence, I had some time off. I just set up a very minimal check to see if our hypothesis concerning string representation holds. I made these two minimal changes to save and load as RDS if the filename ends in rds (very simplistic ...):

modified   R-package/R/gbt.load.R
@@ -23,7 +23,12 @@ gbt.load <- function(file)
 {
     # check valid file path

-    mod <- new(ENSEMBLE)
-    mod$load_model(file)
-    return(mod)
-}
\ No newline at end of file
+    if (grepl("rds$", file)) {
+        mod <- readRDS(file)
+        mod
+    } else {
+        mod <- new(ENSEMBLE)
+        mod$load_model(file)
+        mod
+    }
+}
modified   R-package/R/gbt.save.R
@@ -47,7 +47,10 @@ gbt.save <- function(gbt_model, file)
     if(length(error_messages)>0)
         stop(error_messages)

-    # Store model in file
-    gbt_model$save_model(file)
+    ## Store model in file
+    if (grepl("rds$", file))
+        saveRDS(gbt_model, file)
+    else
+        gbt_model$save_model(file)

 }

And based on a casual check of loading the (same) file fifty times, memory use seems stable, as it should be. Let me know if this helps; I can probably expand on it.
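
That check was essentially (a sketch, with a hypothetical file name):

gbt.save(mod, 'gbt_model.rds')                     # ends in rds, so routed to saveRDS()
for (i in 1:50) mod <- gbt.load('gbt_model.rds')   # readRDS() path; memory stays flat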

Blunde1 commented 2 years ago

@eddelbuettel thanks -- time off is time off, and it needs to be spent wisely, I am realizing more and more. Here we just had our few good days of summer weather, which warranted time outside.

I will try this as soon as possible. It would be amazing if this indeed works! I will still try to change the model class as intended by introducing an enum loss-function attribute, and some general refactoring, but perhaps that can be done after introducing your solution as it might very well resolve the issue. 👏

As a side note, I remember "studying up" on serialization, in particular for C++ objects from R, and I thought that I had unsuccessfully tried the saveRDS/readRDS approach at that time. Thus I am very glad if this indeed works, and I appreciate the help immensely. 😄

eddelbuettel commented 2 years ago

It's easy to get lost in the weeds there. I have only accessed RDS input/output from R. But as you mention it: I once needed it for Redis, so I extracted the (otherwise 'hidden' and unexported) R code in this repo (and on CRAN), but we should not need this. You know your model data much better than I do, and I have tried a few tricks over the years with serialization -- if RDS works for you, I'd stick with that for now.

Blunde1 commented 2 years ago

@eddelbuettel I created an early PR based on your suggestion here

I get an error when running the suggested code:

> mod <- gbt.load('gbt_model.rds')
> mod
Error in sprintf("C++ object <%s> of class '%s' <%s>", externalptr_address(pointer),  : 
  external pointer is not valid

which seems similar to my earlier attempts, I think.

I googled a little and stumbled upon (again) the ever-helpful Rcpp Gallery, with this post on serialization of C++ objects. Perhaps I should do this first and then call saveRDS/readRDS? I would be surprised if this did not work, but it would require further dependencies...

While thinking about the serialization I will continue to fix the string issues in #58 as I believe this might also resolve the memory issues.

eddelbuettel commented 2 years ago

Howdy. Happy to get more involved and send you what I had as a PR (or push it to a fork, or ...), but can you in a first instance debug locally? What I offered (switching based on the final characters of the file name) may be 'too loose'. It may be simpler to keep your ASCII-based load and save as before, and to explicitly call readRDS() and saveRDS().

The Rcpp Gallery piece uses a different serialization (from Boost); we may not want that. I do have a helper package, RApiSerialize (on CRAN for many years; I happen to have just sent a minor update to CRAN), but that is only needed if you want to serialize from C++, which we may not need here (but I guess we could).

The error message above seems to suggest that some wires are crossed in the Rcpp Modules part.