LowellObservatory / NightShift

Collection of sub-modules used in the NightWatch (and other Night*) project
Mozilla Public License 2.0

Fix GOES memory leak #11

Closed · astrobokonon closed this 4 years ago

astrobokonon commented 4 years ago

I have been fighting this for a while without much success.

[memory usage plot]

It looks to be a real leak, since the system load rises with this as well. It really does seem like there is something not letting go rather than just a caching artifact.

I've tried to kill off as much as I could figure out in the plot code in b75dbc453ee70e824b97e14cfb7fb8d013d084f7, but so far it has had no appreciable effect.
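For reference, the kind of per-figure cleanup that usually matters in a matplotlib/cartopy plotting loop looks something like this (just a sketch with illustrative names, not the actual NightShift plot code):

```python
import gc
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no GUI state to accumulate
import matplotlib.pyplot as plt


def render_frame(data, outpath):
    """Render one image and release everything matplotlib holds onto."""
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.imshow(data)
    fig.savefig(outpath)

    # Explicitly close the figure; pyplot keeps a reference to every open
    # figure in its figure manager, so figures are never freed otherwise.
    plt.close(fig)

    # Optionally force a collection pass after each frame while leak hunting.
    gc.collect()
```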

astrobokonon commented 4 years ago

Looking at the plot, the severity has decreased since the beginning of October. Going back even further, the trouble really started around 2019-09-10. I must have done an update then, and I've been regretting it ever since.

[memory usage plot over a longer time range]

I really, really, really need better version control of the container components!

astrobokonon commented 4 years ago

Since looper_goes16aws.py is just an infinite loop, the culprit must be somewhere inside the while True block. It never actually exits that block, so whatever is created there probably stays in scope and never gets garbage collected.

I'll see if there's a way I can add some additional in-process memory logging. I tried to profile locally but didn't see anything; it might be the kind of thing I need to leave running for a day or two before the trend shows up, though.
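Something like this is what I have in mind for the in-process logging (a sketch only; the loop body and the five-minute sleep are placeholders, not the real looper):

```python
import time
import resource
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

while True:
    # ... fetch, reproject, and plot the latest GOES imagery here ...

    # ru_maxrss is the peak resident set size, in kilobytes on Linux.
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
    print("peak RSS: %.1f MiB" % peak_mib)

    # Show the allocation sites that have grown the most since startup.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(baseline, "lineno")[:5]:
        print(stat)

    time.sleep(300)
```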

astrobokonon commented 4 years ago

The easiest thing to do right now is to clear the board: I just stopped all containers, deleted all images, and I'm rebuilding. Beyond that I'm going to have to dig into something that can periodically dump memory maps and stacks and trace back from there. It's been forever since I last used objgraph and tools like it (https://mg.pov.lt/blog/hunting-python-memleaks.html).

(well, I didn't stop sysTools; I wanted portainer still up)
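If it comes to that, the objgraph route from the linked post boils down to printing type growth between loop iterations and then chasing back-references on whatever keeps climbing (a sketch, not anything wired in yet):

```python
import gc
import objgraph


def dump_object_growth():
    """Print which object types have grown in count since the last call."""
    gc.collect()  # drop anything merely uncollected, so only real growth shows
    objgraph.show_growth(limit=10)

# Once a suspicious type shows up, chase whatever is holding onto it, e.g.:
#   objgraph.show_backrefs(objgraph.by_type('SomeSuspectType')[:3],
#                          filename='backrefs.png')
```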

astrobokonon commented 4 years ago

Also - once the 'pytest' image is built, I should be able to start up a container of it, dump the various Python component versions, and then lock the ones I'm suspicious of. Didn't think of that before for some reason.
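Dumping the versions from inside the container can be a one-off snippet like this (the package list here is just the usual suspects from this pipeline, not exhaustive):

```python
# Run inside the freshly built container, e.g. `python dump_versions.py`,
# and keep the output so the suspicious packages can be pinned later.
import importlib

for name in ("numpy", "matplotlib", "cartopy", "pyresample", "netCDF4"):
    try:
        mod = importlib.import_module(name)
        print("%-12s %s" % (name, getattr(mod, "__version__", "unknown")))
    except ImportError:
        print("%-12s not installed" % name)
```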

astrobokonon commented 4 years ago

One thing I did remember to do this time is to change the continuumio/miniconda3 base image to a specific tagged version, and I ditched a 'conda update conda' directive in the LIGDockerfile. That should make this much less of a moving target, at least.

astrobokonon commented 4 years ago

No dice with the rebuild; the same slow rise over 12 hours.

[memory usage plot for the 12-hour rebuild test]

The plateau between 2019-10-17 06:00 and 09:00 is utterly mysterious. The logs indicate it was still chugging away in the background, reprojecting and moving around images as usual. But the slope is waaaay flatter.

The steps definitely correlate with times of actual reprojection, and the dips generally correlate with entering the sleep state.

astrobokonon commented 4 years ago

Did some light refactoring and still no big improvement in local testing. I'm going to remove the plot making entirely, to see if I can isolate this to a particular package. I still suspect the reprojection the most, but cartopy and/or matplotlib could be involved, even though a very similar piece of code is at work in the radar imagery and there's no leak there...

astrobokonon commented 4 years ago

Testing locally with some more refactoring AND pyresample@1.13.2, and testing in production with just pyresample@1.13.2. Going to let both stew overnight and see who comes out ahead in the memory game.

So far, the refactoring is coming out slightly ahead! Will be interesting to see how it shakes out overnight.

astrobokonon commented 4 years ago

Draft changes now in c1c34d27928fb5d516bc57abcd1561f37f26247b (but on a new branch). Need to remember that the sleep timer in this one is super short before I merge it back down.

astrobokonon commented 4 years ago

I got good results once I fiddled more with the scope of dat, I think; my guess is that the references to various variables inside it were keeping the data in scope far longer than it needed to be. The documentation for netCDF4.Dataset() implies that's the case too; see the keepweakref argument here: https://unidata.github.io/netcdf4-python/netCDF4/index.html#netCDF4.Dataset
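The shape of that fix is roughly this (a sketch with illustrative names; "Rad" is just an example variable, and the reprojection/plotting steps are elided):

```python
import gc
import netCDF4


def process_latest_file(ncfile):
    """Open one GOES netCDF file, pull out what's needed, then release it."""
    dat = netCDF4.Dataset(ncfile, "r")
    try:
        # Copy out only what the downstream steps need, as a plain array,
        # so nothing keeps a reference back to the Dataset itself.
        rad = dat.variables["Rad"][:]
    finally:
        # Close the file and drop it as soon as the data is out;
        # keepweakref=True on the Dataset is another lever the docs describe.
        dat.close()

    # ... reprojection and plotting would happen here ...

    # Drop the big array explicitly before the loop sleeps, rather than
    # letting it sit in scope until the next iteration rebinds the name.
    del rad
    gc.collect()
```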

astrobokonon commented 4 years ago

After several hours of both running:

refactor: 0.464 GiB
prod: 1.040 GiB

I win!

astrobokonon commented 4 years ago

My local test was dead solid all night long, and I had already claimed victory last night and put this into production since the weather was shit. The flatline is from when I put it in last night until just now.

[memory usage plot showing the flatline after deployment]
