frazer-lab / cluster

Repo for cluster issues.

Constant RAM use on flh2 #261

Closed billgreenwald closed 5 years ago

billgreenwald commented 6 years ago

Hey Paul (& Hiroko, though I know she's on vacation),

We have ~110 GB of RAM on flh2 that is unaccounted for but constantly in use. David and I checked all our notebooks and htop, and can only find about 15 GB that should be in use.

Any thoughts?

Thanks!

tatarsky commented 6 years ago

Ah, good to know on vacation as I was trying to arrange a drive replacement. So I'll postpone that.

When you measure "RAM use", are you looking at "cached"? That is basically the Linux page cache, designed to speed up I/O; it counts as "used" but can be pushed out by regular memory demand.

You can flush those bytes to disk, but they will come back, and flushing has a performance impact. Are you actually being denied memory?
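For context, the reclaimable portion is visible with the standard procps tools (column layout varies a bit between versions, and MemAvailable needs a reasonably recent kernel):

free -h        # "buff/cache" is reclaimable; "available" estimates what allocations can really get
grep -E 'MemTotal|MemFree|MemAvailable|Buffers|^Cached' /proc/meminfo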

tatarsky commented 6 years ago

Also, I'm happy to issue the "drop cache" command to confirm that's what you are looking at, but be aware that the nature of Linux is to always use free memory for something, and that caching is vital to performance (it will re-fill). Your requests should NOT be getting denied by cache memory usage, though, so please confirm that's not happening.
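For the record, the "drop cache" command in question is just the standard kernel knob, run as root:

sync                                # write dirty pages out to disk first
echo 3 > /proc/sys/vm/drop_caches   # 1 = page cache, 2 = dentries/inodes, 3 = both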

billgreenwald commented 6 years ago

David had issues running something that needed a lot of RAM; it was only able to use 50% of the RAM on the system before it started hitting swap. I think this is what you mean by "requests getting denied".

As far as the type of RAM goes, I am going off of the htop coloring, which is not an exact science, since I can't seem to get those numbers to match the actual Mem% columns in top/htop.

Most of the RAM used shows as green, which is used memory; there is a similar amount of cached, but that should be fine to leave (see the first paragraph of this reply, though).
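For reference, a few standard ways to tell whether a job was genuinely denied memory or just pushed into swap early:

free -h                        # the Swap row shows how much has actually been pushed out
cat /proc/sys/vm/swappiness    # higher values make the kernel swap sooner
vmstat 5                       # nonzero si/so columns mean active swapping right now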

tatarsky commented 6 years ago

Yep. I'm looking. I see a fair chunk of items not in cached memory but don't see who has it.

I see one of @djakubosky's processes (PID 30291) with a large number of deleted /tmp files still held open. In some cases I've seen such files keep memory tied up over time in odd ways. Is that process still active?

python    30291 30461     djakubosky   18u      REG              253,0      4096           40370195 /tmp/ffiVV2JE0 (deleted)
python    30291 30462     djakubosky   18u      REG              253,0      4096           40370195 /tmp/ffiVV2JE0 (deleted)
python    30291 30463     djakubosky   18u      REG              253,0      4096           40370195 /tmp/ffiVV2JE0 (deleted)
python    30291 30464     djakubosky   18u      REG              253,0      4096           40370195 /tmp/ffiVV2JE0 (deleted)
python    30291 30465     djakubosky   18u      REG              253,0      4096           40370195 /tmp/ffiVV2JE0 (deleted)
python    30291 30466     djakubosky   18u      REG              253,0      4096           40370195 /tmp/ffiVV2JE0 (deleted)
python    30291 30467     djakubosky   18u      REG              253,0      4096           40370195 /tmp/ffiVV2JE0 (deleted)
python    30291 30468     djakubosky   18u      REG              253,0      4096           40370195 /tmp/ffiVV2JE0 (deleted)
python    30291 30469     djakubosky   18u      REG              253,0      4096           40370195 /tmp/ffiVV2JE0 (deleted)

I'll flush the cache and see what's left after it.
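For anyone wanting to reproduce that check: the listing above is lsof output, and open-but-deleted files can be found directly with it (run as root to see all users' processes):

lsof +L1    # files with a link count of zero, i.e. deleted but still held open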

billgreenwald commented 6 years ago

That process is still active, but he said it shouldn't have a ton of temp files in it; he only read in two files with it, via Jupyter.

We aren't sure why. We can't kill it right now, since it takes a few hours to get it up and running and it's currently in use.

tatarsky commented 6 years ago

That's fine. They look small. When I see large numbers in the fourth column there, I've seen memory get stuck in what is called a slab cache (another of the Linux memory-caching performance mechanisms). Leave it running. I'm asking the kernel to drain its caches now (it takes a while).
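If you want to watch the slab side of this yourselves, the kernel exposes the counters directly (slabtop ships with procps on most distros):

grep -E '^Slab|SReclaimable|SUnreclaim' /proc/meminfo   # total slab, split into reclaimable/unreclaimable
slabtop -o                                              # one-shot listing of the largest slab caches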

billgreenwald commented 6 years ago

Sounds good.

The system is already down to 12.3 GB used, so it looks like that fixed it.

tatarsky commented 6 years ago

So there may be some form of slab or open-file leak in that or another process. Let's see what happens. Leave this open. You should rarely have to flush the cache.

billgreenwald commented 6 years ago

What's the way for us to check this in the future, before looping you in?

tatarsky commented 6 years ago

Hard to say at the moment as I don't know what was involved.
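As a rough first pass, though, you can subtract the reclaimable pieces from "used" yourselves; if what's left is far above what your processes add up to, it's worth flagging:

# "real" use = total - free - buffers - page cache - reclaimable slab
awk '/MemTotal:|MemFree:|Buffers:|^Cached:|SReclaimable:/ {a[$1]=$2}
     END {printf "%.1f GB apparently used\n",
          (a["MemTotal:"]-a["MemFree:"]-a["Buffers:"]-a["Cached:"]-a["SReclaimable:"])/1048576}' /proc/meminfo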

billgreenwald commented 6 years ago

Sounds good. Will keep you posted.

tatarsky commented 6 years ago

Took a quick look before I call it a day. I still see some growth in the "used" category, but I'll try some periodic lower-level command loops to see if I can spot whatever I suspect is leaking.
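A sketch of the kind of loop I mean, logging raw counters and the top memory consumers so growth stands out over time (the interval and log path are arbitrary choices):

while true; do
    date                                             >> /var/tmp/mem-watch.log
    grep -E 'MemFree|^Cached|Slab' /proc/meminfo     >> /var/tmp/mem-watch.log
    ps -eo rss,pid,user,comm --sort=-rss | head -n 5 >> /var/tmp/mem-watch.log
    sleep 300
done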

tatarsky commented 6 years ago

Noting that this has not re-occurred, as far as I can tell. Monitoring a bit longer.

tatarsky commented 6 years ago

Leaving open one more week while I travel.

tatarsky commented 5 years ago

This appears not to have re-occurred, at least from Ganglia's point of view. Closing for now, but I'll be reviewing a periodic memory flush.
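For reference, a periodic flush would just be the earlier drop_caches command on a schedule, e.g. a root cron entry (file name hypothetical; whether the I/O cost is worth it is exactly what I'd be reviewing):

# /etc/cron.d/drop-caches (hypothetical) -- Sundays at 03:00
0 3 * * 0  root  sync && echo 3 > /proc/sys/vm/drop_caches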