crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

GC doesn't seem aggressive enough at times #3997

Closed. rdp closed this issue 5 years ago.

rdp commented 7 years ago

See if this one seems to use up an inordinate amount of RAM for you (it does for me on all OS's I tried, anyway):

class Foo
  def finalize
    # Invoked when Foo is garbage-collected
    a = 3
    "bye #{3 + 4}"
  end
end

# Allocates (and never retains) a Foo instance on every iteration, forever
loop do
  Foo.new
end

I wonder if it's related to the Boehm GC finalization cycle or something. Let me know if it's reproducible. Thank you.

ozra commented 7 years ago

(quote from comment in https://github.com/kemalcr/kemal/issues/323)

2.7mb. I'm surprised the memory in a fiber isn't cleaned up when an exception is raised.

Are you saying that, when code raises within a fiber, it's not cleaned up? The stack there is allocated via mmap - outside of the GC. I know there are a few issues reporting mysterious leaks that I think are still open.

If I understood you right, you might have found a long haunting ghost!

Ping @waj @asterite just in case, otherwise - sorry to derail!

ysbaddaden commented 7 years ago

Exceptions in spawn are rescued and a message is printed about them. Then the fiber terminates normally; that shouldn't keep the fiber's stack allocated.
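
For readers following along, here is a minimal illustration of that behaviour (a sketch; the exact wording of the printed message may differ between Crystal versions):

spawn do
  raise "boom"
end

# Let the fiber run: it prints something like
# "Unhandled exception in spawn: boom" (plus a backtrace) and then terminates;
# the exception never escapes to the main fiber.
Fiber.yield
sleep 100.milliseconds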

ozra commented 7 years ago

Alright @ysbaddaden - it was a longshot of unlikeliness :) !

crisward commented 7 years ago

I've done some experiments with the GC and spawn, with libgc compiled with --enable-munmap=3

str = "hello world " * 100000
p "string size #{str.size}"
str = nil

counter = 0
loop do
  counter += 1
  puts "#{counter}. heap #{GC.stats.heap_size} free_bytes #{GC.stats.free_bytes} bytes_since_gc #{GC.stats.bytes_since_gc}"
  sleep 1.seconds
  GC.collect
end

This outputs the following, which is what I'd expect:

"string size 1200000"
1. heap 1286144 free_bytes 24576 bytes_since_gc 1211920
2. heap 1286144 free_bytes 1220608 bytes_since_gc 208
3. heap 1286144 free_bytes 1220608 bytes_since_gc 208
4. heap 1286144 free_bytes 1220608 bytes_since_gc 208
5. heap 1273856 free_bytes 1208320 bytes_since_gc 208
6. heap 65536 free_bytes 0 bytes_since_gc 208
7. heap 65536 free_bytes 0 bytes_since_gc 208

However, if I do this and create the large variable inside a spawn:

spawn do
  str = "hello world " * 100000
  p "string size #{str.size}"
  sleep 1.seconds
  str = nil
  GC.collect
end

counter = 0
loop do
  counter += 1
  puts "#{counter}. heap #{GC.stats.heap_size} free_bytes #{GC.stats.free_bytes} bytes_since_gc #{GC.stats.bytes_since_gc}"
  sleep 1.seconds
  GC.collect
end

The output is...

1. heap 65536 free_bytes 28672 bytes_since_gc 2960
"string size 1200000"
2. heap 1286144 free_bytes 20480 bytes_since_gc 208
3. heap 1286144 free_bytes 24576 bytes_since_gc 208
4. heap 1286144 free_bytes 24576 bytes_since_gc 208
5. heap 1282048 free_bytes 20480 bytes_since_gc 208
6. heap 1282048 free_bytes 20480 bytes_since_gc 208
7. heap 1261568 free_bytes 0 bytes_since_gc 208
8. heap 1261568 free_bytes 0 bytes_since_gc 208
9. heap 1261568 free_bytes 0 bytes_since_gc 208
10. heap 1261568 free_bytes 0 bytes_since_gc 208

So it's freeing some of the bytes, but not all. I realise this is a bit of a naive example. I'm just trying to get a better understanding of this before I bring exceptions and writing IO into this.

crisward commented 7 years ago

Ok, this is weird, but maybe someone else has an idea... (I'm guessing it's a timing issue)

If I repeat the test above but add a debug line to Crystal's fiber.cr library (line 92):

  protected def self.allocate_stack
    p "fiber allocate_stack"
    @@stack_pool.pop? || LibC.mmap(nil, Fiber::STACK_SIZE,
      LibC::PROT_READ | LibC::PROT_WRITE,
      LibC::MAP_PRIVATE | LibC::MAP_ANON,
      -1, 0).tap do |pointer|
      raise Errno.new("Cannot allocate new fiber stack") if pointer == LibC::MAP_FAILED
      {% if flag?(:linux) %}
        LibC.madvise(pointer, Fiber::STACK_SIZE, LibC::MADV_NOHUGEPAGE)
      {% end %}
      LibC.mprotect(pointer, 4096, LibC::PROT_NONE)
    end
  end

This happens: the memory is collected, see free_bytes and heap.

"fiber allocate_stack"
"fiber allocate_stack"
"fiber allocate_stack"
1. heap 65536 free_bytes 4096 bytes_since_gc 13040
"string size 1200000"
"fiber allocate_stack"
2. heap 1372160 free_bytes 106496 bytes_since_gc 208
3. heap 1372160 free_bytes 1310720 bytes_since_gc 208
4. heap 1372160 free_bytes 1310720 bytes_since_gc 208
5. heap 1368064 free_bytes 1306624 bytes_since_gc 208
6. heap 1368064 free_bytes 1306624 bytes_since_gc 208
7. heap 61440 free_bytes 0 bytes_since_gc 208
8. heap 61440 free_bytes 0 bytes_since_gc 208
9. heap 61440 free_bytes 0 bytes_since_gc 208
10. heap 61440 free_bytes 0 bytes_since_gc 208

crisward commented 7 years ago

This also works, adding a p to spawn. Weird.

def spawn(*, name : String? = nil, &block)
  p "spawn"
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end

benoist commented 7 years ago

What happens if you output something in

self.stack_pool_collect

This method is called from a Fiber every 5 seconds if I'm not mistaken.

crisward commented 7 years ago

Nothing, I tried that. Also interestingly this does nothing...

def spawn(*, name : String? = nil, &block)
  STDOUT.puts "spawn"
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end

The only difference between puts and p seems to be that PrettyPrint is invoked. I'm guessing something in there is helping.

crisward commented 7 years ago

yep - PrettyPrint.format("spawn", STDOUT, 79) helps memory drop too.
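
For completeness, the variant described here is the same spawn patch as before, with PrettyPrint.format called directly instead of p (a sketch of the experiment, not stdlib code):

def spawn(*, name : String? = nil, &block)
  PrettyPrint.format("spawn", STDOUT, 79) # the call that also makes the memory drop
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end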

benoist commented 7 years ago

And if you replace it with Fiber.yield, does that also release the memory?

crisward commented 7 years ago

Fiber.yield also does nothing (to release memory)...

benoist commented 7 years ago

Maybe something is happening in the printer.flush call performed by the pretty print...

crisward commented 7 years ago

I was looking at that, not sure why it would... I also thought it might be a timing issue, so I tried sleep 1.millisecond, but that didn't do anything; I even tried sleep 1.second.

benoist commented 7 years ago

def spawn(*, name : String? = nil, &block)
  "a".inspect
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end

Just calling inspect also seems to work; this is what pretty print does.

crisward commented 7 years ago

Same here. Next, what does inspect do?

benoist commented 7 years ago

def spawn(*, name : String? = nil, &block)
  String.build {}
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end

also works

crisward commented 7 years ago

All of these things seem to push initial memory use up from 1286144 to 1372160. Not sure if that is significant?

benoist commented 7 years ago

Well, the string builder allocates some capacity with GC.malloc_atomic(capacity.to_u32).as(UInt8*).

crisward commented 7 years ago

Adding GC.malloc_atomic(2048.to_u32).as(UInt8*) works too.

crisward commented 7 years ago

The size of the string created inside the fiber seems to have no effect on the gc collecting. But adding GC.malloc_atomic(2048.to_u32).as(UInt8*) always helps the collection.

ozra commented 7 years ago

So the GC doesn't collect / clean up until called upon to do a new allocation - or is it definitely only for a plain value alloc (malloc_atomic)?

crisward commented 7 years ago

Having GC.malloc_atomic(2048.to_u32).as(UInt8*) added to Crystal's spawn method seems to cause the fiber's memory to be cleaned up. Also, when messing with exceptions inside the fiber, each exception seems to add a further 4096 bytes to the heap. I'm not sure of the significance of that number. I'm still playing with various scenarios to get something which approximates the issue I'm having inside Kemal with errors in fibers.
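
For clarity, the patched spawn being described looks like this (same shape as the earlier experiments; 2048 is simply the size used in this thread):

def spawn(*, name : String? = nil, &block)
  # Dummy pointer-free allocation that appears to trigger a fuller collection
  GC.malloc_atomic(2048.to_u32).as(UInt8*)
  fiber = Fiber.new(name, &block)
  Scheduler.enqueue fiber
  fiber
end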

ozra commented 7 years ago

Out of curiosity: do both GC.malloc_atomic(2048) and GC.malloc(2048) help? And does the size matter? Or is there a minimum required to trigger it?

benoist commented 7 years ago

Another interesting thing to point out:

benoist@benoist-i7$ crystal test.cr
1. heap 65536 free_bytes 28672 bytes_since_gc 2960
"string size 1200000"
2. heap 1286144 free_bytes 20480 bytes_since_gc 208
3. heap 1286144 free_bytes 24576 bytes_since_gc 208
4. heap 1286144 free_bytes 24576 bytes_since_gc 208
5. heap 1286144 free_bytes 24576 bytes_since_gc 208
^C[~/code]
benoist@benoist-i7$ crystal test.cr --release
1. heap 65536 free_bytes 28672 bytes_since_gc 2960
"string size 1200000"
2. heap 1286144 free_bytes 16384 bytes_since_gc 208
3. heap 1286144 free_bytes 1224704 bytes_since_gc 208
4. heap 1286144 free_bytes 1224704 bytes_since_gc 208
5. heap 1286144 free_bytes 1224704 bytes_since_gc 208

So in release mode allocating new memory before spawning is not required.

crisward commented 7 years ago

@benoist Thanks for your help with this. Ultimately I'm trying to track down where memory is leaking to cause the issue I'm having in Kemal. Running GC.malloc_atomic(2048) does seem to help in this simple test case, but if an exception is thrown inside the fiber I get the same problem of a big spike in memory. I thought it might be to do with the socket not being closed and flushed, but I can't find any evidence of this; I've tried patching HTTP::Server::Response.write and adding a rescue with close and flush, but to no effect.
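
The kind of monkey-patch attempted above would look roughly like this (a sketch against the 0.21-era API; the write signature and the rescue handling here are assumptions, and as noted it had no effect):

require "http/server"

class HTTP::Server::Response
  def write(slice : Slice(UInt8))
    previous_def # call the original buffered write
  rescue ex
    # On a broken pipe, try to flush and close the socket, ignoring any
    # further errors, then re-raise so the server still sees the failure.
    begin
      flush
      close
    rescue
    end
    raise ex
  end
end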

Also my leak is happening with the --release flag too, which points to it being some cause other than fibers, though it could be a contributing factor.

My leaks are always preceded by an Exception: Error writing file: Broken pipe (Errno), as detailed here.

I could do with input from someone with a better understanding of exactly how fibers and IO work, especially in connection with the GC. @asterite @ysbaddaden?

Sija commented 7 years ago

@crisward I've seen evidence of Exception: Error writing file: Broken pipe (Errno) in cases when connections were interrupted.

crisward commented 7 years ago

It's easy enough to recreate - using something like wrk -c2 -t2 -d1 http://127.0.0.1:3000. It seems to leak a noticeable amount when you're returning lots of data, or there are lots of things going on in the fiber.

wrk cuts out mid-request when the time runs out, so it creates the bug. Apache Bench does full requests each time, so it doesn't create the leak, even with similar loads.

rdp commented 7 years ago

The tricky part here is that it may not be a memory "leak" per se (if it is, run valgrind ./your_crystal_app and it will report a memory leak) but possibly a "left over reference" where something is still pointing to the big block, so it cannot be reclaimed. See also the Quora link pointed to before. So it's a bit tricky to find "what's pointing to what", as it were... though maybe possible :)

crisward commented 7 years ago

@rdp I don't disagree. I've done checks with Apple's Instruments and no leaks appear. Finding what's pointing where, though, is a tough one.

ozra commented 7 years ago

Not to be nitpicking, @rdp, but a left-over ref causing unwanted heap growth is a leak in my book, just like circular refs for GCs that don't cope with those, etc., whatever B-)

@crisward - did you create an absolute minimum case of code to reproduce the behaviour on cut-off connections? And for the mystic cleanup triggers: did you check whether standard "conservative" malloc also triggers it, just like malloc_atomic?

(the more clues the better)

crisward commented 7 years ago

@ozra GC.malloc doesn't seem to work. It frees some memory, but not all of the unused memory like malloc_atomic does.

I've not got a very minimal use case; I'd say that would have to be just using the standard library. Will work on it when I get a chance. I've been looking into getting valgrind working. I think I'll have to do it in a VM as it's not macOS friendly (well, not Sierra at least).

rdp commented 7 years ago

It's either that, or my theory was that a "full" GC wasn't being called for some odd reason [poor math? too many small mallocs? might have to dive into bdw-gc and see if that's the case and, if so, why, etc.]. Good luck!

crisward commented 7 years ago

Super simple reproduction. Though please correct me if I'm misunderstanding the data.

require "http/server"

server = HTTP::Server.new(8080) do |context|
  context.response.content_type = "text/plain"
  sleep 100.milliseconds # slow down to simulate db stuff
  bigstr = "Hello World! " * 100000
  context.response.print bigstr
end

spawn do
  counter = 0
  loop do
    counter += 1
    puts "#{counter}. heap #{GC.stats.heap_size} free_bytes #{GC.stats.free_bytes} bytes_since_gc #{GC.stats.bytes_since_gc}"
    GC.collect
    sleep 1.seconds
  end
end

puts "Listening on http://127.0.0.1:8080"
server.listen

Two tests, one using ab (Apache Bench) and one using wrk. ab does not cause broken pipe errors because it runs for a set number of requests. wrk runs for a set number of seconds and cuts off mid-request if necessary, so it causes broken pipes.

The results are

ab -c10 -n100 http://127.0.0.1:8080/

Listening on http://127.0.0.1:8080
1. heap 131072 free_bytes 49152 bytes_since_gc 24096
2. heap 131072 free_bytes 53248 bytes_since_gc 320
3. heap 131072 free_bytes 53248 bytes_since_gc 208
4. heap 131072 free_bytes 53248 bytes_since_gc 208
5. heap 81920 free_bytes 4096 bytes_since_gc 208
6. heap 77824 free_bytes 0 bytes_since_gc 208
7. heap 19546112 free_bytes 5500928 bytes_since_gc 10573104
8. heap 19546112 free_bytes 6930432 bytes_since_gc 10465904
9. heap 19546112 free_bytes 18956288 bytes_since_gc 2992
10. heap 19546112 free_bytes 19378176 bytes_since_gc 112
11. heap 19546112 free_bytes 19378176 bytes_since_gc 112
12. heap 19542016 free_bytes 19374080 bytes_since_gc 112
13. heap 16834560 free_bytes 16666624 bytes_since_gc 112
14. heap 167936 free_bytes 0 bytes_since_gc 112
15. heap 167936 free_bytes 0 bytes_since_gc 112
16. heap 167936 free_bytes 0 bytes_since_gc 112

wrk -c10 -t10 -d1 http://127.0.0.1:8080/

1. heap 131072 free_bytes 49152 bytes_since_gc 24096
2. heap 131072 free_bytes 53248 bytes_since_gc 320
3. heap 19550208 free_bytes 6156288 bytes_since_gc 10416224

Unhandled exception in spawn:
Error writing file: Broken pipe (Errno)
0x10fda8b32: *CallStack::unwind:Array(Pointer(Void)) at ??
0x10fda8ad1: *CallStack#initialize:Array(Pointer(Void)) at ??
0x10fda8aa8: *CallStack::new:CallStack at ??
0x10fd9df61: *raise<Errno>:NoReturn at ??
0x10fe79914: *TCPSocket+@IO::FileDescriptor#unbuffered_write<Slice(UInt8)>:Int32 at ??
0x10fe795d1: *TCPSocket+@IO::Buffered#flush:TCPSocket+ at ??
0x10fe79cab: *TCPSocket+@IO::Buffered#close:Nil at ??
0x10fe74b37: *HTTP::Server::RequestProcessor#process<(OpenSSL::SSL::Socket::Server | TCPSocket+), (OpenSSL::SSL::Socket::Server | TCPSocket+), IO::FileDescriptor>:Nil at ??
0x10fe6da65: *HTTP::Server::RequestProcessor#process<(OpenSSL::SSL::Socket::Server | TCPSocket+), (OpenSSL::SSL::Socket::Server | TCPSocket+)>:Nil at ??
0x10fe6d986: *HTTP::Server#handle_client<(TCPSocket+ | Nil)>:Nil at ??
0x10fda6ce6: ~procProc(Nil)@/usr/local/Cellar/crystal-lang/0.21.1/src/http/server.cr:163 at ??
0x10fdb2ec4: *Fiber#run:(IO::FileDescriptor | Nil) at ??
0x10fda3199: ~proc2Proc(Fiber, (IO::FileDescriptor | Nil))@/usr/local/Cellar/crystal-lang/0.21.1/src/fiber.cr:29 at ??
(removed others for brevity)

4. heap 19562496 free_bytes 7094272 bytes_since_gc 10748320
5. heap 19562496 free_bytes 17895424 bytes_since_gc 112
6. heap 19562496 free_bytes 18034688 bytes_since_gc 112
7. heap 19562496 free_bytes 18034688 bytes_since_gc 112
8. heap 18808832 free_bytes 17281024 bytes_since_gc 112
9. heap 12140544 free_bytes 10612736 bytes_since_gc 112
10. heap 1527808 free_bytes 0 bytes_since_gc 112
11. heap 1527808 free_bytes 0 bytes_since_gc 112
12. heap 1527808 free_bytes 0 bytes_since_gc 112
13. heap 1527808 free_bytes 0 bytes_since_gc 112
14. heap 1527808 free_bytes 0 bytes_since_gc 112
...
141. heap 1527808 free_bytes 0 bytes_since_gc 112

After broken pipes, the server shows 1.5MB of memory usage vs 0.16MB with ab and no broken pipes. Also, the GC is reporting no free_bytes.

Sija commented 7 years ago

Correct me if I'm wrong, but doesn't spawn from HTTP::Server#listen prevent the rescue in HTTP::Server::RequestProcessor#process from working?

ysbaddaden commented 7 years ago

@crisward what happens with --no-debug? That shouldn't have any impact unless the stacktrace is printed. We currently load and decode the DWARF sections (function names, file line numbers) once, then memoize them globally, in order to speed up further stacktrace prints (no re-load of the file, no re-decode of DWARF sections). Used memory shouldn't grow anymore after the first stacktrace is printed.
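
For anyone who wants to test that, the flag goes on the build command, e.g. (the file name is just a placeholder):

crystal build --release --no-debug server.cr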

crisward commented 7 years ago

@ysbaddaden --no-debug doesn't seem to have an effect. However, there is something odd going on. Sporadically the process can run fine after being hit by wrk, but most of the time I get an effect similar to the above. Just to put this into context, my live server is going from 3MB to 60MB of heap after one or two broken pipes in succession; there is a lot more code running / databases etc., but I'm still surprised it can almost instantly grow 20x.

ozra commented 7 years ago

I dug around a bit in Boehm over lunch, but didn't find anything obvious concerning freeing vis-à-vis atomic allocs. But it tries to take shortcuts to keep collection time down, sometimes doing partial collections. I can only speculate that some flag is set on the no-pointer kind when doing the atomic alloc, which triggers a more detailed collect. But these speculations aren't helpful in any way, so.. :-/

rdp commented 7 years ago

I believe there is an option in bdwgc as to whether it collects "as it goes" or not; it might be interesting to also experiment with that, FWIW...

crisward commented 7 years ago

@ysbaddaden is there any way to free the memory used in memoizing the stack trace? Just wondering to what extent this is affecting heap allocations.

ysbaddaden commented 7 years ago

The stacktraces aren't memoized, and should be collected when the exception object is collected.

The decoded DWARF sections, used to produce human-friendly backtraces, are memoized, and won't be collected.
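
A toy illustration of the memoize-on-first-use pattern being described (the names here are made up; the real cache is CallStack's @@dwarf_line_numbers in src/callstack.cr):

module BacktraceCache
  @@line_numbers : Hash(UInt64, String)?

  # Decode once, then keep the result for the lifetime of the process so
  # later backtraces don't have to reopen the binary and decode DWARF again.
  def self.line_numbers
    @@line_numbers ||= decode_debug_sections
  end

  # Stand-in for the expensive DWARF decode.
  def self.decode_debug_sections : Hash(UInt64, String)
    {} of UInt64 => String
  end

  # What crisward's CallStack.empty patch below effectively does: dropping
  # the reference lets the GC reclaim the decoded data.
  def self.drop
    @@line_numbers = nil
  end
end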

crisward commented 7 years ago

Got it, thanks - https://github.com/crystal-lang/crystal/blob/ae678b204d8eeaeeb1125bc3fae42e8292041399/src/callstack.cr#L169

crisward commented 7 years ago

@ysbaddaden I think I've found the problem. @@dwarf_line_numbers seems to be what is eating over 60MB of RAM on my production server.

I patched CallStack with:

struct CallStack
  def self.empty
    @@dwarf_line_numbers = nil
  end
end

I also created a method which causes an error.

def trip
  1 / 0
end

I linked both of these to controller endpoints. If I hit the exception endpoint, the memory of my app goes up from 1.2MB to around 60MB.
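
For illustration, the two endpoints might look something like this in Kemal (the routes and paths are made up; trip and the CallStack.empty patch are the snippets above):

require "kemal"

get "/trip" do
  trip # raises, which decodes the DWARF sections and fills @@dwarf_line_numbers
end

get "/empty" do
  CallStack.empty # drops the cache so the GC can return the memory to the OS
  "cache dropped"
end

Kemal.run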

As soon as I call the empty method, the free bytes jump up to 58MB, and shortly afterwards the memory is returned to the OS, back down to 1.2MB of heap.

The @@dwarf_line_numbers must be a lot smaller on a Mac, as the memory only goes up to around 2MB. On Ubuntu without the release flag, it only goes up to 10MB, but the empty method still brings it back down to 1.5MB. Not sure why the release version is much larger.

The earlier suspicions of Broken pipe are unfounded; it just happened to be the only uncaught exception I was getting live.

I suppose the two questions that leap to mind are:

ysbaddaden commented 7 years ago

60MB isn't significant. If compiled with --no-debug then it should use no memory at all.

Yes, memoization is required, otherwise

sdogruyol commented 7 years ago

@crisward i'm amazed by your debugging skills 💯

ysbaddaden commented 7 years ago

... it would require reopening and redecoding the debug sections every time an exception is printed, which is costly (in both memory and CPU). This could be abused as an attack vector, too.

crisward commented 7 years ago

@ysbaddaden what about cleaning up the @@dwarf_line_numbers after a reasonable interval? I noticed there is already something that cleans up unused fibers in kernel.cr. Something similar could be done.

The --no-debug option isn't easy for me to activate as I'm using the Heroku buildpack. Also, I'd quite like stack traces to be sent to Sentry; I just don't want applications which can suddenly use more memory than I expected, then hold onto it. The exception which triggered this only happens maybe once a day.

Crystal's low memory footprint is one of its benefits. We run lots of content-managed sites from a single server. The chances of them all having uncaught exceptions at the same time are very low. So being able to clear the memory used after a reasonable interval would be good. I can obviously do this at the application level (see the sketch below), but it could be useful to other developers.
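
A sketch of that application-level cleanup, assuming the CallStack.empty patch from earlier in the thread (the interval is arbitrary):

spawn do
  loop do
    sleep 5.minutes
    # Drop the decoded DWARF cache; it will be re-decoded the next time a
    # backtrace is printed, trading a bit of CPU for a small steady-state heap.
    CallStack.empty
  end
end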

benoist commented 7 years ago

60MB is significant if you normally only use 1MB. Having said that, 60MB shouldn't be a problem. If it is a problem because you want to run as many processes as possible on a 512MB VM, compiling with --no-debug seems like a good solution.

Even if you clear the dwarf line numbers, having your application spike like that can be dangerous if memory availability is low.

crisward commented 7 years ago

In some ways, it may be better to assign @@dwarf_line_numbers on application launch. At least you know what you're dealing with, instead of working under the false assumption that you have a 1MB app running. Though I think if you're running lots of small apps and the memory usage drops again after a minute, one or two of them spiking at once wouldn't be a problem. Also, I find it odd that different setups create vastly different memory loads for the same app.

ysbaddaden commented 7 years ago

assign @@dwarf_line_numbers on application launch

That would affect programs with a short lifespan.

crisward commented 7 years ago

What about adding the patched method above? Perhaps call it CallStack.drop_exception_cache, then document what it does so app developers can decide if they want to use it.
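
For reference, the proposal amounts to shipping the earlier patch under a documented name; a minimal sketch (not part of the stdlib):

struct CallStack
  # Drops the memoized DWARF line numbers decoded for backtraces. They are
  # decoded again the next time an exception's backtrace is printed.
  def self.drop_exception_cache
    @@dwarf_line_numbers = nil
  end
end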