crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

GC doesn't seem aggressive enough at times #3997

Closed rdp closed 5 years ago

rdp commented 7 years ago

See if this one uses up an inordinate amount of RAM for you (it seems to for me, on every OS I tried):

class Foo
  def finalize
    # Invoked when Foo is garbage-collected
    a = 3
    "bye #{3 + 4}"
  end
end

# Allocates Foo instances (each with a finalizer) forever
loop do
  Foo.new
end

I wonder if it's related to the Boehm GC finalization cycle or something. Let me know if it's reproducible. Thank you.

waj commented 7 years ago

Objects with finalizers are collected after two GC cycles: the first cycle runs the finalizer and the second collects the object's memory. If you add a call to GC.collect inside the loop, the memory usage will stay really low.

For this reason I think it's a bad idea to allocate so many unreachable objects with finalizers in a tight loop. In fact, I think finalizers should be avoided whenever possible. They are useful for releasing OS resources, but otherwise there is almost always another (more deterministic) option to release resources or finish some work.

Can you give more details about what you are trying to accomplish with finalizers?
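The "more deterministic option" mentioned above could look something like this sketch (the `Resource` class and its names are hypothetical): release the resource with an explicit `close`, and offer a block form that guarantees the release via `ensure`, so no finalizer is needed:

```crystal
# Sketch of deterministic resource release (hypothetical Resource class),
# instead of relying on a finalizer.
class Resource
  getter? closed = false

  def close
    @closed = true # release the OS resource here
  end

  # Block form: the resource is always released, even on exceptions.
  def self.open(& : Resource ->) : Resource
    res = new
    begin
      yield res
    ensure
      res.close
    end
    res
  end
end

res = Resource.open { |r| }
puts res.closed? # => true
```

With this pattern the resource is released the moment the block exits, independently of when (or whether) the GC runs.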

rdp commented 7 years ago

It does stay down with GC.collect (wait, is GC.collect not being called underneath? I do see finalizers being called. Is there a difference between GC.collect and the standard collection cycle, and a reason why full collections seem to run only infrequently while RAM usage goes up and up?). That exception trace still occurs occasionally (at least on OS X).

I'm trying to debug why servers on "some machines" seem to grow in used memory forever. The SSL sockets used do use finalizers FWIW, but I somehow doubt they're the culprit: https://github.com/crystal-lang/crystal/issues/3983

Maybe Crystal is keeping an "old reference" to ancient fibers around, or maybe it's only allocating a small amount of memory each time, which BDW doesn't count towards its "should GC now" metric, or something odd like that?

benoist commented 7 years ago

What I'm seeing with regard to GC is that the amount of heap memory seems to keep growing, but the actual memory usage is mostly lower (see #3992). If I print GC stats I get this result:

GC::Stats(@heap_size=3233890304, @free_bytes=3233832960, @unmapped_bytes=0, @bytes_since_gc=0, @total_bytes=8000019728)

As you can see in my example, the free bytes are almost the same as the heap size; maybe that's the case here as well?

rdp commented 7 years ago

The good news here is that if I add in an explicit GC.collect every so often:

spawn do
  loop do
    sleep 5
    GC.collect
  end
end

RES memory seems to be "constrained" to around 1GB, or with sleep 0.5 it stays around 100MB. So the GC feels like it's working, not allowing leaks, it's just not being called often enough, perhaps, for some odd reason?

RX14 commented 7 years ago

@rdp GC.collect is expensive; RAM not being used by other programs is almost free. For that reason, it's likely that the GC will simply use more RAM rather than reclaiming more often under heavy allocation pressure (i.e. allocating in a loop).

ysbaddaden commented 7 years ago

Could we talk about GC in a single thread and close the other issues? They're all related and come down to the same thing: don't allocate repeatedly.

I believe @RX14 put in the final nail: don't test the GC on something it's very bad at and warns you repeatedly about, telling you exactly what you'll notice: a memory leak!

If you want to test the GC, do it on real applications. Don't allocate in a tight loop; reuse a single Array object, for example, instead of allocating a new one on each iteration. HTTP::Server keeps HTTP::Response objects, for example, resetting and reusing them for the next request.
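The reuse suggestion above, as a minimal sketch: clear and refill one Array instead of allocating a fresh one per iteration, so the backing buffer is allocated once and reused.

```crystal
# Reuse a single buffer across iterations instead of allocating a new
# Array each time; clear resets the size but keeps the allocated capacity.
buffer = [] of Int32

3.times do |i|
  buffer.clear
  5.times { |j| buffer << i * j }
  # ... work with buffer here ...
end

puts buffer.size # => 5
```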

benoist commented 7 years ago

Well, the reason I started asking these questions is that I'm seeing this in my real application. The actual number of items assigned to an array is a lot lower, and it takes a lot longer for the memory to grow to a size where it causes problems, but it is causing a problem. The only option at that point is to restart the application, and then it can run fine for days.

Any tips on how to test where the leaks actually are, in a real application on a production server, without reducing the code to things the GC wasn't made to do in the first place?

rdp commented 7 years ago

OK, I see now that allocations in finalize don't leak per se, so I'm proposing we kind of hijack this as the main thread for cases of "if I perform periodic GC.collect, my process stays at 10MB, otherwise it goes up to 5GB", if that's all right. Here's an example:

class Foo
  def finalize
    "bye #{3}"
  end
end

loop {
  Foo.new
}

I guess what I'd like answered is: if you allocate memory frequently, why does a full GC.collect (which would seem useful) fire only rarely, or not at all? Are there bdwgc options that might help, for instance? :) Just for discussion, as it were...
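On bdwgc options: if I recall its README correctly, the collector reads a few tuning knobs from the environment at startup, e.g. GC_FREE_SPACE_DIVISOR (higher values trade more frequent collections for a smaller heap). A sketch, assuming a binary named ./my_app:

```shell
# Hypothetical invocation: raise the free-space divisor so bdwgc
# collects more eagerly instead of growing the heap (the default is 3).
GC_FREE_SPACE_DIVISOR=10 ./my_app
```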

#3992 doesn't seem like quite the same issue, since it's more concerned with allocating very large objects, and GC.collect didn't help there at all, FWIW.

ysbaddaden commented 7 years ago

A quick search turned this answer, which looks interesting: https://www.quora.com/Is-there-any-way-to-implement-a-smart-pointer-in-C

rdp commented 7 years ago

#3333 might have been related to this.

sa-0001 commented 7 years ago

I also have an issue with runaway memory use due to repeated allocations within a loop:

fileAll = File.open("/target-path/all.xml", "a")

Dir.glob("/source-path/*.xml").sort.each do |path|
  xml = File.read(path, "utf8")
    .gsub(/>[\s]*\n[\s]*</, "><")
    # ... various replaces ...
    .gsub(/<\/Product>/, "</Product>\n")

  fileAll.puts xml
end

fileAll.close

Of course I get the classic error "GC Warning: Repeated allocation of very large block (appr. size 35000000): May lead to memory leak and poor performance.".

I have tried the code as seen above, along with various other experimental measures:

  • GC.collect at top of loop
  • GC.collect at bottom of loop
  • open/close file within loop
  • combinations of the above

In all cases, as the script processes these files of about 20-40MB, the memory usage inevitably creeps up and eventually it will crash.

Most advice I see seems to be simply "don't repeatedly allocate within a loop". But... how to avoid it?
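One way to avoid the very large blocks is to stream each file line by line instead of File.read-ing the whole 20-40MB into one String. A sketch with hypothetical file names; note this only works for replacements that don't span line boundaries (the `/>[\s]*\n[\s]*</` one above does, so it would need different handling):

```crystal
# Sketch: stream input line by line so the GC never sees one huge String.
# File names are hypothetical; only line-local replacements work this way.
File.write("in.xml", "<Product>a</Product>\n<Product>b</Product>\n")

File.open("out.xml", "w") do |out|
  File.each_line("in.xml") do |line|
    out.puts line.gsub("</Product>", "</Product>!")
  end
end

puts File.read("out.xml").lines.size # => 2
```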

rdp commented 7 years ago

What are you allocating that needs size 35000000 I wonder...


sa-0001 commented 7 years ago

What are you allocating that needs size 35000000 I wonder...

The script processes hundreds of files of around 20-40MB. Incredibly, the same script in Node.js doesn't crash, although its memory usage certainly levels off pretty high.

RX14 commented 7 years ago

@sa-0001 the best solution would be to not load the whole file into memory, but as far as I know Crystal doesn't have any good ways to do replacements on streams the way you can on strings, so I won't advise that.

One thing you could try is GC.free(xml.as(Void*)). Make sure you put it after the last usage of the object, and I still can't guarantee it's safe. Obviously you shouldn't have to do this.

ozra commented 7 years ago

I'm just making a ping here so that I have track on this issue - ignore me B-)

crisward commented 7 years ago

Keeping these memory discussions in one place: I'm finding Crystal never seems to reduce its heap size even when traffic drops; instead the GC seems to run less often. This may be intended behaviour.

This graphed data is a section from yesterday https://www.duodesign.co.uk/memstats/ I've attached an image of the graph as the above link may change in the future.

(screenshot: memory usage graph, 2017-03-02 09:51)

At 13:14 I ran ApacheBench and, as you'd expect, memory usage leapt up. It then stays there until I restart the Crystal process.

RX14 commented 7 years ago

libgc is a non-compacting garbage collector, which buckets allocations into separate heaps based on allocation size. This means the heaps can get fragmented. Also, I'm not sure what the heuristics for returning heap memory to the OS actually are.

Again, if you want an answer from people who know what they're talking about, ask the libgc guys.

@ozra you can subscribe to issue notifications without commenting by clicking here:

crisward commented 7 years ago

https://github.com/ivmai/bdwgc/issues/152#issuecomment-283733454

First, libgc should be built with configure --enable-munmap option. An optional parameter can be specified along with the option - to set the desired memory blocks unmapping threshold (the number of sequential garbage collections for which a candidate block for unmapping should remain free).

Not sure if this is set? If not, it would explain the current behaviour.

After some testing with @sdogruyol's help, we discovered Go seems to release memory to the OS after around 5 minutes, returning to the pre-load state after up to 30 minutes. It would be good if Crystal could mimic this behaviour.

asterite commented 7 years ago

It seems like a good thing to try. Right now we aren't setting that flag. I didn't even know libgc could be configured like that (I thought it would never give the memory back to the OS).

On OSX we are using bdw-gc from homebrew, which doesn't seem to set it either...

I wonder if there's a reason why it's not enabled by default. Why could it be?

crisward commented 7 years ago

The only docs I can find referencing this are here, which makes me wonder if it's a Windows-only option.

crisward commented 7 years ago

I've compiled libgc with that option on mac. I'll try building crystal with it and try some tests.

luislavena commented 7 years ago

@crisward here are docs about it:

https://github.com/ivmai/bdwgc/blob/64bed186fc1ca9a654c9bca063089f5b7ead87be/doc/README.macros#L287-L291

It seems it combines it with mmap:

Causes memory to be returned to the OS under the right circumstances. This currently disables VM-based incremental collection (except for Win32 with GetWriteWatch() available). Works under some Unix, Linux and Windows versions. Requires USE_MMAP except for Windows.

crisward commented 7 years ago

I've got it working.

(screenshot: memory usage graph, 2017-03-02 20:03)
crisward commented 7 years ago

I set it to 3, so it returns memory after 3 GC cycles, i.e.:

brew install automake pkg-config #mac only
git clone https://github.com/ivmai/bdwgc.git
cd bdwgc
git clone https://github.com/ivmai/libatomic_ops.git
autoreconf -vif
automake --add-missing
./configure --enable-munmap=3 
make
make check
sudo make install

If you want to reproduce this on a Mac, you need to run the brew step above first.

luislavena commented 7 years ago

Looks promising @crisward, but perhaps 3 cycles is too short for a long-running process (and might result in slowdowns from freeing and reclaiming these resources to/from the OS during high traffic).

There are also interesting details unrelated to this that also apply to parallelism: by default libgc might fail in multi-threaded environments unless munmap is used (ref). I think that change is long gone, but usage of mmap and possibly parallel marking might be required for thread support.

/cc @asterite

crisward commented 7 years ago

@luislavena you may be right. However, I found that with high memory usage the GC cycle seems to slow down to once every 3 minutes, so 3 would equate to 9 minutes (see my first graph above). Go seems to start returning memory after around 5, so this seems roughly in line. Like anything, I imagine it's going to take some real-world use to find a good balance.

ozra commented 7 years ago

A couple of months ago I did some proof-of-concept work on a set of traits (modules) that would facilitate and sort of "standardize" different methods of pooling / re-using / re-initializing objects: getting a manual-style #free() methodology while still having the safety net of the GC.

I think standardizing such a framework in stdlib is well worth elaborating on - re-using objs and taking some responsibility of saying when one is done with a resource of the machine is prudent and can help immensely with memory usage, avoiding the GC crunching first and foremost.

[ED: the problem back then, making it cleanly "user-coder"-friendly while keeping the user's code DRY, was that the compiler insists on doing the "reduced extent inference" in initializers; something I thought it could skip when distinct types are specified for all ivars (despite it not being able to verify that this actually holds true, for compilation-performance reasons). But I apparently still hadn't grokked the compiler phases fully despite digging through the source over and over, haha. When I have less stress on the plate, I will be back diving into the guts of it again. I love the stuff you've all done on the compiler; it's just so clean compared to others I've hacked on and studied (names omitted)!]

I'm really interested in these aspects; I've made many allocation models over the years for embedded situations etc., and it's a fun puzzle. Of course I'm strapped for time; I haven't even had time to check up on the developments in Crystal over the last few months. But anyhow, those are my two cents.

Obviously we're all eager to jump in and code a whole bunch of precise GCs in the future, right!? :-) If the API for the GC is kept well thought out, there's room for specialized variants optimized for multi-threading, single-threading, throughput, latency, whatever.

But now, I'm sailing away into "someday, maybe" mode - sorry!

@RX14 - OT (meta questions): so clicking that will land it in "participating" automatically, saving me from looking like a fool for ping-posting? B-)

RX14 commented 7 years ago

@ozra yes, it'll keep you notified of the issue.

crisward commented 7 years ago

Any ideas how I'd get a compiled version of bdwgc into the heroku buildpack?

luislavena commented 7 years ago

@crisward there used to be something called vulcan to compile binaries for Heroku. Not sure what the method is nowadays.

But you will need to fork the buildpack to replace the bundled libgc with the one you produced before.

crisward commented 7 years ago

@luislavena Thanks.

crisward commented 7 years ago

Just thought it worth feeding back my findings. I've overloaded a server (in dev mode) with 1000 requests at 50 concurrent connections:

ab -n 1000 -c 50 http://127.0.0.1:3000/

The app uses crystal-mysql and kemal so includes a handful of database queries per connection.

The app starts at 5.8MB of memory (after a bit of clicking around). After the bench it goes up to around 20MB. Shortly after the spike, it releases just under half the memory back (down to 11MB). Then after 3 or 4 GC cycles it starts to release memory back to the OS in ever-decreasing chunks, getting back to 5.8MB after about 1.5 hours, though it's already down to 6.4MB after 30 minutes.

Over the weekend I'm going to try to get this change into a fork of the buildpack so I can try it out on my live server using Dokku.

(screenshot: memory usage graph, 2017-03-03 10:33)

benoist commented 7 years ago

Nice @crisward, this is more in line with what I expected to see with the heap size.

crisward commented 7 years ago

I've looked at the Heroku buildpack, and it looks like it gets a release of Crystal from GitHub which has libgc included in the /embedded/lib folder. libgc needs to be built with the options outlined above and added to that release.

The buildpack currently uses

https://github.com/crystal-lang/crystal/releases/download/0.21.0/crystal-0.21.0-1-linux-x86_64.tar.gz

Who would be able to do this?

luislavena commented 7 years ago

@crisward that would be the Manas team, which is why I suggested you fork the buildpack and introduce your own copy in the compile script (to replace the one from the package).

You can download another file and replace it, then point your app to your custom buildpack, and it should work.

luislavena commented 7 years ago

(I mean: your own copy of the libgc.a that you uploaded somewhere accessible by the buildpack/heroku build step)

benoist commented 7 years ago

I think it needs to be added here https://github.com/crystal-lang/omnibus-crystal/blob/master/config/software/bdw-gc.rb

asterite commented 7 years ago

Should we also send a PR to homebrew's formula to fix this?

luislavena commented 7 years ago

@asterite isn't that too early? From the comment in bdwgc it might not be ready for prime time.

asterite commented 7 years ago

@luislavena Oh, I didn't know he replied to me :-)

So for now I don't think there's much we can do here other than wait, right?

luislavena commented 7 years ago

I think it needs more testing. I'm going to update my local build, do a bit of that this weekend, and report back.

I might recommend a custom build of libgc.a: upload it somewhere like S3 and then fork the buildpack to patch the embedded one; then anyone can deploy the updated version to Heroku without having to compile everything.

crisward commented 7 years ago

I'll fork crystal, the buildpack etc and give it a test. If this works, we could perhaps have an opt-in experimental buildpack release?

luislavena commented 7 years ago

@crisward not sure forking Crystal is required; just compile a new version of libgc.a, either using the omnibus-crystal recipes or on Heroku itself (heroku run bash), and upload the new libgc.a somewhere so you can replace the one from the package.

crisward commented 7 years ago

Will do, thanks.

crisward commented 7 years ago

I've patched the buildpack with a re-compiled libgc.a and it's working well.

crisward commented 7 years ago

I've got the GC returning memory to the OS. However, after leaving it running for over 12 hours overnight, the GC stops collecting until it has almost exhausted available space, then starts GC cycles again.

It's running inside a Docker container so it's limited to 100MB; it's using around 70MB.

GC is only running about every 30 minutes now. I've set up my app so I can manually trigger a GC cycle, and memory usage comes down when I do. But it then goes back up, and GC doesn't run again until memory is high.

There is a global variable described here, https://github.com/ivmai/bdwgc#the-c-interface-to-the-allocator, called GC_free_space_divisor, which seems to allow the GC to be tuned to favour cycles over memory use. It would be good if this were exposed via the GC interface.

Anyone have ideas how I'd go about doing this? I'd like to feed it different values to see how it affects my application's memory consumption, ideally at runtime.

crisward commented 7 years ago

I'm doing this by changing the setting from 3 to 4 (I think). Not sure if the value is a reference or a copy? Hoping it's a reference.

lib LibGC
  $free_space_divisor = GC_free_space_divisor : UInt64
end

LibGC.free_space_divisor = 4_u64
crisward commented 7 years ago

I've done two tests now, over two 14-hour periods, the first before the change to 4. In both cases, after around 12 hours I'm getting a spike in memory: from 5MB to 80MB with the divisor at 3, and from 4MB to 40MB at 4.

Not sure what's causing the spike. The first was at 7:24am, the second at 8:28am the next day. I've checked the logs and there was almost no activity at all. Right now I could do with some way of listing memory allocations... I'm guessing there is no way of doing this. Also, taking 12 hours to reproduce is a real pain.

ozra commented 7 years ago

Hack the allocation funcs in Crystal so they count allocation stats per class in memory, which can be dumped by signalling the app? Preallocate a big fat chunk of memory for the stats array (maybe via mmap to keep it outside the GC too, if you want), so it stays static from start all the way through...
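A rough sketch of that idea (all names hypothetical, and without the mmap'd static chunk): bump a per-class counter on every initialize and dump the table on demand, e.g. from a Signal handler:

```crystal
# Hypothetical sketch: count allocations per class by incrementing a
# counter in initialize, and dump the stats on demand.
module AllocStats
  COUNTS = Hash(String, Int64).new(0_i64)

  def self.record(name : String)
    COUNTS[name] += 1
  end

  def self.dump
    COUNTS.each { |name, n| puts "#{name}: #{n}" }
  end
end

class Foo
  def initialize
    AllocStats.record(self.class.name)
  end
end

10.times { Foo.new }
AllocStats.dump # prints "Foo: 10"
```

Instrumenting the real allocation functions would catch every class without touching user code, but even this per-class opt-in version would show which types dominate allocations.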

crisward commented 7 years ago

Found the cause of the spike. https://github.com/kemalcr/kemal/issues/323 fixed in kemal.

Using the flag in the GC is working well. I don't think the LibGC.free_space_divisor change is necessary.

Now that I've fixed that bug I'll monitor memory for a week and let you know, but so far it's returning memory to the OS when it can and holding at around 3MB.