brettwooldridge / NuProcess

Low-overhead, non-blocking I/O, external Process implementation for Java
Apache License 2.0

Don't depend on the GC to free native memory on Linux #151

Open avrecko opened 5 months ago

avrecko commented 5 months ago

JNA frees the 192 KiB of per-NuProcess native buffers only during GC finalizer / reference-queue processing. In setups with very low GC activity this is essentially a memory leak, potentially causing native memory issues.
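For context, a minimal sketch of the behaviour being described (assuming JNA 5.12+, where Memory implements Closeable; the 64 KiB size just mirrors one of NuProcess's per-stream buffers):

    import com.sun.jna.Memory;

    public class JnaFreeSketch {
        public static void main(String[] args) throws Exception {
            // Native (malloc'd) allocation managed by JNA.
            Memory buffer = new Memory(64 * 1024);
            buffer = null; // NOT freed here; JNA frees it only once GC/reference processing runs

            // Deterministic release without waiting for the GC:
            try (Memory explicit = new Memory(64 * 1024)) {
                // ... use the buffer ...
            } // native memory freed here via Memory#close
        }
    }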

avrecko commented 5 months ago

@bturner What are your thoughts on explicitly freeing the per-process EpollEvent when onExit() is called? The complication is that some synchronization guards are needed to avoid API misuse that could cause a SegFault.

E.g. calling process.start().waitFor(0, TimeUnit.DAYS) and then process.wantWrite() can cause a SegFault in a rare case (easy to reproduce) if the EpollEvent native memory is freed in onExit and no additional guards are put in place in queueWrite.
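A hedged sketch of that sequence (the handler and command are placeholders; the crash would only happen if onExit freed the event and queueWrite had no guard):

    import java.util.concurrent.TimeUnit;
    import com.zaxxer.nuprocess.*;

    // 'handler' stands in for some NuProcessHandler implementation.
    static void reproduce(NuProcessHandler handler) throws InterruptedException {
        NuProcessBuilder builder = new NuProcessBuilder(handler, "cat");
        NuProcess process = builder.start();
        process.waitFor(0, TimeUnit.DAYS);  // returns once the process has exited
        // If onExit() had already freed the per-process EpollEvent, the next call
        // would touch freed native memory unless queueWrite adds a guard:
        process.wantWrite();
    }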

avrecko commented 5 months ago

Also, what are your thoughts on making the buffer configurable? Like a global setting?

bturner commented 5 months ago

What are your thoughts on explicitly freeing the per-process EpollEvent when onExit() is called?

For this review, I'd leave it out. Synchronization brings in its own costs that need to be balanced. I'm not saying we shouldn't do it, but rather that it's better to separate "contentious" changes from the simple ones that can be readily merged. That way I can accept this pull request and do a new release with some improvements, and then we can have a more focused discussion about further improvements and the trade-offs they entail.

Also, what are your thoughts on making the buffer configurable? Like a global setting?

I'm not sure what buffer you're referring to. The size for the stderr, stdout and stdin buffers? Something else?

avrecko commented 5 months ago

Ok, agreed. I'll put the explicit close of the per-process EpollEvent in another pull request.

Yeah, I saw the offset handling for the epoll struct. Wow. I didn't imagine JNA was inefficient enough to merit doing it yourself with offsets. Not very convenient then.

FWIW: in some simple testing I'm also not seeing any benefit from polling for 20 events instead of 1. I do consistently see 2 events returned. But anyway, for now I left the poll alone.

For the configuration, I meant setting NuProcess#BUFFER_CAPACITY. For my use case, I think 4K would work just fine.

I'll clean up the code and let you know when it is ready to be reviewed again.


bturner commented 5 months ago

FWIW: in some simple testing I'm also not seeing any benefit from polling for 20 events instead of 1. I do consistently see 2 events returned. But anyway, for now I left the poll alone.

That matches what I saw before. I could see >1 event being returned, but performance and profiling characteristics were identical either way. (Polling for more can reduce the number of syscalls, but since handler processing dominates the performance so strongly, taking away milliseconds (or, more likely, nanoseconds) of syscall execution simply doesn't move the needle.) Based on that, I think leaving it at a single poll is probably fine (and keeps the code simpler).

For the configuration, I meant setting NuProcess#BUFFER_CAPACITY. For my use case, I think 4K would work just fine.

Brett and I had previously at least touched on the idea of making that configurable (see here). A system property approach would be pretty straightforward, so it might be a good place to start. I've thought before about what might be involved in making it more readily runtime-configurable (e.g. so that, when you start a process, you can request a buffer size as part of that, because not all processes have the same I/O needs), but haven't actually sat down to try and do it.
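A rough sketch of how the system-property approach could look (the property name is hypothetical; 65536 mirrors the current default):

    public interface NuProcess {
        // Read once at class-load time; falls back to the current 64 KiB default.
        int BUFFER_CAPACITY = Integer.getInteger("com.zaxxer.nuprocess.bufferCapacity", 65536);

        // ...existing methods unchanged...
    }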

avrecko commented 5 months ago

I have the same thoughts about setting the out/err/in buffer sizes. I'll prepare another pull request for the system property approach; I also think it is a good place to start. For more fine-grained control, the process builder is probably the place to add it, but that may be unnecessary.

avrecko commented 5 months ago

@bturner I think this is it. I updated JNA to get the #close method, the out/err/in buffers are now closed, and I added the tweak to reuse the IntByReference pointer in the epoll/kqueue code.

Btw: wow, the kqueue path is using the KEvent Structure, and it is indeed heavy. I expected some bespoke code generation from JNA. Not even close.

I still have my eye on the per-process Event allocation that isn't freed right away. I'll think of something.

avrecko commented 5 months ago

For the per-process event: I know you'd prefer to do this in a different PR, but I think I came up with a solution that is simple enough.

The Event is needed by registerProcess() and by wantWrite() and writeStdin(ByteBuffer).

For the registration: we can allocate the event and free it right away inside the registration method. This way the per-process event is not needed, and if you don't do any stdin work, no further allocations are necessary.
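Roughly like this (a sketch only — the epoll_ctl call, constants, and field names are assumptions based on this discussion, not the actual ProcessEpoll code):

    // Allocate a short-lived EpollEvent, register the fd, then free it immediately
    // instead of keeping a per-process instance alive until GC.
    private void registerProcess(int stdoutFd) {
        EpollEvent event = new EpollEvent();
        try {
            event.setEvents(LibEpoll.EPOLLIN);
            event.setFileDescriptor(stdoutFd);
            LibEpoll.epoll_ctl(epollFd, LibEpoll.EPOLL_CTL_ADD, stdoutFd, event.getPointer());
        } finally {
            event.close(); // release the native memory right away
        }
    }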

For stdin handling: I prefer on-demand init of the event, which is then used by wantWrite and writeStdin. Technically multiple threads can call wantWrite and writeStdin, but the race is not an issue since all threads are setting the same values, which are written atomically (on x86 at least):

      event.setEvents(LibEpoll.EPOLLOUT | LibEpoll.EPOLLONESHOT | LibEpoll.EPOLLRDHUP | LibEpoll.EPOLLHUP);
      event.setFileDescriptor(stdin);

If 2 threads were calling writeStdin and the fields were 64 bits, a tear could occur, but this is effectively thread safe (on platforms where 32-bit writes are atomic). I see you use ConcurrentLinkedQueue<ByteBuffer>, so I assume concurrent access to writeStdin is expected.

Apart from the on-demand per-process init of the Event for wantWrite and writeStdin, I also added closeStdinEpollEvent for immediate release (if you don't want to wait on the GC). This way control of the Event is with the caller, since we cannot be sure that no further wantWrite or writeStdin will be done on an exited process, so we cannot use onExit to close the stdin event.

Let me know what you think. As mentioned, I want all of the malloc'd memory to be explicitly freed when the process exits. With these changes this is achieved for start() and also for run.

Aside: I expected run to return an int.

bturner commented 5 months ago

That approach doesn't really make sense to me, unfortunately.

For point 1, if we're not massively concerned about allocating a few EpollEvents (and, so long as they're explicitly released, I'm not sure we need to be), then I'm not sure why you wouldn't make a change in ProcessEpoll::queueWrite similar to what you made in registerProcess and "just" have queueWrite create a short-lived EpollEvent, use it to register and then close it.

For point 2, if we don't want to allocate an EpollEvent in each queueWrite call, I would revert your changes to registerProcess, so it uses the single EpollEvent, which would continue to be allocated during LinuxProcess creation, for both registration and writes, and I'd replace closeStdinEpollEvent with a more generic releaseNative or cleanup. This leads to no more allocations than the code currently has, while still giving callers a mechanism to "force" cleanup rather than waiting.

If we're going to add a new method, I don't think it's "correct" to add it just to LinuxProcess. Callers shouldn't be casting to the specific process classes; they should be using NuProcess. Based on that, I'd add whatever method we add for this to that interface. (It can't have a default implementation since we still support Java 7, but the existing classes can add no-op implementations.)
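As a sketch, that interface-level addition might look like this (releaseNative is the name floated in this thread, not an existing NuProcess method):

    public interface NuProcess {
        // ...existing methods unchanged...

        /** Eagerly frees this process's native allocations instead of waiting for the GC. */
        void releaseNative();
    }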

Personally, at this point, I'm inclined to go with "just" allocating EpollEvents in queueWrite and closing them, or just leave the code as-is. Adding new "cleanup" handling through NuProcess feels like a really heavy solution to a "problem" I'm still not sure I understand. (Read: Optimizing for this feels like a bit of a waste of time, to me. JNA will free the native memory; it's just not the sort of deterministic schedule you personally would like it to be on. At 12 bytes per EpollEvent, you'd need to have allocated tens of thousands of them for the memory usage to add up to anything significant--and you'd need JNA to have not freed any of them on top of that. In multiple years of working on Bitbucket Server with it using NuProcess under the hood, we never once experienced internally or received a report externally of an instance of an OOM error related to JNA's native memory handling--and that's with the 192KB of buffers not being explicitly freed on top of the 12 bytes for an EpollEvent, despite forking millions of git processes over the JVM's lifetime.)

avrecko commented 5 months ago

@bturner If you are cool with allocating/freeing per queueWrite this simplifies things greatly.

I think JNA, and JNI too, does some small allocations per method call anyway, so it's probably not a big deal if we do as well.

avrecko commented 5 months ago

I've updated the title to be less generic and updated the first comment of this PR.

I've updated the code to reflect the latest discussion.

avrecko commented 5 months ago

@bturner I left the run use case alone. We could also add an explicit free for the run use case, but that would require adding a method, probably releaseNative, to the processor. I actually added this for Linux, but then it makes more sense to add it at the top level and implement it for all three OSes, so I reverted the change.

Leaving the run use case as is. Allocate-and-free per method call for registerProcess and queueWrite works for me. We can leave the freeing of the per-processor JNA fields alone, as they don't cause issues with GC for the start use case; they are per classloader for the start use case, which works fine for me.

To reiterate the problem I am solving: in some GC setups it might take hours between GC cycles that trigger reference processing. In that time a non-negligible amount of native memory accumulates, which causes native memory issues.

See: https://mail.openjdk.org/pipermail/zgc-dev/2023-August/001260.html