SwellRT / swellrt

SwellRT main project. Server, JavaScript and Java clients
http://swellrt.org/
Apache License 2.0

Service instable #243

Open emcodemall opened 6 years ago

emcodemall commented 6 years ago

Hey,

In my application, 2 editors and 3 viewers are connected to the same document simultaneously. It looks like after the 2 editors have worked for about 2 hours, inputting about 500 lines, the server application stops responding a few hours later. I have been working on this for some months now and got a really powerful server: 8 cores, 12 GB RAM and an SSD. With the new server, stability improved from about 1-2 hours of concurrent usage to 3 hours of concurrent usage plus 6 hours of running idle after the usage is over.

What I wonder is: how do you run this for jetpad.net? Do you periodically restart the server, or use a special version? ...I just checked out the latest version that was tagged with "version bump" (as there is no "release" AFAIK).

Cheers! Harald

pablojan commented 6 years ago

Hi Harald, thanks again for struggling with this matter.

After reading your old comments weeks ago, I could only test the stability of the server while it is idle. I didn't see any deterioration of heap or CPU usage in the JVM.

Jetpad is rebooted frequently by the operators, so I haven't seen the issue you are pointing out, but of course it does happen from time to time.

I really would like to help you tackle this issue. First I would monitor the server's JVM using jconsole. Could you do that? I suspect that there is a memory leak and the heap gets exhausted.

In addition, it is very important to tweak the server's thread configuration. Are you using the default values? See the "threads" section of the config/reference.conf file.
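
For illustration only, an override in config/wave.conf could look roughly like this. The key names below are placeholders (they mirror Apache Wave's executor thread counts) and must be replaced with the exact keys from the "threads" section of your config/reference.conf:

```hocon
# config/wave.conf overrides config/reference.conf (HOCON)
# NOTE: placeholder key names -- copy the exact keys from the "threads"
# section of config/reference.conf before using this.
threads {
  # workers handling incoming client operations
  listener_executor_thread_count = 8
  # workers loading wavelets from storage
  wavelet_load_executor_thread_count = 4
  # workers persisting deltas
  delta_persist_executor_thread_count = 4
}
```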

The latest commit on master has a lot of small fixes and improvements; it should be more stable, but it doesn't include any specific performance changes. The tag is: https://github.com/P2Pvalue/swellrt/releases/tag/2.0.0-beta

We could discuss and work on all this together by chat/conference if you like. If we find the cause I will be happy to patch the server quickly.

1) For remote jconsole monitoring, run the JVM with the following options: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5000 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
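
Once the server is running with those flags you can attach from another machine; a minimal example (the host name is just a placeholder):

```sh
# attach jconsole to the JVM exposing JMX on port 5000
jconsole your-server.example.com:5000
```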

emcodemall commented 6 years ago

Thanks for your prompt response and your willingness to help; now you've got me enthusiastic again too. Sure, we can get in more direct contact, just send me some details at harald.jordan78@gmail.com.

I am currently checking out the suggested version and have added the JMX options to the gradlew.bat "DEFAULT_JVM_OPTS". Compiling the dev version right now and changing my code to support the newer version... Does it matter to you if I compile prod or dev?
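
For reference, the change in gradlew.bat amounts to something like this (the existing value of DEFAULT_JVM_OPTS in your copy of the wrapper may differ):

```bat
@rem gradlew.bat -- add the JMX options to the wrapper's default JVM options
set DEFAULT_JVM_OPTS=-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5000 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
```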

pablojan commented 6 years ago

The prod/dev compile options only matter for the JS client library; they won't impact the server issue.


pablojan commented 6 years ago

Results from a test with 4 users, 2 of them writing continuously with an automatic script (see the swellrt-selenium project):

- Initial server boot, new wave loaded => heap ~90MB
- Test run (~45 min) => heap grows up to ~400MB (see picture 1)
- Server reboot, no waves loaded => heap ~90MB
- Wave reloaded => heap 300MB, then GC adjusts to ~200MB (the wave's in-memory size is ~100MB)

Picture 1: heap memory during test

Picture 2: heap memory after test end, server reboot and wave loaded again

pablojan commented 6 years ago

I have created a new tag with some improvements regarding memory consumption: https://github.com/SwellRT/swellrt/releases/tag/2.0.1-beta

- Fix deltas-in-memory collection bug
- Disable user presence tracking by default (the user presence feature can make heavy use of the transient wavelet)
- Configurable user presence event rate
- Safer rate control of caret update events (caret update events were making heavy use of the transient wavelet)
- Properly clean the deltas cache (cached deltas in memory were not being flushed after being persisted)
- Store transient data in the db to reduce memory use

Repeating the previous test, I got the following results (I shortened the test time since the results were already obvious...)

(Screenshot, 2018-09-02: heap memory during the repeated test)

emcodemall commented 6 years ago

Hey Pablo,

thanks a lot for your efforts! As you know, I am not a Java/JVM expert, but I'll try to interpret your data (thinking technically):

Analysis of "before": memory consumption was growing by about 250MB in one hour, with GC frequency speeding up over time.

Analysis of "after": memory consumption was growing by about 20-30MB in 20 minutes, with GC frequency speeding up over time.

Please correct me if I am wrong, but my feeling is that there might still be one or more memory leaks.

Should I perhaps run a test doing the same as you, then stay idle for about a day, and afterwards check what we still have in memory compared to the start? ...or similar?
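
If it helps, one way to take comparable snapshots at the start and after the idle day (jps and jmap ship with the JDK; <pid> is whatever jps reports for the SwellRT server):

```sh
# find the server's JVM process id
jps -l
# dump a histogram of live heap objects; repeat after the idle period and
# diff the top entries to see which classes keep growing
jmap -histo:live <pid> > heap-histo-$(date +%Y%m%d-%H%M).txt
```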

Cheers! Harry


pablojan commented 6 years ago

Sure, please run some tests; one idle and one with activity would be ideal.

Let's see how the heap behaves then. In any case, the wave server holds in-memory snapshots of each wave (swell objects), so their size must increase over time as users write text...

Anyway, working on this is a good exercise for optimizing the server.

Cheers


emcodemall commented 6 years ago

Hi!

Using the dev build that you configured about a week ago not to send any annotations, I observed the latest service dysfunction (clients disconnected, Java process with high memory and CPU usage) with this process:

![grafik](https://user-images.githubusercontent.com/34220041/47446236-1c646d80-d7bb-11e8-8242-1eb118eba3cf.png)

To be honest, I am a little confused by all the Java processes. In the screenshot you can see 4 instances of the SwellRT server at the bottom (-Dorg.gradle.appname=gradlew), above them 2 processes running "Djava.security.auth...", and above those 2 processes running "Dcom.sun.management.jmx...". I am not sure why there is not one jmx and one Djava.security... process for each of the swell instances, since the config was copied.

2 swell instances were configured with JMX, 2 were not. The bad news is that one swell instance that did not have JMX enabled also showed high memory and CPU usage and disconnected clients after 1 hour of usage. Unfortunately I needed to restart the processes immediately for production, so I could not collect any more evidence.

This was done with 2 editors and 2 viewers online on the same document concurrently.

At the time the clients disconnect, it looks like only the Java process with JMX has high memory and CPU usage. I am also not sure whether the actual problem was with the JMX process itself or something else in the swell process causing the misbehaviour seen via JMX.

Anyway, I'll keep trying to collect evidence. I'll also try to disable JMX so there is no extra Java instance running. It would be cool to save even more memory and also get rid of the Gradle instance.

pablojan commented 6 years ago

I can't see the entire command string, so it is a bit misleading. But how many server instances are you running? Do you verify that the JVM process is really killed every time you stop (Ctrl+C) the ./gradlew run command? The OS could leave the process in a zombie state.
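
A quick way to check both things (jps is part of the JDK and prints the full VM arguments, which would also show the entire command string):

```sh
# list all running JVMs with main class, main-method args and VM arguments
jps -lvm
# on Linux, the same information without JDK tools
ps -ef | grep '[j]ava'
```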

If you want to avoid using gradle to run the server, you must build a tar/zip distribution file with the command

./gradlew createDistBinTar

The generated file is placed in the distributions/ folder. Extract the file and use the run-server.sh or run-server.bat script to start the server. In these scripts you can enable or disable JMX monitoring and/or remote debugging.
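
Roughly, assuming a Unix shell (the archive name below is an assumption; use the file actually produced in the distributions/ folder):

```sh
# build the binary distribution
./gradlew createDistBinTar
# unpack it -- replace the archive name with the one generated for your version
tar -xf distributions/swellrt-*.tar
cd swellrt-*/
# start the server; edit this script to enable/disable JMX and remote debugging
./run-server.sh
```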

Remember to edit the configuration in config/wave.conf based on config/reference.conf.
