IQSS / dataverse

Open source research data repository software
http://dataverse.org
Other
879 stars 486 forks source link

Connection Pool: Glassfish is running out of db pool connections in production. #2595

Closed kcondon closed 8 years ago

kcondon commented 9 years ago

This ticket is about investigating, better tuning, and offering guidance in our doc for other installations, the db connection pool running out of db connections issue.

With v4.2, we've seen glassfish run out of db connections. At first this seemed related to the intensive index all operation but we've seen it in normal operation too.

Various links on tuning connection pools indicate this may be normal since db pool use is specific to a particular app, load, etc and needs to be adjusted to your circumstances by testing and monitoring.

There is also a possibility of a connection leak that can also be detected through monitoring.

One thought is now that we provide batch operations for editing files, this may result in more concurrent db connections.

http://blog.c2b2.co.uk/2014/02/an-introduction-to-connection-pools-in.html http://docs.oracle.com/cd/E18930_01/html/821-2431/abehp.html http://docs.oracle.com/cd/E19879-01/820-4335/ablip/index.html http://stackoverflow.com/questions/26772343/jdbc-connection-pools-in-glassfish-server

pdurbin commented 8 years ago

With v4.2, we've seen glassfish run out of db connections

For reference, here's what this looks like in server.log:

[2015-09-28T06:24:32.750-0400] [glassfish 4.1] [WARNING] [poolmgr.get_connection_failure] [javax.enterprise.resource.resourceadapter.com.sun.enterprise.connectors] [tid: _ThreadID=297 _ThreadName=jk-connector(10)] [timeMillis: 1443435872750] [levelValue: 900] [[
  RAR5117 : Failed to obtain/create connection from connection pool [ dvnDbPool ]. Reason : com.sun.appserv.connectors.internal.api.PoolingException: In-use connections equal max-pool-size and expired max-wait-time. Cannot allocate more connections.]]

[2015-09-28T06:24:32.753-0400] [glassfish 4.1] [WARNING] [jdbc.exc_get_conn] [javax.enterprise.resource.resourceadapter.com.sun.gjc.spi] [tid: _ThreadID=297 _ThreadName=jk-connector(10)] [timeMillis: 1443435872753] [levelValue: 900] [[
  RAR5114 : Error allocating connection : [Error in allocating a connection. Cause: In-use connections equal max-pool-size and expired max-wait-time. Cannot allocate more connections.]]]

[2015-09-28T06:25:04.261-0400] [glassfish 4.1] [INFO] [] [org.eclipse.persistence.session.file:/opt/dvn/app/glassfish4/glassfish/domains/domain1/applications/dataverse-4.2-92/WEB-INF/classes/_VDCNet-ejbPU] [tid: _ThreadID=297 _ThreadName=jk-connector(10)] [timeMillis: 1443435904261] [levelValue: 800] [[
  Communication failure detected when attempting to perform read query outside of a transaction. Attempting to retry query. Error was: Exception [EclipseLink-4002] (Eclipse Persistence Services - 2.5.2.v20140319-9ad6abd): org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: java.sql.SQLException: Error in allocating a connection. Cause: In-use connections equal max-pool-size and expired max-wait-time. Cannot allocate more connections.
Error Code: 0
Call: SELECT NAME, CONTENT FROM SETTING WHERE (NAME = ?)
    bind => [1 parameter bound]
Query: ReadObjectQuery(name="readSetting" referenceClass=Setting sql="SELECT NAME, CONTENT FROM SETTING WHERE (NAME = ?)").]]

[2015-09-28T06:27:13.465-0400] [glassfish 4.1] [WARNING] [poolmgr.get_connection_failure] [javax.enterprise.resource.resourceadapter.com.sun.enterprise.connectors] [tid: _ThreadID=298 _ThreadName=jk-connector(11)] [timeMillis: 1443436033465] [levelValue: 900] [[
  RAR5117 : Failed to obtain/create connection from connection pool [ dvnDbPool ]. Reason : com.sun.appserv.connectors.internal.api.PoolingException: In-use connections equal max-pool-size and expired max-wait-time. Cannot allocate more connections.]]

@scolapasta @landreev @kcondon and I met yesterday and we decided we'd like to figure out how to best monitor number of connections available in the pool and document this.

pdurbin commented 8 years ago

I think the value of asadmin get -m server.resources.dvnDbPool.dataverse.numconnused-highwatermark will be of interest in production, for example, but we should also figure out what the current value is (I assume this would be server.resources.dvnDbPool.dataverse.numconnused-current). On my laptop this value is "2". In order to see this I had to run asadmin set server-config.monitoring-service.module-monitoring-levels.jdbc-connection-pool=HIGH. I've been taking some notes about this at https://javabot.evanchooly.com/logs/%23glassfish/2015-10-02 but as I was saying there the best resource for this seems to be https://blogs.oracle.com/JagadishPrasath/entry/monitoring_jdbc_connection_pool_glassfish (even though it's from 2007).

On my laptop, here are some other values I can see:

murphy:~ pdurbin$ /Applications/NetBeans/glassfish4/bin/asadmin get -m "*"
server.resources.dvnDbPool.averageconnwaittime-count = 0
server.resources.dvnDbPool.averageconnwaittime-description = Average wait-time-duration per successful connection request
server.resources.dvnDbPool.averageconnwaittime-lastsampletime = 1443799279727
server.resources.dvnDbPool.averageconnwaittime-name = AverageConnWaitTime
server.resources.dvnDbPool.averageconnwaittime-starttime = 1443798006517
server.resources.dvnDbPool.averageconnwaittime-unit = millisecond
server.resources.dvnDbPool.connrequestwaittime-current = 0
server.resources.dvnDbPool.connrequestwaittime-description = The longest and shortest wait times of connection requests. The current value indicates the wait time of the last request that was serviced by the pool.
server.resources.dvnDbPool.connrequestwaittime-highwatermark = 109
server.resources.dvnDbPool.connrequestwaittime-lastsampletime = 1443799011242
server.resources.dvnDbPool.connrequestwaittime-lowwatermark = 0
server.resources.dvnDbPool.connrequestwaittime-name = ConnRequestWaitTime
server.resources.dvnDbPool.connrequestwaittime-starttime = 1443798006517
server.resources.dvnDbPool.connrequestwaittime-unit = millisecond
server.resources.dvnDbPool.dataverse.dotted-name = server.resources.dvnDbPool.dataverse
server.resources.dvnDbPool.dataverse.numconnacquired-count = 6216
server.resources.dvnDbPool.dataverse.numconnacquired-description = Number of logical connections acquired from the pool.
server.resources.dvnDbPool.dataverse.numconnacquired-lastsampletime = 1443799011242
server.resources.dvnDbPool.dataverse.numconnacquired-name = NumConnAcquired
server.resources.dvnDbPool.dataverse.numconnacquired-starttime = 1443798024035
server.resources.dvnDbPool.dataverse.numconnacquired-unit = count
server.resources.dvnDbPool.dataverse.numconnreleased-count = 6216
server.resources.dvnDbPool.dataverse.numconnreleased-description = Number of logical connections released to the pool.
server.resources.dvnDbPool.dataverse.numconnreleased-lastsampletime = 1443799011243
server.resources.dvnDbPool.dataverse.numconnreleased-name = NumConnReleased
server.resources.dvnDbPool.dataverse.numconnreleased-starttime = 1443798024035
server.resources.dvnDbPool.dataverse.numconnreleased-unit = count
server.resources.dvnDbPool.dataverse.numconnused-current = 0
server.resources.dvnDbPool.dataverse.numconnused-description = Provides connection usage statistics. The total number of connections that are currently being used, as well as information about the maximum number of connections that were used (the high water mark).
server.resources.dvnDbPool.dataverse.numconnused-highwatermark = 2
server.resources.dvnDbPool.dataverse.numconnused-lastsampletime = 1443799011243
server.resources.dvnDbPool.dataverse.numconnused-lowwatermark = 0
server.resources.dvnDbPool.dataverse.numconnused-name = NumConnUsed
server.resources.dvnDbPool.dataverse.numconnused-starttime = 1443798024035
server.resources.dvnDbPool.dataverse.numconnused-unit = count
server.resources.dvnDbPool.dotted-name = server.resources.dvnDbPool
server.resources.dvnDbPool.numconnacquired-count = 6221
server.resources.dvnDbPool.numconnacquired-description = Number of logical connections acquired from the pool.
server.resources.dvnDbPool.numconnacquired-lastsampletime = 1443799011242
server.resources.dvnDbPool.numconnacquired-name = NumConnAcquired
server.resources.dvnDbPool.numconnacquired-starttime = 1443798006517
server.resources.dvnDbPool.numconnacquired-unit = count
server.resources.dvnDbPool.numconncreated-count = 22
server.resources.dvnDbPool.numconncreated-description = The number of physical connections that were created since the last reset.
server.resources.dvnDbPool.numconncreated-lastsampletime = 1443798908065
server.resources.dvnDbPool.numconncreated-name = NumConnCreated
server.resources.dvnDbPool.numconncreated-starttime = 1443798006517
server.resources.dvnDbPool.numconncreated-unit = count
server.resources.dvnDbPool.numconndestroyed-count = 14
server.resources.dvnDbPool.numconndestroyed-description = Number of physical connections that were destroyed since the last reset.
server.resources.dvnDbPool.numconndestroyed-lastsampletime = 1443798908058
server.resources.dvnDbPool.numconndestroyed-name = NumConnDestroyed
server.resources.dvnDbPool.numconndestroyed-starttime = 1443798006517
server.resources.dvnDbPool.numconndestroyed-unit = count
server.resources.dvnDbPool.numconnfailedvalidation-count = 0
server.resources.dvnDbPool.numconnfailedvalidation-description = The total number of connections in the connection pool that failed validation from the start time until the last sample time.
server.resources.dvnDbPool.numconnfailedvalidation-lastsampletime = -1
server.resources.dvnDbPool.numconnfailedvalidation-name = NumConnFailedValidation
server.resources.dvnDbPool.numconnfailedvalidation-starttime = 1443798006517
server.resources.dvnDbPool.numconnfailedvalidation-unit = count
server.resources.dvnDbPool.numconnfree-current = 8
server.resources.dvnDbPool.numconnfree-description = The total number of free connections in the pool as of the last sampling.
server.resources.dvnDbPool.numconnfree-highwatermark = 8
server.resources.dvnDbPool.numconnfree-lastsampletime = 1443799011243
server.resources.dvnDbPool.numconnfree-lowwatermark = 0
server.resources.dvnDbPool.numconnfree-name = NumConnFree
server.resources.dvnDbPool.numconnfree-starttime = 1443798006509
server.resources.dvnDbPool.numconnfree-unit = count
server.resources.dvnDbPool.numconnnotsuccessfullymatched-count = 0
server.resources.dvnDbPool.numconnnotsuccessfullymatched-description = Number of connections rejected during matching
server.resources.dvnDbPool.numconnnotsuccessfullymatched-lastsampletime = -1
server.resources.dvnDbPool.numconnnotsuccessfullymatched-name = NumConnNotSuccessfullyMatched
server.resources.dvnDbPool.numconnnotsuccessfullymatched-starttime = 1443798006517
server.resources.dvnDbPool.numconnnotsuccessfullymatched-unit = count
server.resources.dvnDbPool.numconnreleased-count = 6221
server.resources.dvnDbPool.numconnreleased-description = Number of logical connections released to the pool.
server.resources.dvnDbPool.numconnreleased-lastsampletime = 1443799011243
server.resources.dvnDbPool.numconnreleased-name = NumConnReleased
server.resources.dvnDbPool.numconnreleased-starttime = 1443798006517
server.resources.dvnDbPool.numconnreleased-unit = count
server.resources.dvnDbPool.numconnsuccessfullymatched-count = 0
server.resources.dvnDbPool.numconnsuccessfullymatched-description = Number of connections succesfully matched
server.resources.dvnDbPool.numconnsuccessfullymatched-lastsampletime = -1
server.resources.dvnDbPool.numconnsuccessfullymatched-name = NumConnSuccessfullyMatched
server.resources.dvnDbPool.numconnsuccessfullymatched-starttime = 1443798006517
server.resources.dvnDbPool.numconnsuccessfullymatched-unit = count
server.resources.dvnDbPool.numconntimedout-count = 0
server.resources.dvnDbPool.numconntimedout-description = The total number of connections in the pool that timed out between the start time and the last sample time.
server.resources.dvnDbPool.numconntimedout-lastsampletime = -1
server.resources.dvnDbPool.numconntimedout-name = NumConnTimedOut
server.resources.dvnDbPool.numconntimedout-starttime = 1443798006517
server.resources.dvnDbPool.numconntimedout-unit = count
server.resources.dvnDbPool.numconnused-current = 0
server.resources.dvnDbPool.numconnused-description = Provides connection usage statistics. The total number of connections that are currently being used, as well as information about the maximum number of connections that were used (the high water mark).
server.resources.dvnDbPool.numconnused-highwatermark = 2
server.resources.dvnDbPool.numconnused-lastsampletime = 1443799011243
server.resources.dvnDbPool.numconnused-lowwatermark = 0
server.resources.dvnDbPool.numconnused-name = NumConnUsed
server.resources.dvnDbPool.numconnused-starttime = 1443798006509
server.resources.dvnDbPool.numconnused-unit = count
server.resources.dvnDbPool.numpotentialconnleak-count = 0
server.resources.dvnDbPool.numpotentialconnleak-description = Number of potential connection leaks
server.resources.dvnDbPool.numpotentialconnleak-lastsampletime = -1
server.resources.dvnDbPool.numpotentialconnleak-name = NumPotentialConnLeak
server.resources.dvnDbPool.numpotentialconnleak-starttime = 1443798006517
server.resources.dvnDbPool.numpotentialconnleak-unit = count
server.resources.dvnDbPool.waitqueuelength-count = 0
server.resources.dvnDbPool.waitqueuelength-description = Number of connection requests in the queue waiting to be serviced.
server.resources.dvnDbPool.waitqueuelength-lastsampletime = -1
server.resources.dvnDbPool.waitqueuelength-name = WaitQueueLength
server.resources.dvnDbPool.waitqueuelength-starttime = 1443798006517
server.resources.dvnDbPool.waitqueuelength-unit = count

Command get executed successfully.
murphy:~ pdurbin$ 

Here are other things we can apparently monitor:

murphy:~ pdurbin$ /Applications/NetBeans/glassfish4/bin/asadmin get server-config.monitoring-service.*
server-config.monitoring-service.module-monitoring-levels.cloud=OFF
server-config.monitoring-service.module-monitoring-levels.cloud-elasticity=OFF
server-config.monitoring-service.module-monitoring-levels.cloud-orchestrator=OFF
server-config.monitoring-service.module-monitoring-levels.cloud-tenant-manager=OFF
server-config.monitoring-service.module-monitoring-levels.cloud-virt-assembly-service=OFF
server-config.monitoring-service.module-monitoring-levels.connector-connection-pool=OFF
server-config.monitoring-service.module-monitoring-levels.connector-service=OFF
server-config.monitoring-service.module-monitoring-levels.deployment=OFF
server-config.monitoring-service.module-monitoring-levels.ejb-container=OFF
server-config.monitoring-service.module-monitoring-levels.http-service=OFF
server-config.monitoring-service.module-monitoring-levels.jdbc-connection-pool=HIGH
server-config.monitoring-service.module-monitoring-levels.jersey=OFF
server-config.monitoring-service.module-monitoring-levels.jms-service=OFF
server-config.monitoring-service.module-monitoring-levels.jpa=OFF
server-config.monitoring-service.module-monitoring-levels.jvm=OFF
server-config.monitoring-service.module-monitoring-levels.orb=OFF
server-config.monitoring-service.module-monitoring-levels.security=OFF
server-config.monitoring-service.module-monitoring-levels.thread-pool=OFF
server-config.monitoring-service.module-monitoring-levels.transaction-service=OFF
server-config.monitoring-service.module-monitoring-levels.web-container=OFF
server-config.monitoring-service.module-monitoring-levels.web-services-container=OFF
server-config.monitoring-service.dtrace-enabled=false
server-config.monitoring-service.mbean-enabled=true
server-config.monitoring-service.monitoring-enabled=true
Command get executed successfully.
murphy:~ pdurbin$ 

http://www.slideshare.net/bhaktiks/50-tips50minutes is also helpful for monitoring tips.

pdurbin commented 8 years ago

asadmin get -m server.resources.dvnDbPool.dataverse.numconnused-highwatermark

@kcondon and I noticed that rather than "dataverse" in production the deployed app has a version number in it. You can see it with asadmin get -m "*".

We also decided to add this to the Google Doc about improving the guides (thanks again @bencomp ) for people installing Dataverse, which I just did: https://docs.google.com/document/d/1GLky44JbtCNhGuavLXGEP8WPNdjbyHNxalEGjR78MaA/edit?usp=sharing

pdurbin commented 8 years ago

I'm still trying to figure out which values will be of interest in production. I find connrequestwaittime-highwatermark (units are in milliseconds) and dataverse-4\.2-92.numconnused-highwatermark interesting. Both servers have been running for a while and seem more or less fine right now:

kirk:

server.resources.dvnDbPool.connrequestwaittime-highwatermark = 233
server.resources.dvnDbPool.dataverse-4\.2-92.numconnused-highwatermark = 16

picard:

server.resources.dvnDbPool.connrequestwaittime-highwatermark = 535
server.resources.dvnDbPool.dataverse-4\.2-92.numconnused-highwatermark = 22

I found http://alexandru-ersenie.com/2014/02/18/one-step-monitoring-of-key-indicators-in-glassfish-3-1-via-rest/ interesting, which lead me to look a little at the REST API for Glassfish. We can get data out of the following endpoints, for example:

curl -H "Accept: application/json" http://localhost:4848/management/domain/configs/config/server-config/monitoring-service/module-monitoring-levels

curl -H "Accept: application/json" http://localhost:4848/management/domain/resources/jdbc-connection-pool/dvnDbPool

Finally http://lightfish.adam-bien.com seems interesting ("LightFish is an opensource Java EE 7 application which gathers and persists periodically GlassFish's monitoring data") but https://github.com/AdamBien/lightfish/issues/27 makes me unsure if it'll work with Glassfish 4.1 and I'm not sure we can actually deploy it along side out war since our war takes over the root ("/") of the context-root.

Anyway, I'm still exploring but my immediate goal is to capture some useful metric periodically to a file on a regular interval (5 minutes? 15 minutes?) so we can see if any measure such as server.resources.dvnDbPool.dataverse-4\.2-92.numconnused-current start to go haywire before a crash. Strangely though, that value is zero right now for both kirk and picard. Maybe it isn't the right value to measure? Or maybe it goes up suddenly before a crash?

pdurbin commented 8 years ago

Here's how we can get the data out as JSON (this is kirk):

curl -H "Accept: application/json" http://localhost:4848/monitoring/domain/server/resources/dvnDbPool | jq .

{
  "extraProperties": {
    "childResources": {
      "dataverse-4.2-92": "http://localhost:4848/monitoring/domain/server/resources/dvnDbPool/dataverse-4.2-92"
    },
    "entity": {
      "waitqueuelength": {
        "starttime": 1444691946537,
        "name": "WaitQueueLength",
        "unit": "count",
        "description": "Number of connection requests in the queue waiting to be serviced.",
        "lastsampletime": -1,
        "count": 0
      },
      "numpotentialconnleak": {
        "starttime": 1444691946537,
        "name": "NumPotentialConnLeak",
        "unit": "count",
        "description": "Number of potential connection leaks",
        "lastsampletime": -1,
        "count": 0
      },
      "numconnused": {
        "lowwatermark": 0,
        "current": 0,
        "starttime": 1444691946529,
        "name": "NumConnUsed",
        "unit": "count",
        "description": "Provides connection usage statistics. The total number of connections that are currently being used, as well as information about the maximum number of connections that were used (the high water mark).",
        "lastsampletime": 1444759815240,
        "highwatermark": 16
      },
      "numconntimedout": {
        "starttime": 1444691946537,
        "name": "NumConnTimedOut",
        "unit": "count",
        "description": "The total number of connections in the pool that timed out between the start time and the last sample time.",
        "lastsampletime": -1,
        "count": 0
      },
      "numconnsuccessfullymatched": {
        "starttime": 1444691946537,
        "name": "NumConnSuccessfullyMatched",
        "unit": "count",
        "description": "Number of connections succesfully matched",
        "lastsampletime": -1,
        "count": 0
      },
      "numconnreleased": {
        "starttime": 1444691946537,
        "name": "NumConnReleased",
        "unit": "count",
        "description": "Number of logical connections released to the pool.",
        "lastsampletime": 1444759815240,
        "count": 6676890
      },
      "averageconnwaittime": {
        "starttime": 1444691946537,
        "name": "AverageConnWaitTime",
        "unit": "millisecond",
        "description": "Average wait-time-duration per successful connection request",
        "lastsampletime": 1444759826608,
        "count": 0
      },
      "connrequestwaittime": {
        "lowwatermark": 0,
        "current": 0,
        "starttime": 1444691946537,
        "name": "ConnRequestWaitTime",
        "unit": "millisecond",
        "description": "The longest and shortest wait times of connection requests. The current value indicates the wait time of the last request that was serviced by the pool.",
        "lastsampletime": 1444759815239,
        "highwatermark": 233
      },
      "numconnacquired": {
        "starttime": 1444691946537,
        "name": "NumConnAcquired",
        "unit": "count",
        "description": "Number of logical connections acquired from the pool.",
        "lastsampletime": 1444759815239,
        "count": 6676890
      },
      "numconncreated": {
        "starttime": 1444691946537,
        "name": "NumConnCreated",
        "unit": "count",
        "description": "The number of physical connections that were created since the last reset.",
        "lastsampletime": 1444759767409,
        "count": 491
      },
      "numconndestroyed": {
        "starttime": 1444691946537,
        "name": "NumConnDestroyed",
        "unit": "count",
        "description": "Number of physical connections that were destroyed since the last reset.",
        "lastsampletime": 1444759767404,
        "count": 483
      },
      "numconnfailedvalidation": {
        "starttime": 1444691946537,
        "name": "NumConnFailedValidation",
        "unit": "count",
        "description": "The total number of connections in the connection pool that failed validation from the start time until the last sample time.",
        "lastsampletime": -1,
        "count": 0
      },
      "numconnfree": {
        "lowwatermark": 0,
        "current": 8,
        "starttime": 1444691946529,
        "name": "NumConnFree",
        "unit": "count",
        "description": "The total number of free connections in the pool as of the last sampling.",
        "lastsampletime": 1444759815240,
        "highwatermark": 16
      },
      "numconnnotsuccessfullymatched": {
        "starttime": 1444691946537,
        "name": "NumConnNotSuccessfullyMatched",
        "unit": "count",
        "description": "Number of connections rejected during matching",
        "lastsampletime": -1,
        "count": 0
      }
    }
  },
  "exit_code": "SUCCESS",
  "command": "Monitoring Data",
  "message": ""
}

curl -H "Accept: application/json" http://localhost:4848/monitoring/domain/server/resources/dvnDbPool/dataverse-4.2-92 | jq .

{
  "extraProperties": {
    "childResources": {},
    "entity": {
      "numconnused": {
        "lowwatermark": 0,
        "current": 0,
        "starttime": 1444691967411,
        "name": "NumConnUsed",
        "unit": "count",
        "description": "Provides connection usage statistics. The total number of connections that are currently being used, as well as information about the maximum number of connections that were used (the high water mark).",
        "lastsampletime": 1444759830304,
        "highwatermark": 16
      },
      "numconnreleased": {
        "starttime": 1444691967411,
        "name": "NumConnReleased",
        "unit": "count",
        "description": "Number of logical connections released to the pool.",
        "lastsampletime": 1444759830304,
        "count": 6677049
      },
      "numconnacquired": {
        "starttime": 1444691967411,
        "name": "NumConnAcquired",
        "unit": "count",
        "description": "Number of logical connections acquired from the pool.",
        "lastsampletime": 1444759830303,
        "count": 6677049
      }
    }
  },
  "exit_code": "SUCCESS",
  "command": "Monitoring Data",
  "message": ""
}
bencomp commented 8 years ago

Answering my own question: it looks like you do reuse connections, as the numbers of connections acquired and released are much greater than the numbers of connections created and destroyed.

In the admin GUI there are options for lazy association, thread association and leak detection. How did you configure these?

pdurbin commented 8 years ago

In 922e9f5 I wrote a monitor.py script that calls into the Glassfish API and processes the JSON I mentioned at https://github.com/IQSS/dataverse/issues/2595#issuecomment-147800081 to produce three tsv files:

(You can see an example of numconnacquired.tsv in the commit above.)

The first column is "lastsampletime" and I used the "yyyy-mm-dd HH:MM:SS.ssssss" time format mentioned at http://help.plot.ly/date-format-and-time-series/ so that we can simply upload the tsv files to @plotly for some times graphs. I also wrote included in my commit a proof of concept plot.ply script for creating plots with matplotlib. It's ugly, but it's better than nothing:

numconnacquired

My vision for this monitoring is not to write tsv files forever. Ideally, the data instead would be taken into a system like Ganglia, Munin, Nagios or Opsview or whatever the cool kids use these days (Prometheus?).

Anyway, the tsv files are linked from https://apitest.dataverse.org/guides/developers/issues/2595/index.html and the the script runs from cron every five minutes with */5 * * * * cd /usr/local/glassfish4/glassfish/domains/domain1/docroot/guides/developers/issues/2595 && ./monitor.py though in practice, "lastsampletime" indicates that sampling only happens every 15 minutes. I wrote the script so that even if you run it over and over, it won't simply add duplicate lines to the file.

pdurbin commented 8 years ago

@kcondon as we discussed this is something of a "research" issue where the deliverable may be advice in the Installation Guide. I'm passing this issue to you so you can comment on the tuning changes that you and @landreev have been working on. Maybe he can document them "for real" in a guide.

@kcondon if you could also address the question by @bencomp at https://github.com/IQSS/dataverse/issues/2595#issuecomment-147985033 it would be great.

I think the jury is still out on if my monitor.py script is actually collecting useful data. In INC01464301 I'll continue to work on using it in production.

kcondon commented 8 years ago

In addition to monitoring, we have increased the max db pool connections to be 64 on each web server in domain.xml:

and tuned postgres connection and memory parameters in postgresql.conf: max_connections = 200 shared_buffers = 10GB work_mem = 24MB Note that the postgres settings are based on combined web servers for connections, memory settings are based on system memory available to postgres and work_mem is per connection/per call so these need to fit in available memory and may not be generally applicable. After making these changes we have not seen the error allocating connections error, since 10/6 on one host and 10/10 on the other and highwater for db pool conns is 41 and 46 on each host. The default max was 32 and would account for the allocation error. Closing.