Prometheus Exporter for Jira cause jira to become unresponsive?

AntonPakholchuk commented 5 years ago

Hi Andrey, at some point Jira became unresponsive to user requests but remained running. Thread dump analysis showed the following.

The thread dumps and CPU readings show only one thread actively doing work; most of the other threads are blocked in the WAITING (parking) state. This is the thread from the top output:

`"http-nio-8080-exec-136 url:/plugins/servlet/gadgets/ifr username:user_x " #333526 daemon prio=5 os_prio=0 tid=0x00007fd44803a800 nid=0x112dc runnable [0x00007fd1c07ef000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
    - locked <0x00007fd7210085f0> (a java.lang.Object)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    - locked <0x00007fd72100a6c0> (a sun.security.ssl.AppInputStream)`

The thread that is RUNNABLE across the entire capture period is a pool thread (a background daemon, not a user service thread):

`"pool-11-thread-1" #90 prio=1 os_prio=0 tid=0x00007fd4e02dd800 nid=0x4739 runnable [0x00007fd495a45000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.fs.UnixNativeDispatcher.readdir(Native Method)
    at sun.nio.fs.UnixDirectoryStream$UnixDirectoryIterator.readNextEntry(UnixDirectoryStream.java:168)
    at sun.nio.fs.UnixDirectoryStream$UnixDirectoryIterator.hasNext(UnixDirectoryStream.java:201)
    - locked <0x00007fd981e0c538> (a sun.nio.fs.UnixDirectoryStream$UnixDirectoryIterator)
    at java.nio.file.FileTreeWalker.next(FileTreeWalker.java:348)
    at java.nio.file.Files.walkFileTree(Files.java:2706)
    at java.nio.file.Files.walkFileTree(Files.java:2742)
    at ru.andreymarkelov.atlas.plugins.promjiraexporter.service.ScheduledMetricEvaluatorImpl.calculateAttachmentSize(ScheduledMetricEvaluatorImpl.java:118)
    at ru.andreymarkelov.atlas.plugins.promjiraexporter.service.ScheduledMetricEvaluatorImpl.access$100(ScheduledMetricEvaluatorImpl.java:28)
    at ru.andreymarkelov.atlas.plugins.promjiraexporter.service.ScheduledMetricEvaluatorImpl$2.run(ScheduledMetricEvaluatorImpl.java:108)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)`

It appears to be monitoring or iterating over a directory structure, and it belongs to the Jira Exporter plugin. Additionally, there is only one RUNNABLE thread servicing a user request:

"http-nio-8080-exec-136 url:/plugins/servlet/gadgets/ifr username:user_x " #333526 daemon prio=5 os_prio=0 tid=0x00007fd44803a800 nid=0x112dc runnable [0x00007fd1c07ef000]

The thread dump shows the locks held by this thread while it waits for data:

`"http-nio-8080-exec-136 url:/plugins/servlet/gadgets/ifr username:user_x " #333526 daemon prio=5 os_prio=0 tid=0x00007fd44803a800 nid=0x112dc runnable [0x00007fd1c07ef000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
    - locked <0x00007fd7210085f0> (a java.lang.Object)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    - locked <0x00007fd72100a6c0> (a sun.security.ssl.AppInputStream)`

All of the other HTTP threads serving user requests are in the WAITING (parking) state, which is why the application was unresponsive. They appear to be waiting for a lock held by one of the threads mentioned above. For example:

`"http-nio-8080-exec-46 url:/plugins/servlet/gadgets/ifr username:user_z " #58048 daemon prio=5 os_prio=0 tid=0x00007fd448015800 nid=0x1960e waiting on condition [0x00007fd38119e000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007fd68c4eb2e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)`
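For readers unfamiliar with this state: a thread reports WAITING (parking) on an `AbstractQueuedSynchronizer$ConditionObject` whenever it waits on a `java.util.concurrent` condition, for example when taking from an empty blocking queue. The following is a minimal sketch that reproduces the state in isolation. It is not Jira or plugin code, and the class, method, and thread names are made up for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

/** Illustrative only: reproduces the WAITING (parking) state on a ConditionObject. */
class ParkedThreadSketch {

    /**
     * Starts a worker that blocks on an empty queue, waits until it has parked,
     * and returns { thread state, class name of the object it is parked on }.
     */
    static String[] observeParkedWorker() {
        final BlockingQueue<String> emptyQueue = new ArrayBlockingQueue<>(1);
        Thread worker = new Thread(() -> {
            try {
                emptyQueue.take(); // parks on the queue's internal Condition
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // released by the interrupt below
            }
        }, "demo-worker");
        worker.setDaemon(true);
        worker.start();
        try {
            // Bounded poll: give the worker up to 5 seconds to reach the parked state.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (worker.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            Object blocker = LockSupport.getBlocker(worker);
            return new String[] {
                worker.getState().name(),
                blocker == null ? "none" : blocker.getClass().getName()
            };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing worker", e);
        } finally {
            worker.interrupt(); // wake the worker so it can exit
        }
    }

    public static void main(String[] args) {
        String[] observed = observeParkedWorker();
        System.out.println(observed[0] + " on " + observed[1]);
    }
}
```

A parked thread by itself only says that the thread is waiting for a signal. Identifying who should send that signal requires the rest of the dump.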

GC and heap analysis show no memory-related issues, so neither GC nor memory usage is causing the problem.

The only http thread actively running was this:

`"http-nio-8080-exec-136 url:/plugins/servlet/gadgets/ifr username:user_x " #333526 daemon prio=5 os_prio=0 tid=0x00007fd44803a800 nid=0x112dc runnable [0x00007fd1c07ef000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
    - locked <0x00007fd7210085f0> (a java.lang.Object)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    - locked <0x00007fd72100a6c0> (a sun.security.ssl.AppInputStream)`

As the other HTTP threads were waiting to acquire a lock, it looks like exec-136 was itself unable to retrieve data from the cache. Since the only other active thread in that timeframe was the pool thread, which belongs to the Prometheus exporter, we infer that some functionality in the plugin took a long time and prevented the cache lock from being released.
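The frames in the pool thread's stack (`Files.walkFileTree` called from `calculateAttachmentSize` inside a periodic `ScheduledThreadPoolExecutor` task) suggest a recurring full walk of a directory tree to sum file sizes. The sketch below shows that general pattern under stated assumptions. It is not the plugin's actual code: the class name, method name, and one-second delay are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical reconstruction of a periodic "sum all file sizes" job. */
class AttachmentSizeSketch {

    /** Walks the whole tree under root and returns the total size of its files in bytes. */
    static long directorySize(Path root) {
        final AtomicLong total = new AtomicLong();
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    total.addAndGet(attrs.size());
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    return FileVisitResult.CONTINUE; // skip entries that cannot be read
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total.get();
    }

    public static void main(String[] args) throws InterruptedException {
        final Path root = Paths.get(args.length > 0 ? args[0] : ".");
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Every run re-reads every directory entry, so the cost grows with the tree.
        scheduler.scheduleWithFixedDelay(
            () -> System.out.println("bytes under " + root + ": " + directorySize(root)),
            0, 1, TimeUnit.SECONDS);
        Thread.sleep(2500); // let the job run a few times, then stop
        scheduler.shutdownNow();
    }
}
```

Each run issues a directory read for every directory in the tree, so on a large attachment store a single pass can stay RUNNABLE in `readdir` for a long time, which matches the pool thread seen in the dump.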

Plugin version: 1.0.20
Jira: Data Center 7.6.1

Are there any known cases similar to this one? Thank you.

AntonPakholchuk commented 5 years ago

At the same time, the File Descriptor Saturation graph shows exponentially growing read operations, which correlates with these frames:

`at sun.nio.fs.UnixNativeDispatcher.readdir(Native Method)
at sun.nio.fs.UnixDirectoryStream$UnixDirectoryIterator.readNextEntry(UnixDirectoryStream.java:168)
at sun.nio.fs.UnixDirectoryStream$UnixDirectoryIterator.hasNext(UnixDirectoryStream.java:201)`

AndreyVMarkelov commented 5 years ago

I will remove that. Thank you!

(In reply to AntonPakholchuk's comment of Fri, 18 Jan 2019, 14:47, which included this graph: [image: image] https://user-images.githubusercontent.com/11921816/51385268-d277f300-1b2f-11e9-8632-4612e9dc7eb1.png)

AntonPakholchuk commented 5 years ago

Hi Andrey, thanks for the prompt reply. Can you share your view on whether it is the Prometheus exporter's activity that causes Jira to become unresponsive, and what the issue could be in particular? Thank you.

AndreyVMarkelov commented 5 years ago

In progress now.

towolf commented 5 years ago

@AndreyVMarkelov still in progress?

We had to disable the plugin because it made Jira a lot slower; some pages took very long to load.

After disabling the plugin, things went back to normal.

AndreyVMarkelov commented 5 years ago

Hi, could you please try setting 0 in the settings for the scheduled job?
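The thread does not spell out what a value of 0 does. Assuming it disables the scheduled metrics job, a guard along the following lines would be needed in any case, because `ScheduledExecutorService.scheduleWithFixedDelay` rejects a non-positive delay with an `IllegalArgumentException`. The class and method names below are hypothetical and not taken from the plugin.

```java
import java.util.Optional;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: treat a delay of 0 (or less) as "job disabled". */
class ScheduledJobToggleSketch {

    /** Schedules the job only when delaySeconds is positive; otherwise does nothing. */
    static Optional<ScheduledFuture<?>> scheduleIfEnabled(
            ScheduledExecutorService scheduler, Runnable job, long delaySeconds) {
        if (delaySeconds <= 0) {
            return Optional.empty(); // assumption: 0 means the job is switched off
        }
        ScheduledFuture<?> handle =
            scheduler.scheduleWithFixedDelay(job, 0, delaySeconds, TimeUnit.SECONDS);
        return Optional.of(handle);
    }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            boolean scheduled = scheduleIfEnabled(scheduler, () -> { }, 0).isPresent();
            System.out.println("scheduled with delay 0: " + scheduled);
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```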

(In reply to Tobias Wolf's comment of Tue, 28 May 2019, 1:50.)

AndreyVMarkelov commented 5 years ago

@towolf @AntonPakholchuk It should be fixed now. Could you try please?

towolf commented 5 years ago

@AndreyVMarkelov my colleague just mentioned in passing that we re-enabled the plugin about two weeks ago, and it has been working fine. No slowdowns anymore.

So, thanks!

(In reply to Andrey Markelov's comment of Wed, May 29, 2019, 18:42.)

AndreyVMarkelov / jira-prometheus-exporter

Prometheus Exporter for Jira cause jira to become unresponsive? #43