GridProtectionAlliance / openHistorian

The Open Source Time-Series Data Historian
MIT License
172 stars 51 forks source link

Service unexpected stop #52

Closed wwang72 closed 4 years ago

wwang72 commented 4 years ago

I had an exception that caused the openHistorian service to stop. After my initial investigation, I found there were several TLS issues reported by my Microsoft event viewer Log Name: System Source: Service Control Manager Date: 12/2/2019 5:56:48 PM Event ID: 7034 Task Category: None Level: Error Keywords: Classic User: N/A Computer: server Description: The openHistorian service terminated unexpectedly. It has done this 1 time(s). Event Xml: `

7034 0 2 0 0 0x8080000000000000 96351 System server openHistorian 1 6F00700065006E0048006900730074006F007200690061006E000000 ` Log Name: System Source: Schannel Date: 12/2/2019 5:56:46 PM Event ID: 36874 Task Category: None Level: Error Keywords: User: SYSTEM Computer: server Description: An TLS 1.1 connection request was received from a remote client application, but none of the cipher suites supported by the client application are supported by the server. The TLS connection request has failed. Event Xml: ` 36874 0 2 0 0 0x8000000000000000 96350 System server TLS 1.1 ` Log Name: System Source: Schannel Date: 12/2/2019 5:56:37 PM Event ID: 36874 Task Category: None Level: Error Keywords: User: SYSTEM Computer: server Description: An TLS 1.0 connection request was received from a remote client application, but none of the cipher suites supported by the client application are supported by the server. The TLS connection request has failed. Event Xml: ` 36874 0 2 0 0 0x8000000000000000 96349 System server TLS 1.0 ` Log Name: System Source: Schannel Date: 12/2/2019 5:56:27 PM Event ID: 36874 Task Category: None Level: Error Keywords: User: SYSTEM Computer: server Description: An TLS 1.2 connection request was received from a remote client application, but none of the cipher suites supported by the client application are supported by the server. The TLS connection request has failed. Event Xml: ` 36874 0 2 0 0 0x8000000000000000 96348 System server TLS 1.2 ` Log Name: System Source: Schannel Date: 12/2/2019 5:56:17 PM Event ID: 36874 Task Category: None Level: Error Keywords: User: SYSTEM Computer: server Description: An TLS 1.1 connection request was received from a remote client application, but none of the cipher suites supported by the client application are supported by the server. The TLS connection request has failed. Event Xml: ` 36874 0 2 0 0 0x8000000000000000 96347 System server TLS 1.1 ` Log Name: System Source: Schannel Date: 12/2/2019 5:56:17 PM Event ID: 36874 Task Category: None Level: Error Keywords: User: SYSTEM Computer: server Description: An TLS 1.1 connection request was received from a remote client application, but none of the cipher suites supported by the client application are supported by the server. The TLS connection request has failed. Event Xml: ` 36874 0 2 0 0 0x8000000000000000 96346 System server TLS 1.1 ` At the same time on the openHistorian side, I found logs below ![Screen Shot 2019-12-03 at 3 00 09 PM](https://user-images.githubusercontent.com/7548029/70085215-ced14b80-15dd-11ea-96cd-2c355c5c8784.png) Any ideas from the logs?
StephenCWills commented 4 years ago

Are there any errors in the Event Viewer's Application logs around the time of the unexpected service termination message?

wwang72 commented 4 years ago

I didn't see much useful information except for something below appearing every other minute.

Log Name: Application Source: openHistorian Date: 12/2/2019 5:55:47 PM Event ID: 0 Task Category: None Level: Information Keywords: Classic,Audit Success User: N/A Computer: server Description: Subsequent Encounter: user "UTK\user" granted access with role "Administrator" - role assignment is the same as last login. Event Xml: `

0 0 0 0xa0000000000000 6224500 Application server Subsequent Encounter: user "UTK\user" granted access with role "Administrator" - role assignment is the same as last login. `
ritchiecarroll commented 4 years ago

Can you let me know which version of the openHistorian you are using?

Thanks! Ritchie

wwang72 commented 4 years ago

Can you let me know which version of the openHistorian you are using?

Thanks! Ritchie

I am using version 2.5.6.

wwang72 commented 4 years ago

Can you let me know which version of the openHistorian you are using?

Thanks! Ritchie

Just an update. We installed openHistorian on 2 machines. Both of them are seeing memory leaks.

ritchiecarroll commented 4 years ago

It is possible to upgrade to version 2.6?

wwang72 commented 4 years ago

It is possible to upgrade to version 2.6?

Yes. I did upgrade one of the servers to 2.6 and the same issue exists.

ritchiecarroll commented 4 years ago

I've not detected any memory leaks in application instances we have running in production - some instances run many months on end with 300+ PMUs. Can you fully describe your system configuration, custom adapters, protocols in use, etc?

Thanks, Ritchie

wwang72 commented 4 years ago

I've not detected any memory leaks in application instances we have running in production - some instances run many months on end with 300+ PMUs. Can you fully describe your system configuration, custom adapters, protocols in use, etc?

Thanks, Ritchie

The server runs within the University of Tennessee domain. It connects to around 150 devices, with each streaming data at 10 points/sec. The data is written into the default local openHistorian instance. No data forwarding is configured. The server runs Windows Server 2016. No custom adapters are installed.

ritchiecarroll commented 4 years ago

Are all devices using FNET protocol, or IEEE C37.118? Also, are there any automated queries?

On thing you can try is this:

1) Run the XML Configuration Editor (found in Windows Start Menu under openHistorian) 2) Open section called "systemSettings" 3) Find key called "MemoryPoolSize" 4) Set value to 50% of available system memory - value units is in Gigabytes 5) Click "Save Settings"

This should automatically restart openHistorian service and limit memory usage to specified value.

Thanks, Ritchie

wwang72 commented 4 years ago

Are all devices using FNET protocol, or IEEE C37.118? Also, are there any automated queries?

On thing you can try is this:

  1. Run the XML Configuration Editor (found in Windows Start Menu under openHistorian)
  2. Open section called "systemSettings"
  3. Find key called "MemoryPoolSize"
  4. Set value to 50% of available system memory - value units is in Gigabytes
  5. Click "Save Settings"

This should automatically restart openHistorian service and limit memory usage to specified value.

Thanks, Ritchie

Thanks, Ritchie. There is a .NET based web service that provides query function to some outer applications. There are around 6 outer applications calling the web service to query the data every 4 seconds. The estimated amount of the requested data is relatively small (from the past 15 seconds with 4 seconds interval).

ritchiecarroll commented 4 years ago

Thanks - generally queries allocate memory that is not immediately released, since the system will assume that this memory may be needed again for future queries. Let me know if the recommended setting updates helps.

Ritchie

wwang72 commented 4 years ago

Thanks - generally queries allocate memory that is not immediately released, since the system will assume that this memory may be needed again for future queries. Let me know if the recommended setting updates helps.

Ritchie

I stopped the web service as well as set the MemoryPoolSize to 16GB out of the total 32GB. The memory leak still exists. I noticed the CPU usage is also abnormally high. It can consume 40%+ of my 16-core CPU, when the memory reaches 99%.

ritchiecarroll commented 4 years ago

If you pull up task manager, can you confirm that it is the openHistorian.exe task that is using the CPU/RAM?

Also, there should be log files in the install folder of openHistorian, such as, C:\Program Files\openHistorian\Logs - these files end in .logz. Attach a couple of the most recent files here so I can review the cause of your issue.

Thanks, Ritchie

wwang72 commented 4 years ago

If you pull up task manager, can you confirm that it is the openHistorian.exe task that is using the CPU/RAM?

Also, there should be log files in the install folder of openHistorian, such as, C:\Program Files\openHistorian\Logs - these files end in .logz. Attach a couple of the most recent files here so I can review the cause of your issue.

Thanks, Ritchie

Thanks. Yes. it is the openHistorian.exe task. I am attaching some copies of the most recent logs. 20191231-171633-957732 - openHistorian - 1.log 20191231-173949-535061 - openHistorian - 1.log 20191231-173841-137858 - openHistorian - 1.log 20191231-173038-743754 - openHistorian - 1.log Please be aware that github does not permit .logz files. I renamed them as .log files and please change the name when troubleshooting.

ritchiecarroll commented 4 years ago

One thing I noticed was this, it looks like things are caught in a loop waiting for initial data, see this error: No data received in 5.0 seconds, restarting connect cycle...

This is causing a system cascade of restart events for all FNET devices. Because there are so many simultaneous connections, I think you need to give the system a longer time to start receiving data. If your configuration is SQL Server based, you can execute the following command to update all devices to wait at least 60 seconds for data before restarting connection: UPDATE [openPDC].[dbo].[Device] SET [DataLossInterval] = 60

I suggest restarting openHistorian service after this update.

Ritchie

ritchiecarroll commented 4 years ago

Also, seeing this error: MultiProtocolFrameParser "ReceivedConfigurationFrame" consumer event handler exception: Array dimensions exceeded supported range. ---> System.OverflowException: Array dimensions exceeded supported range.

This seems like a possible protocol exception.

What protocol are you using to receive data? FNET or IEEE C37.118?

ritchiecarroll commented 4 years ago

Here's another exception: Connection attempt failed for : Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0; islistener=True; timeOffset=109000000; stationName=F-NET Unit; frameRate=10; nominalFrequency=50;isConcentrator=0;accessID=3000;;timeAdjustmentTicks=0;phasorProtocol=FNet;dataLossInterval=5;allowedParsingExceptions=10;parsingExceptionWindow=5;delayedConnectionInterval=5;allowUseOfCachedConfiguration=1;autoStartDataParsingSequence=1;skipDisableRealTimeData=0;measurementReportingInterval=100000;connectOnDemand=1"

This seems to indicate that the protocol being used is FNET - for some devices it looks like the receiving port number is not unique.

If you want to use the same port number for all incoming connections, you will need to use a "shared connection".

wwang72 commented 4 years ago

Here's another exception: Connection attempt failed for : Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0; islistener=True; timeOffset=109000000; stationName=F-NET Unit; frameRate=10; nominalFrequency=50;isConcentrator=0;accessID=3000;;timeAdjustmentTicks=0;phasorProtocol=FNet;dataLossInterval=5;allowedParsingExceptions=10;parsingExceptionWindow=5;delayedConnectionInterval=5;allowUseOfCachedConfiguration=1;autoStartDataParsingSequence=1;skipDisableRealTimeData=0;measurementReportingInterval=100000;connectOnDemand=1"

This seems to indicate that the protocol being used is FNET - for some devices it looks like the receiving port number is not unique.

If you want to use the same port number for all incoming connections, you will need to use a "shared connection".

Thank you, Ritchie. I resolved the port confliction issue and I adjusted the DataLossInterval. The memory leak issue still exists. I restarted the service and shut it down after 2 minutes. I am attaching the log that contains system information for this period. 20191231-203616-185124 - openHistorian - 1 (2).log 20191231-203616-185124 - openHistorian - 1 (1).log

ritchiecarroll commented 4 years ago

A few things:

(1) This message still says No data received in 5.0 seconds, restarting connect cycle..., so it looks like 60 second connection cycling was not properly applied.

(2) There is continual failure with F-NET protocol parsing which causes connection cycling: MultiProtocolFrameParser "ReceivedConfigurationFrame" consumer event handler exception: Array dimensions exceeded supported range.

(3) This error also keep happening: Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0; - however this could be related to connection cycling.


Have you tried connecting with PMU Connection Tester?

Is there a way I can get an F-NET stream directed here for triage?

Thanks, Ritchie

wwang72 commented 4 years ago

A few things:

(1) This message still says No data received in 5.0 seconds, restarting connect cycle..., so it looks like 60 second connection cycling was not properly applied.

(2) There is continual failure with F-NET protocol parsing which causes connection cycling: MultiProtocolFrameParser "ReceivedConfigurationFrame" consumer event handler exception: Array dimensions exceeded supported range.

(3) This error also keep happening: Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0; - however this could be related to connection cycling.

Have you tried connecting with PMU Connection Tester?

Is there a way I can get an F-NET stream directed here for triage?

Thanks, Ritchie

Ritchie, My apologies for the log. It was recorded before the issues were resolved. Please see the attached log 20191231-203730-475463 - openHistorian - 1.log I can open a data stream from our server to your end. Please provide an IP address if you will.

ritchiecarroll commented 4 years ago

Same issues remain with new log file:

(1) Still looks as if reconnect after no data is still 5.0 seconds: No data received in 5.0 seconds, restarting connect cycle...

(2) This error is still happening: MultiProtocolFrameParser "ReceivedConfigurationFrame" consumer event handler exception: Array dimensions exceeded supported range.

(3) This exception is related to IN-FQ30-3000: Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0;

Note that you can check these logs yourself by running the Log File Viewer application found in the Windows Start Menu under the openHistorian folder.


My suggestion is thus:

(1) Stop openHistorian (2) Validate that a single F-NET connection operates properly with PMU Connection Tester, if this works (3) Enable only one device F-NET device in openHistorian (same one you tested) and restart service

If you still have trouble, please send a single F-NET device to GPA IP (check nslookup gridprotectionalliance.org) then let me know port so we can temporarily open fire wall.

Thanks, Ritchie

ritchiecarroll commented 4 years ago

Going to close this issue, if you have any further need, please send a note to our discussions board:

https://discussions.gridprotectionalliance.org/c/gpa-products/openhistorian

Thanks, Ritchie