Closed wwang72 closed 4 years ago
Are there any errors in the Event Viewer's Application logs around the time of the unexpected service termination message?
I didn't see much useful information except for something below appearing every other minute.
Log Name: Application Source: openHistorian Date: 12/2/2019 5:55:47 PM Event ID: 0 Task Category: None Level: Information Keywords: Classic,Audit Success User: N/A Computer: server Description: Subsequent Encounter: user "UTK\user" granted access with role "Administrator" - role assignment is the same as last login. Event Xml: `
Can you let me know which version of the openHistorian you are using?
Thanks! Ritchie
Can you let me know which version of the openHistorian you are using?
Thanks! Ritchie
I am using version 2.5.6.
Can you let me know which version of the openHistorian you are using?
Thanks! Ritchie
Just an update. We installed openHistorian on 2 machines. Both of them are seeing memory leaks.
It is possible to upgrade to version 2.6?
It is possible to upgrade to version 2.6?
Yes. I did upgrade one of the servers to 2.6 and the same issue exists.
I've not detected any memory leaks in application instances we have running in production - some instances run many months on end with 300+ PMUs. Can you fully describe your system configuration, custom adapters, protocols in use, etc?
Thanks, Ritchie
I've not detected any memory leaks in application instances we have running in production - some instances run many months on end with 300+ PMUs. Can you fully describe your system configuration, custom adapters, protocols in use, etc?
Thanks, Ritchie
The server runs within the University of Tennessee domain. It connects to around 150 devices, with each streaming data at 10 points/sec. The data is written into the default local openHistorian instance. No data forwarding is configured. The server runs Windows Server 2016. No custom adapters are installed.
Are all devices using FNET protocol, or IEEE C37.118? Also, are there any automated queries?
On thing you can try is this:
1) Run the XML Configuration Editor (found in Windows Start Menu under openHistorian) 2) Open section called "systemSettings" 3) Find key called "MemoryPoolSize" 4) Set value to 50% of available system memory - value units is in Gigabytes 5) Click "Save Settings"
This should automatically restart openHistorian service and limit memory usage to specified value.
Thanks, Ritchie
Are all devices using FNET protocol, or IEEE C37.118? Also, are there any automated queries?
On thing you can try is this:
- Run the XML Configuration Editor (found in Windows Start Menu under openHistorian)
- Open section called "systemSettings"
- Find key called "MemoryPoolSize"
- Set value to 50% of available system memory - value units is in Gigabytes
- Click "Save Settings"
This should automatically restart openHistorian service and limit memory usage to specified value.
Thanks, Ritchie
Thanks, Ritchie. There is a .NET based web service that provides query function to some outer applications. There are around 6 outer applications calling the web service to query the data every 4 seconds. The estimated amount of the requested data is relatively small (from the past 15 seconds with 4 seconds interval).
Thanks - generally queries allocate memory that is not immediately released, since the system will assume that this memory may be needed again for future queries. Let me know if the recommended setting updates helps.
Ritchie
Thanks - generally queries allocate memory that is not immediately released, since the system will assume that this memory may be needed again for future queries. Let me know if the recommended setting updates helps.
Ritchie
I stopped the web service as well as set the MemoryPoolSize to 16GB out of the total 32GB. The memory leak still exists. I noticed the CPU usage is also abnormally high. It can consume 40%+ of my 16-core CPU, when the memory reaches 99%.
If you pull up task manager, can you confirm that it is the openHistorian.exe
task that is using the CPU/RAM?
Also, there should be log files in the install folder of openHistorian, such as, C:\Program Files\openHistorian\Logs
- these files end in .logz
. Attach a couple of the most recent files here so I can review the cause of your issue.
Thanks, Ritchie
If you pull up task manager, can you confirm that it is the
openHistorian.exe
task that is using the CPU/RAM?Also, there should be log files in the install folder of openHistorian, such as,
C:\Program Files\openHistorian\Logs
- these files end in.logz
. Attach a couple of the most recent files here so I can review the cause of your issue.Thanks, Ritchie
Thanks.
Yes. it is the openHistorian.exe
task.
I am attaching some copies of the most recent logs.
20191231-171633-957732 - openHistorian - 1.log
20191231-173949-535061 - openHistorian - 1.log
20191231-173841-137858 - openHistorian - 1.log
20191231-173038-743754 - openHistorian - 1.log
Please be aware that github does not permit .logz
files. I renamed them as .log
files and please change the name when troubleshooting.
One thing I noticed was this, it looks like things are caught in a loop waiting for initial data, see this error:
No data received in 5.0 seconds, restarting connect cycle...
This is causing a system cascade of restart events for all FNET devices. Because there are so many simultaneous connections, I think you need to give the system a longer time to start receiving data. If your configuration is SQL Server based, you can execute the following command to update all devices to wait at least 60 seconds for data before restarting connection:
UPDATE [openPDC].[dbo].[Device] SET [DataLossInterval] = 60
I suggest restarting openHistorian
service after this update.
Ritchie
Also, seeing this error: MultiProtocolFrameParser "ReceivedConfigurationFrame" consumer event handler exception: Array dimensions exceeded supported range. ---> System.OverflowException: Array dimensions exceeded supported range.
This seems like a possible protocol exception.
What protocol are you using to receive data? FNET
or IEEE C37.118
?
Here's another exception: Connection attempt failed for : Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0; islistener=True; timeOffset=109000000; stationName=F-NET Unit; frameRate=10; nominalFrequency=50;isConcentrator=0;accessID=3000;;timeAdjustmentTicks=0;phasorProtocol=FNet;dataLossInterval=5;allowedParsingExceptions=10;parsingExceptionWindow=5;delayedConnectionInterval=5;allowUseOfCachedConfiguration=1;autoStartDataParsingSequence=1;skipDisableRealTimeData=0;measurementReportingInterval=100000;connectOnDemand=1"
This seems to indicate that the protocol being used is FNET
- for some devices it looks like the receiving port number is not unique.
If you want to use the same port number for all incoming connections, you will need to use a "shared connection".
Here's another exception:
Connection attempt failed for : Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0; islistener=True; timeOffset=109000000; stationName=F-NET Unit; frameRate=10; nominalFrequency=50;isConcentrator=0;accessID=3000;;timeAdjustmentTicks=0;phasorProtocol=FNet;dataLossInterval=5;allowedParsingExceptions=10;parsingExceptionWindow=5;delayedConnectionInterval=5;allowUseOfCachedConfiguration=1;autoStartDataParsingSequence=1;skipDisableRealTimeData=0;measurementReportingInterval=100000;connectOnDemand=1"
This seems to indicate that the protocol being used is
FNET
- for some devices it looks like the receiving port number is not unique.If you want to use the same port number for all incoming connections, you will need to use a "shared connection".
Thank you, Ritchie.
I resolved the port confliction issue and I adjusted the DataLossInterval
. The memory leak issue still exists. I restarted the service and shut it down after 2 minutes. I am attaching the log that contains system information for this period.
20191231-203616-185124 - openHistorian - 1 (2).log
20191231-203616-185124 - openHistorian - 1 (1).log
A few things:
(1) This message still says No data received in 5.0 seconds, restarting connect cycle...
, so it looks like 60 second connection cycling was not properly applied.
(2) There is continual failure with F-NET protocol parsing which causes connection cycling: MultiProtocolFrameParser "ReceivedConfigurationFrame" consumer event handler exception: Array dimensions exceeded supported range.
(3) This error also keep happening: Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0;
- however this could be related to connection cycling.
Have you tried connecting with PMU Connection Tester?
Is there a way I can get an F-NET stream directed here for triage?
Thanks, Ritchie
A few things:
(1) This message still says
No data received in 5.0 seconds, restarting connect cycle...
, so it looks like 60 second connection cycling was not properly applied.(2) There is continual failure with F-NET protocol parsing which causes connection cycling:
MultiProtocolFrameParser "ReceivedConfigurationFrame" consumer event handler exception: Array dimensions exceeded supported range.
(3) This error also keep happening:
Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0;
- however this could be related to connection cycling.Have you tried connecting with PMU Connection Tester?
Is there a way I can get an F-NET stream directed here for triage?
Thanks, Ritchie
Ritchie, My apologies for the log. It was recorded before the issues were resolved. Please see the attached log 20191231-203730-475463 - openHistorian - 1.log I can open a data stream from our server to your end. Please provide an IP address if you will.
Same issues remain with new log file:
(1) Still looks as if reconnect after no data is still 5.0 seconds: No data received in 5.0 seconds, restarting connect cycle...
(2) This error is still happening: MultiProtocolFrameParser "ReceivedConfigurationFrame" consumer event handler exception: Array dimensions exceeded supported range.
(3) This exception is related to IN-FQ30-3000
: Only one usage of each socket address (protocol/network address/port) is normally permitted in "transportprotocol=Tcp;server=127.0.0.1; port=10413; interface=0.0.0.0;
Note that you can check these logs yourself by running the Log File Viewer
application found in the Windows Start Menu
under the openHistorian
folder.
My suggestion is thus:
(1) Stop openHistorian (2) Validate that a single F-NET connection operates properly with PMU Connection Tester, if this works (3) Enable only one device F-NET device in openHistorian (same one you tested) and restart service
If you still have trouble, please send a single F-NET device to GPA IP (check nslookup gridprotectionalliance.org
) then let me know port so we can temporarily open fire wall.
Thanks, Ritchie
Going to close this issue, if you have any further need, please send a note to our discussions board:
https://discussions.gridprotectionalliance.org/c/gpa-products/openhistorian
Thanks, Ritchie
I had an exception that caused the openHistorian service to stop. After my initial investigation, I found there were several TLS issues reported by my Microsoft event viewer Log Name: System Source: Service Control Manager Date: 12/2/2019 5:56:48 PM Event ID: 7034 Task Category: None Level: Error Keywords: Classic User: N/A Computer: server Description: The openHistorian service terminated unexpectedly. It has done this 1 time(s). Event Xml: `