IBMStreams / streamsx.monitoring

The com.ibm.streamsx.monitoring toolkit provides capabilities to create applications that monitor IBM Streams and its applications.
https://ibmstreams.github.io/streamsx.monitoring/

JobStatusMonitor is missing events #103

Closed chanskw closed 6 years ago

chanskw commented 6 years ago

I am trying out the JobStatusMonitor sample.

I have two jobs:

- HealthDataBeaconService: generates patient data and publishes it
- Router: subscribes to the data from the beacon and routes it to different files

I launched the two jobs and also the JobStatusMonitor sample. All jobs are started with default fusion, so there is one PE per job.

While the JobStatusMonitor sample is running, I restarted the Router PE. I waited for the jobs to become healthy again.

I looked at the console output from JobStatusMonitor to make sure that it has all the events, and here is what I got:

{notifyType="com.ibm.streams.management.pe.changed",domainId="StreamsDomain",instanceId="StreamsInstance",jobId=0,jobName="com.ibm.streamsx.health.prepare::Router_0",resource="streamsqse.localdomain",peId=0,peHealth="healthy",peStatus="stopping",eventTimestamp=(1512487951,247000000,0)}
{notifyType="com.ibm.streams.management.pe.changed",domainId="StreamsDomain",instanceId="StreamsInstance",jobId=0,jobName="com.ibm.streamsx.health.prepare::Router_0",resource="streamsqse.localdomain",peId=0,peHealth="unhealthy",peStatus="stopping",eventTimestamp=(1512487951,278000000,0)}
{notifyType="com.ibm.streams.management.pe.changed",domainId="StreamsDomain",instanceId="StreamsInstance",jobId=1,jobName="com.ibm.streamsx.health.simulate.beacon.services::HealthDataBeaconService_1",resource="streamsqse.localdomain",peId=1,peHealth="healthy",peStatus="running",eventTimestamp=(1512487953,485000000,0)}
{notifyType="com.ibm.streams.management.pe.changed",domainId="StreamsDomain",instanceId="StreamsInstance",jobId=0,jobName="com.ibm.streamsx.health.prepare::Router_0",resource="streamsqse.localdomain",peId=0,peHealth="unhealthy",peStatus="stopped",eventTimestamp=(1512487953,511000000,0)}
{notifyType="com.ibm.streams.management.pe.changed",domainId="StreamsDomain",instanceId="StreamsInstance",jobId=0,jobName="com.ibm.streamsx.health.prepare::Router_0",resource="streamsqse.localdomain",peId=0,peHealth="unhealthy",peStatus="restarting",eventTimestamp=(1512487953,832000000,0)}
{notifyType="com.ibm.streams.management.pe.changed",domainId="StreamsDomain",instanceId="StreamsInstance",jobId=0,jobName="com.ibm.streamsx.health.prepare::Router_0",resource="streamsqse.localdomain",peId=0,peHealth="partiallyUnhealthy",peStatus="restarting",eventTimestamp=(1512487953,854000000,0)}
{notifyType="com.ibm.streams.management.pe.changed",domainId="StreamsDomain",instanceId="StreamsInstance",jobId=0,jobName="com.ibm.streamsx.health.prepare::Router_0",resource="streamsqse.localdomain",peId=0,peHealth="partiallyHealthy",peStatus="running",eventTimestamp=(1512487955,166000000,0)}
{notifyType="com.ibm.streams.management.pe.changed",domainId="StreamsDomain",instanceId="StreamsInstance",jobId=1,jobName="com.ibm.streamsx.health.simulate.beacon.services::HealthDataBeaconService_1",resource="streamsqse.localdomain",peId=1,peHealth="partiallyHealthy",peStatus="running",eventTimestamp=(1512487957,840000000,0)}
{notifyType="com.ibm.streams.management.pe.changed",domainId="StreamsDomain",instanceId="StreamsInstance",jobId=0,jobName="com.ibm.streamsx.health.prepare::Router_0",resource="streamsqse.localdomain",peId=0,peHealth="healthy",peStatus="running",eventTimestamp=(1512487960,829000000,0)}

HealthDataBeaconService is healthy at this point, but its healthy event is missing from the printed events.
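One way to check which state was last reported for each job is to fold the console output into a "latest event per job" map. The following is a minimal Python sketch, assuming the console lines keep the field order shown in the events above. `last_health_by_job` and `EVENT_RE` are illustrative names, not part of the toolkit, and the sample lines are abbreviated copies of the last two events (some fields omitted).

```python
import re

# Picks jobId, peHealth, peStatus and the timestamp out of one console line.
EVENT_RE = re.compile(
    r'jobId=(?P<job>\d+),.*?peHealth="(?P<health>\w+)",peStatus="(?P<status>\w+)",'
    r'eventTimestamp=\((?P<sec>\d+),(?P<nsec>\d+),\d+\)'
)

def last_health_by_job(lines):
    """Return {jobId: (peHealth, peStatus, timestamp)} for the newest event of each job."""
    latest = {}
    for line in lines:
        m = EVENT_RE.search(line)
        if m is None:
            continue  # not an event tuple, ignore
        ts = int(m["sec"]) + int(m["nsec"]) / 1e9
        job = int(m["job"])
        if job not in latest or ts >= latest[job][2]:
            latest[job] = (m["health"], m["status"], ts)
    return latest

# Abbreviated copies of the last two events in the console output.
sample = [
    '{jobId=1,peId=1,peHealth="partiallyHealthy",peStatus="running",'
    'eventTimestamp=(1512487957,840000000,0)}',
    '{jobId=0,peId=0,peHealth="healthy",peStatus="running",'
    'eventTimestamp=(1512487960,829000000,0)}',
]

for job_id, (health, status, _) in sorted(last_health_by_job(sample).items()):
    print(job_id, health, status)
```

For the events shown, the last state reported for job 1 (HealthDataBeaconService) is `partiallyHealthy`, even though the job itself had already become healthy.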

I restarted the PE a second time, and the missing healthy event showed up in the midst of the second set of events.

If the jobs become healthy, the JobStatusMonitor needs to report it in a timely manner. Clients are trying to use this to determine the status of their jobs, and if the jobs stay unhealthy, clients need to intervene manually.

chanskw commented 6 years ago

This seems to be an intermittent problem. I can reproduce it sometimes, but not every time.

markheger commented 6 years ago

Workaround: use the JobStatusSource operator instead of JobStatusMonitor. JobStatusMonitor includes a DeDuplicate operator, which might filter out messages that arrive within a 5-second interval.
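To illustrate how a time-window deduplicator could swallow the missing event, here is a simplified Python model. It is a sketch under assumptions, not the DeDuplicate operator's actual implementation: it assumes the deduplication key is (jobId, peId, peHealth, peStatus) with the timestamp ignored, and the third event below is hypothetical (the healthy event was never printed, so its real timestamp is unknown).

```python
def deduplicate(events, window_seconds=5.0):
    """Drop an event if one with the same key was passed on less than window_seconds earlier."""
    last_passed = {}  # key -> timestamp of the last event that was passed on
    passed = []
    for event in events:
        ts, job_id, pe_id, health, status = event
        key = (job_id, pe_id, health, status)  # ASSUMED key; timestamp is not part of it
        previous = last_passed.get(key)
        if previous is not None and ts - previous < window_seconds:
            continue  # treated as a duplicate and filtered out
        last_passed[key] = ts
        passed.append(event)
    return passed

events = [
    (1512487953.485, 1, 1, "healthy", "running"),           # from the console output
    (1512487957.840, 1, 1, "partiallyHealthy", "running"),  # from the console output
    (1512487958.200, 1, 1, "healthy", "running"),           # HYPOTHETICAL missing event
]

passed = deduplicate(events)
for event in passed:
    print(event)
```

Under these assumptions, the hypothetical healthy event arrives less than 5 seconds after the earlier healthy event with the same key, so it is dropped and the last state a consumer sees for the job is `partiallyHealthy`. This would also be consistent with the problem being intermittent, since it depends on how quickly the job recovers.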