Show PV disconnections in the data browser

aawdls commented 2 years ago

Show in the databrowser the events where the archiver appliance disconnected from a PV. This is analogous to the behaviour of the old Java ArchiveViewer for the channel archiver. The disconnect is indicated with a dashed vertical line. This behaviour is enabled by a preference.

Please see example display in the screenshot below.

kasemir commented 2 years ago

Can you adjust the implementation? The RTPlot is meant to be a somewhat generic plotting package. It should not have to check each sample for the text "disconnected" inside the optional info text. How about using Double.isNaN(sample) or ! Double.isFinite(sample) as a criteria for samples that don't have a plottable value, and highlight those? Then in the archive data provider, assert that samples which are disconnected, archive-off, ... have Double.NaN as a value. That should be a little faster than the per-sample string comparison, and be fully generic.

shroffk commented 2 years ago

I really like this feature!!

+1 on refactoring the implementation as kay pointed out.

rjwills28 commented 2 years ago

Thanks for the input Kay. We just have one concern about indicating a disconnection with NaN and that is what if the PV is deliberately set to NaN (for some reason)? Below is a bit of background to why we chose to use a String.

Originally the databrowser would just draw a line between points before and after a disconnect meaning that you could never see if there had been a disconnection. As you have seen, our original attempt to fix this was to intercept when a disconnect had happened in the archive plugin code and replace the value with a VString with the value 'Disconnect'. As I'm sure you are aware, a VString will not be plotted in the databrowser as the attempt to 'getValue()' returns NaN (as it cannot be passed as a double or an array of doubles.). Following your suggestion to instead replace the disconnect value with a Double.NaN would satisfy this and mean that these data are not plotted and so the disconnect will show in the databrowser as a gap.

However in testing we found that simply showing a gap for a disconnection was not always sufficient, for example, if the disconnection was very short and the timespan shown in the databrowser was long. That was what motivated us to draw a vertical dashed line at the start of any disconnection period.

If we change the TracePainter to plot this vertical line when Double.isNaN(value) is true instead of using the specific 'Disconnect' string identifier, then we will be plotting this line for every circumstance that the value is NaN. This includes if for some reason the PV is actually set to NaN. Is this what we want to be doing?

It is also worth noting that the String comparison check in the TracePainter only gets called if the Double.isNaN(value) evaluates to true, so it is not called on all or even the majority of data points. That said, it does mean RTPlot is not generic.

What do you think is the best way forward? Assume a PV will not otherwise be set to NaN? Remove the dashed vertical line on the plot completely? Thanks

kasemir commented 2 years ago

Originally the databrowser would just draw a line between points before and after a disconnect

The data browser always showed a gap in the line when there's a gap in the archived data. At least for the original data sources, the channel archiver and the RDB-based archive. The archive appliance data source implementation might have missed that point, but it looks like you've now fixed that part.

Yes, the data is generally not expected to be NaN. That's meant to indicate a condition like disconnected or archive off. You're right that it is possible for a defunct CALC expression to also provide NaN, but that should be rare. In any case, the annotation or sample view can be used to determine why exactly there's a NaN, because it shows the description.

I'm OK with having a preference setting, and by default it may be 'true', to show a vertical dashed line for NaN samples. But there should be no check for "disconnected" in the info, after all what if some site decides to send "unterbrochen" or "pas de connection" for a sample's info?

ralphlange commented 2 years ago

"pas de connexion" would be more probable... but I do agree.

kasemir commented 2 years ago

Moin Ralph, How do you manage to read everything?! In any case, I blame the spell checker....

ralphlange commented 2 years ago

It's the minimum effort necessary to appear to be reading everything...

shroffk commented 2 years ago

The dashed vertical line is a very useful feature (significantly better than a small gap) to easily visualize an interesting condition. I would vote for including it.

I would also think that if an archived PV which is normally suppose to have a numeric value is entering a state where the value is NAN then that is an interesting condition too and it couldn't hurt to draw attention to it using the vertical line. Are there examples of PV set to NAN as part of regular workflow? If the PV value is generally/always NAN... then why is it being plotted in the first place

I think a preference would be an easy way to allow switching back if necessary...but instead of a single preference for the entire product wouldn't it make more sense for this to be an property for each trace? So if there is a PV with lot of NAN's which is filling up the entire plot then you can just turn off the disconnect plotting for that one trace while still showing it for the others.

aawdls commented 2 years ago

Thanks for the discussion so far.

In my opinion there are actually two distinct scenarios which need to be treated separately:

A PV truly has a value NaN, and this value is recorded by the archiver, and returned to the client. This has a number of possible meanings, such as that the inputs to a calculation are out of range, or that a piece of hardware is not present. We do have some use cases
The archiver lost its connection to a PV. This can mean that the IOC went offline, or the network connection was interrupted.

Both of these cases are potentially interesting but they mean different things. If we signify both of these with the same NaN value, then they will both be drawn with a vertical line, and we won't be able to distinguish them in the data browser. Originally, it was our intention that the dashed vertical line only means case 2.

I don't think these concepts are unique to the archiver appliance. Therefore if we can come up with a way to distinguish these two cases it should be applicable to other data sources too.

Is there any other way we could label a value to distinguish it from a "true" NaN, besides setting it to a string? (The value of the string was arbitrary, being a signifier set by the ApplianceValueIterator in CSS when receiving a disconnect event from the Archiver Appliance - so any international value would do instead! :) )

shroffk commented 2 years ago

@kasemir @aawdls we can bring this PR up for discussion today. As i mentioned before overall I think this is a very useful feature

aawdls commented 2 years ago

Simplified implementation as discussed. On testing, appears to work for the optimized case but not when zooming in for raw data. Investigating further.

aawdls commented 2 years ago

Actually, I was mistaken. Both cases are working.

When in Optimized mode, the line appears at the beginning of the bin.

When in Raw mode, the line appears at the location of the disconnect event.

These can be in different places, therefore when switching, the apparent location of the line moves.

I was zooming in to the wrong location.

kasemir commented 2 years ago

When in Optimized mode, the line appears at the beginning of the bin.

I assume this depends on how a specific data source implements the optimization. If the samples in a bin contain values NaN or Inf, what's the average? a) Return one overall "NaN" sample for the whole bin? Time stamped at the start, center or end of bin? b) Ignore non-numeric data, compute min/max/average for the rest? c) Ignore non-numeric data unless it's the only sample the bin, then return that one original sample? d) Pass all non-numeric samples as found, average over the rest

Looks like the archive appliance does a), time stamped at the start of the bin.

For timescaledb, there's an example close to d): https://github.com/ControlSystemStudio/phoebus/blob/16a2d8692cc530371f1ab33ed7af4966886eaa3b/app/databrowser-timescale/postgresqsl/setup.sql#L499

I wouldn't worry too much about it. If there's no data, then it's hard to produce a perfect representation. Users can switch from optimized to raw mode, but in the end you're debugging connection issues, the data is simply incomplete and you can't show that in any nice way. Still, passing the non-numeric samples through to allow debugging what happened might be better than suppressing them.

shroffk commented 1 year ago

@kasemir @aawdls can we merge this?

ControlSystemStudio / cs-studio

Show PV disconnections in the data browser #2713