NSLS-II / arvpyf

arvpyf: Archiver Python Frontend
https://nsls-ii.github.io/arvpyf
BSD 3-Clause "New" or "Revised" License

ENH for better usability and assurances of correct time-stamping FOR RETRIEVAL ONLY #2

Open ambarb opened 3 years ago

ambarb commented 3 years ago

issue description

The current implementation of this functionality is far less flexible than the methodology I built two years ago, which was inspired by the inner workings of this library. My reimplementation of the core function under discussion, arvReader.get(), let me retrieve both millisecond and second time for more than 1 PV at a time. But with all the recent changes to our systems, I cannot use a databroker that "works" with my hacked implementation.

https://nsls-ii.github.io/arvpyf/retrieval.html#data-retrieval

specific pain points

  1. the time arguments only allow strings, but searching catalogs or headers (databroker V2 or V1) allows both. However, things like header.start.time are only epoch floats, so right away someone has to start converting timestamps, and it is difficult to make the "right" conversion decisions
  2. the returned data is also in this string format, but inside a pandas object, which makes plotting against other data MORE DIFFICULT
  3. only 1 PV at a time can be retrieved. Typically, more than 1 PV is needed in order to compare with the real beamline. On the old systems, retrieving the PVs one at a time had a huge overhead in connecting to the server. Maybe it is different now.

For item 3, things are made more difficult for the reasons described below.

reasons for pain points

I am not sure if people are collecting requirements or not, but here is what I would recommend for people wanting to supplement beamline experiments with archiver data, which is a must for CSX. The amount of time and user input required to retrieve archiver data using CSS or Phoebus is not scalable.

@danielballan and @tankonst confirmed that integrating this library with databroker introduces a 4-hour time difference. I found this problem with my own implementation, and they confirmed it separately with much simpler code because they relied on pandas (as I was relying on epoch time).
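For anyone reproducing this, here is a minimal sketch (my own, not arvpyf code) of how easily a 4-hour shift appears when one side of a comparison interprets epoch seconds as local time and the other as UTC:

# Minimal sketch of the offset: the same epoch instant rendered as a naive
# local time vs. explicit UTC. On a US/Eastern machine during daylight
# saving time these two printed wall-clock strings differ by 4 hours.
from datetime import datetime, timezone

ts = 1622057400.0  # example epoch seconds (May 2021)
print(datetime.fromtimestamp(ts))                   # naive local wall-clock time
print(datetime.fromtimestamp(ts, tz=timezone.utc))  # timezone-aware UTC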

suggested solutions

Aside from this being an opportunity to collect requirements from a broad audience, I would recommend the following updates to this library:

  1. query and return data with string or epoch times (see the sketch after this list)
  2. do not force the query "since" and "until" times to be the run "start" and "stop" times - this should be chosen by the user
  3. the dataframe returned by arvReader.get starts with an index of 0, which is not consistent with the pandas series/dataframes returned by databroker V1
  4. it would be nice if we could optimize for more than 1 PV, but maybe the new IT systems are much faster now and we need to force 1 PV at a time to prevent bad things from happening to the network
  5. add tests to ensure that time conversions for strings and epoch floats are not compromised after this issue is addressed; the tests may need to live at a different level, not directly in this library
  6. make sure databroker, the Archiver API, and Olog/CSS are all consistent with timestamping. MAYBE just documenting which functions and arguments to use would be sufficient if "coding" this is problematic
  7. LEAST IMPORTANT: solve the problem of units (apparently the archiver doesn't know them; CSS gets them from EPICS as I understand it)
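As a sketch of update 1, a small wrapper could accept either form; normalize_time is a hypothetical name, not part of arvpyf:

from datetime import datetime

def normalize_time(t, fmt="%Y-%m-%d %H:%M:%S"):
    # Hypothetical helper: accept epoch seconds (int/float) or an
    # already-formatted string, and return the string form arvpyf expects.
    if isinstance(t, (int, float)):
        return datetime.fromtimestamp(t).strftime(fmt)
    return t

# usage: arvReader.get(pv, normalize_time(since), normalize_time(until))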

illustration of the time conversion issue, which isn't easy to get right:

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt   # needed for the plotting below
from arvpyf.ar import ArchiverReader
from databroker import Broker

db = Broker.named('csx')  # beamline databroker catalog (db was undefined in the original snippet)

def get_pv(since, until, pv):
    since_ts = since
    until_ts = until

    ###### BELOW CONTRIBUTED BY DAN AND TATIANA
    # TODO Worry about timezones.

    # arvReader.get only accepts dates as strings formatted as specified below,
    # so we have to convert, just for it to convert back. Cool cool cool. THANKS 
    since_str = datetime.fromtimestamp(since_ts).strftime("%Y-%m-%d %H:%M:%S")
    until_str = datetime.fromtimestamp(until_ts).strftime("%Y-%m-%d %H:%M:%S")

    return arvReader.get(pv, since_str, until_str)

URL = 'http://archiver.csx.nsls2.bnl.local:17668'
arvReader = ArchiverReader({'url': URL, 'timezone': 'US/Eastern'})
pv = 'XF:23ID1-ES{TCtrl:1-Chan:A}T-I'
pv_dict = {"sampleTa": pv}  # friendly-name lookup used below (undefined in the original snippet)

scans = [141796, 141798, 141805, 141810]

df = get_pv(db[scans[0]].start["time"], db[scans[-1]].stop["time"], pv_dict["sampleTa"])  # converts epoch to string just to GET the data
fig, ax = plt.subplots()
ax.plot(df["time"],df["data"])

for s in scans:
    h=db[s]
    Ta = h.table('baseline')['stemp_temp_A_T']
    since_time = pd.to_datetime(h.start["time"], unit="s")      # epoch seconds -> pandas Timestamp
    until_time = pd.to_datetime(h.stop["time"], unit="s")       # epoch seconds -> pandas Timestamp
    ax.plot(since_time,Ta[1],'o', mfc='none', markersize = 8, label=s)
    ax.plot(until_time,Ta[2],'o', color = plt.gca().lines[-1].get_color(), markersize = 8)
ax.legend()

print(f'{df["time"][0]}')
print(f'{pd.to_datetime(db[scans[0]].start["time"], unit="s")}' )

[Screenshots (2021-05-24, 2021-05-26): the resulting plot and the printed timestamp comparison]

ambarb commented 3 years ago

I noticed that df.time[0] is ~2 seconds before the start document. Perhaps the 0th index is meant to be the first point the archiver recorded just prior to the since timestamp argument. If so, this is a nice feature, but it should be better explained if people want to rely on it.
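If that leading pre-"since" sample is ever unwanted, a small filter can drop it. This is my own sketch using the column names from the snippets here, and it assumes df["time"] and the cutoff share the same timezone handling:

import pandas as pd

# Sketch: drop archiver samples recorded before the requested "since" time.
# Localize one side first if df["time"] is timezone-aware and the cutoff is not.
cutoff = pd.to_datetime(since_str)
df = df[df["time"] >= cutoff].reset_index(drop=True)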

The workaround I have now that meets the minimum need:

from datetime import datetime  # arvReader is the ArchiverReader instance from the first snippet

def get_pv(since, until, pv, return_epoch=True):
    since_ts = since
    until_ts = until

    ###### BELOW CONTRIBUTED BY DAN AND TATIANA
    # TODO Worry about timezones if returning timestamps only
    # arvReader.get only accepts dates as strings formatted as specified below,
    # so we have to convert, just for it to convert back. 
    since_str = datetime.fromtimestamp(since_ts).strftime("%Y-%m-%d %H:%M:%S")
    until_str = datetime.fromtimestamp(until_ts).strftime("%Y-%m-%d %H:%M:%S")
    df = arvReader.get(pv, since_str, until_str)

    if return_epoch:
        df["time"] = df.time.astype('Int64') / 1e9
    return df

No other special handling, and no lost 4 hours, if epoch time is returned from the archiver with the function above. It just works with timestamps as retrieved by databroker '1.2.3'.
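A usage sketch (assuming db, arvReader, and pv as defined in the first snippet):

# With return_epoch=True both sides are epoch seconds, so they compare directly.
df = get_pv(db[141796].start["time"], db[141796].stop["time"], pv)
print(df["time"][0], db[141796].start["time"])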

ambarb commented 3 years ago

Using the beamline server to access data from the beamline archiver or the accelerator archiver is really fast in this single-PV-at-a-time mode after RE-IP. The slow part is now plot rendering in the notebook over an ssh tunnel.
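In that mode, covering the multi-PV case is just a loop on the client side; a sketch reusing pv_dict from the first snippet:

# Sketch: retrieve several PVs one at a time, keyed by friendly name.
# since/until are epoch times as accepted by get_pv above.
dfs = {name: get_pv(since, until, pv_name) for name, pv_name in pv_dict.items()}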

ambarb commented 3 years ago


The factor of 1e9 is because data recorded by bluesky and retrieved by databroker is in nanoseconds, not seconds. Should databroker be using nanoseconds?

But to convert this to a timestamp it is table.time.astype('int64')/1e9, which is different for the archiver: df_archiver.time.astype('Int64')/1e9.

The difference is lower-case vs. upper-case "i" / "I". This was not a trivial thing for me to figure out.
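For context, a generic pandas demonstration of that distinction (separate from the archiver code): 'int64' is the plain NumPy dtype, while 'Int64' is pandas' nullable extension dtype that can hold missing values.

import pandas as pd

s = pd.Series([1.0, 2.0, None])
print(s.astype("Int64"))  # nullable dtype: the missing value becomes <NA>
# s.astype("int64")       # raises: plain int64 cannot represent missing values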