NSLS-II / arvpyf

arvpyf: Archiver Python Frontend
https://nsls-ii.github.io/arvpyf
BSD 3-Clause "New" or "Revised" License

ENH for better usability and assurances of correct time-stamping FOR RETRIEVAL ONLY #2

Open ambarb opened 3 years ago

ambarb commented 3 years ago

issue description

The current implementation of this functionality is far less flexible than the methodology I built two years ago, which was inspired by the inner workings of this library. My reimplementation of the core function under discussion, arvReader.get(), let me retrieve both millisecond and second time for more than 1 PV at a time. But with all the recent changes to our systems, I cannot use a databroker that "works" with my hacked implementation.

https://nsls-ii.github.io/arvpyf/retrieval.html#data-retrieval

specific pain points

  1. the time arguments only allow strings, but searching catalogs or headers (databroker V2 or V1) allows both. However, things like header.start.time are only epoch floats, so right away someone has to start converting timestamps, and it is difficult to make the "right" conversion decisions
  2. the returned data is also in this string format, but inside a pandas object, which makes plotting against other data MORE DIFFICULT
  3. only 1 PV at a time can be retrieved. Typically, more than 1 PV is needed in order to compare with the real beamline. On the old systems, retrieving the PVs one at a time had a huge overhead in connecting to the server. Maybe it is different now.

For item 3, things are made more difficult for the reasons described below.

reasons for pain points

I am not sure if people are collecting requirements or not, but here is what I would recommend for people wanting to supplement beamline experiments with archiver data, which is a must for CSX. The amount of time and user input required to retrieve archiver data using CSS or Phoebus is not scalable.

@danielballan and @tankonst confirmed that integrating this library with databroker introduces a 4-hour time difference. I found this problem with my own implementation, and they confirmed it separately with much simpler code because they relied on pandas (as I was relying on epoch time).
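For anyone reproducing this, here is a minimal sketch (my own, not arvpyf code) of how easily a 4-hour shift appears when one side of a comparison interprets epoch seconds as local time and the other as UTC:

# Minimal sketch of the offset: the same epoch instant rendered as a naive
# local time vs. explicit UTC. On a US/Eastern machine during daylight
# saving time these two printed wall-clock strings differ by 4 hours.
from datetime import datetime, timezone

ts = 1622057400.0  # example epoch seconds (May 2021)
print(datetime.fromtimestamp(ts))                   # naive local wall-clock time
print(datetime.fromtimestamp(ts, tz=timezone.utc))  # timezone-aware UTC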

suggested solutions

Aside from this being an opportunity to collect requirements from a broad audience, I would recommend the following updates to this library:

  1. query and return data with string or epoch times (see the sketch after this list)
  2. do not force the query "since" and "until" times to be the run "start" and "stop" times - this should be chosen by the user
  3. the dataframe returned by arvReader.get starts with an index of 0, which is not consistent with the pandas series/dataframes returned by databroker V1
  4. it would be nice if we could optimize for more than 1 PV, but maybe the new IT systems are much faster now and we need to force 1 PV at a time to prevent bad things from happening to the network
  5. add tests to ensure that time conversions for strings and epoch floats are not compromised after this issue is addressed; the tests may need to live at a different level, not directly in this library
  6. make sure databroker, the Archiver API, and Olog/CSS are all consistent with timestamping. MAYBE just documenting which functions and arguments to use would be sufficient if "coding" this is problematic
  7. LEAST IMPORTANT: solve the problem of units (apparently the archiver doesn't know them; CSS gets them from EPICS as I understand it)
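As a sketch of update 1, a small wrapper could accept either form; normalize_time is a hypothetical name, not part of arvpyf:

from datetime import datetime

def normalize_time(t, fmt="%Y-%m-%d %H:%M:%S"):
    # Hypothetical helper: accept epoch seconds (int/float) or an
    # already-formatted string, and return the string form arvpyf expects.
    if isinstance(t, (int, float)):
        return datetime.fromtimestamp(t).strftime(fmt)
    return t

# usage: arvReader.get(pv, normalize_time(since), normalize_time(until))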

illustration of the time conversion issue, which isn't easy to get right:

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt   # needed for the plotting below
from arvpyf.ar import ArchiverReader
from databroker import Broker

db = Broker.named('csx')  # beamline databroker catalog (db was undefined in the original snippet)

def get_pv(since, until, pv):
    since_ts = since
    until_ts = until

    ###### BELOW CONTRIBUTED BY DAN AND TATIANA
    # TODO Worry about timezones.

    # arvReader.get only accepts dates as strings formatted as specified below,
    # so we have to convert, just for it to convert back. Cool cool cool. THANKS 
    since_str = datetime.fromtimestamp(since_ts).strftime("%Y-%m-%d %H:%M:%S")
    until_str = datetime.fromtimestamp(until_ts).strftime("%Y-%m-%d %H:%M:%S")

    return arvReader.get(pv, since_str, until_str)

URL = 'http://archiver.csx.nsls2.bnl.local:17668'
arvReader = ArchiverReader({'url': URL, 'timezone': 'US/Eastern'})
pv = 'XF:23ID1-ES{TCtrl:1-Chan:A}T-I'
pv_dict = {"sampleTa": pv}  # friendly-name lookup used below (undefined in the original snippet)

scans = [141796, 141798, 141805, 141810]

df = get_pv(db[scans[0]].start["time"], db[scans[-1]].stop["time"], pv_dict["sampleTa"])  # converts epoch to string just to GET the data
fig, ax = plt.subplots()
ax.plot(df["time"],df["data"])

for s in scans:
    h=db[s]
    Ta = h.table('baseline')['stemp_temp_A_T']
    since_time = pd.to_datetime(h.start["time"], unit="s")      # epoch seconds -> pandas Timestamp
    until_time = pd.to_datetime(h.stop["time"], unit="s")       # epoch seconds -> pandas Timestamp
    ax.plot(since_time,Ta[1],'o', mfc='none', markersize = 8, label=s)
    ax.plot(until_time,Ta[2],'o', color = plt.gca().lines[-1].get_color(), markersize = 8)
ax.legend()

print(f'{df["time"][0]}')
print(f'{pd.to_datetime(db[scans[0]].start["time"], unit="s")}' )

[Screenshots (2021-05-24, 2021-05-26): the resulting plot and the printed timestamp comparison]

ambarb commented 3 years ago

I noticed that df.time[0] is ~2 seconds before the start document. Perhaps the 0th index is meant to be the first point the archiver recorded just prior to the since timestamp argument. If so, this is a nice feature, but it should be better explained if people want to rely on it.
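If that leading pre-"since" sample is ever unwanted, a small filter can drop it. This is my own sketch using the column names from the snippets here, and it assumes df["time"] and the cutoff share the same timezone handling:

import pandas as pd

# Sketch: drop archiver samples recorded before the requested "since" time.
# Localize one side first if df["time"] is timezone-aware and the cutoff is not.
cutoff = pd.to_datetime(since_str)
df = df[df["time"] >= cutoff].reset_index(drop=True)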

The workaround I have now that meets the minimum need:

from datetime import datetime  # arvReader is the ArchiverReader instance from the first snippet

def get_pv(since, until, pv, return_epoch=True):
    since_ts = since
    until_ts = until

    ###### BELOW CONTRIBUTED BY DAN AND TATIANA
    # TODO Worry about timezones if returning timestamps only
    # arvReader.get only accepts dates as strings formatted as specified below,
    # so we have to convert, just for it to convert back. 
    since_str = datetime.fromtimestamp(since_ts).strftime("%Y-%m-%d %H:%M:%S")
    until_str = datetime.fromtimestamp(until_ts).strftime("%Y-%m-%d %H:%M:%S")
    df = arvReader.get(pv, since_str, until_str)

    if return_epoch:
        df["time"] = df.time.astype('Int64') / 1e9
    return df

No other special handling, and no lost 4 hours, if epoch time is returned from the archiver with the function above. It just works with timestamps as retrieved by databroker '1.2.3'.
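A usage sketch (assuming db, arvReader, and pv as defined in the first snippet):

# With return_epoch=True both sides are epoch seconds, so they compare directly.
df = get_pv(db[141796].start["time"], db[141796].stop["time"], pv)
print(df["time"][0], db[141796].start["time"])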

ambarb commented 3 years ago

Using the beamline server to access data from the beamline archiver or the accelerator archiver is really fast in this single-PV-at-a-time mode after RE-IP. The slow part is now plot rendering in the notebook over an ssh tunnel.
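In that mode, covering the multi-PV case is just a loop on the client side; a sketch reusing pv_dict from the first snippet:

# Sketch: retrieve several PVs one at a time, keyed by friendly name.
# since/until are epoch times as accepted by get_pv above.
dfs = {name: get_pv(since, until, pv_name) for name, pv_name in pv_dict.items()}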

ambarb commented 3 years ago


The factor of 1e9 is because data recorded by bluesky and retrieved by databroker is in nanoseconds, not seconds. Should databroker be using nanoseconds?

But to convert this to a timestamp it is table.time.astype('int64')/1e9, which is different for the archiver: df_archiver.time.astype('Int64')/1e9.

The difference is lower-case vs. upper-case "i" / "I". This was not a trivial thing for me to figure out.
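For context, a generic pandas demonstration of that distinction (separate from the archiver code): 'int64' is the plain NumPy dtype, while 'Int64' is pandas' nullable extension dtype that can hold missing values.

import pandas as pd

s = pd.Series([1.0, 2.0, None])
print(s.astype("Int64"))  # nullable dtype: the missing value becomes <NA>
# s.astype("int64")       # raises: plain int64 cannot represent missing values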