SpatioTemporal / STAREPandas

STAREpandas adds SpatioTemporal Adaptive Resolution Encoding (STARE) support to pandas DataFrames. https://starepandas.readthedocs.io/en/latest/
MIT License
4 stars 1 forks source link

Advise for pod chucking name formation (example). #147

Open mbauer288 opened 1 year ago

mbauer288 commented 1 year ago

Output of test.py

Oddities with these particular IMERG files

Model simulated dataset that has been interpolated to the IMERG grids (0.1x0.1). Provided by Jiun-Dar Chern on Discover.

From today's zoom meeting I see that unlike IMERG data these are instantaneous values rather than spanning an time interval.

Well, my general question still stands regarding how to encode IMERG files in terms of time. For example, should each time sample overlap? If not how and by how much to I force a separation in terms of STARE time covers? As I have done it below, they are in principle separated by 1 ms. However, they still report as intersecting. So I must be doing something wrong there.

Mike

IMERG Time Standards

For IMERG a single day consists of 48 half-open intervals centered on the hour or half-hour: ['00:00', '00:30', ... '23:00', '23:30']

The half-open intervals are such that the Lower Bound (LB), Center (CR) and Upper Bound (UB) of the interval are written as (LB CR UB], or equally, LB < CR <= UB

For example (using format HH:MM:SS.ms):

    (00:45:00.000 01:00:00.000 01:15:00.000]
                                           (01:15:00.000 01:30:00.000 01:45:00.000]

Reference: Integrated Multi-satellitE Retrievals for GPM (IMERG) Technical Documentation

In practice, I use a 1 millisecond forward-offset to ensure that the LB is half-open or (CR-15m+1ms, CR, CR+15m]

For example,

    (00:45:00.001 01:00:00.000 01:15:00.000]
                                           (01:15:00.001 01:30:00.000 01:45:00.000]

Thus, IMERG intervals defined this way are non-overlapping. Except, when I use pystare.temporal_overlap to test this on the TIV covers I find that they do overlap.

Usage Summary:

Basically, I

  1. For each IMERG file I form three datetimes around a central time to make an half-open interval.
  2. Convert these to numpy datetime int64..
  3. Apply pystare.from_utc_variable() to get a TIV triplet for the interval.
  4. Apply pystare.from_temporal_triple() to this to get a TIV cover for the interval.
  5. Use the hex() or pystare.hex16() (seemingly give the same answer) as the <tcover> part of the chunk_name:
    chunk_name format:
        <pod-root>/<sid>/<tcover>-<dataset-name-pattern>
    where
        <sid>    is the hex representation of the SID
        <tcover> is the hex representation of the TIV covering the temporal range/interval of the granule.

    I also see pystare.format_tpod(), pystare.make_tpod_tuple(), so should I be using these or similar instead?

Does this seem correct?


Example based three consecutive files for the PMOD discover data directory:

    'DYAMONDv2_PE3600x1800-DE.prectot.20200116_0000z.nc4' => <tcover> = 0x1f90220025001c71
    'DYAMONDv2_PE3600x1800-DE.prectot.20200116_0030z.nc4' => <tcover> = 0x1f902207a5001c71
    'DYAMONDv2_PE3600x1800-DE.prectot.20200116_0100z.nc4' => <tcover> = 0x1f90221025001c71

Screen dump

halfstep                = 0 days 00:15:00, <class 'pandas._libs.tslibs.timedeltas.Timedelta'>
offset                  = 0 days 00:00:00.001000, <class 'pandas._libs.tslibs.timedeltas.Timedelta'>
imerg_time_res          = 28, <class 'int'>
imerg_time_res_arr      = [28 28 28], <class 'numpy.int64'>

Reading /Users/mbauer/tmp/data/POMD/discover/202001/DYAMONDv2_PE3600x1800-DE.prectot.20200116_0000z.nc4
    t_val               = 0.0
    t_units             = minutes since 2020-01-16 00:00:00
    t_begin_date        = 20200116
    t_begin_time        = 0
    t_begin_time_str    = 0
    t_time_increment    = 164000

    cr_ts_str           = 2020-01-16 00:00:00, <class 'str'>
    cr_ts               = 2020-01-16 00:00:00, <class 'datetime.datetime'>
    cr_dt               = 2020-01-16 00:00:00, <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    lb_ts               = 2020-01-15 23:45:00.001000, <class 'datetime.datetime'>
    lb_dt               = 2020-01-15 23:45:00.001000, <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    ub_ts               = 2020-01-16 00:15:00, <class 'datetime.datetime'>
    ub_dt               = 2020-01-16 00:15:00, <class 'pandas._libs.tslibs.timestamps.Timestamp'>

    interval_npdt       = [datetime.datetime(2020, 1, 15, 23, 45, 0, 1000), datetime.datetime(2020, 1, 16, 0, 0), datetime.datetime(2020, 1, 16, 0, 15)], type(interval_npdt[0]) = <class 'numpy.datetime64'>
    interval_npdt_int   = [1579131900001, 1579132800000, 1579133700000], type(interval_npdt_int[0]) = <class 'numpy.int64'>
    interval_tiv_triple = [2274354625681316977, 2274355195838209137, 2274355211944336497], type(interval_tiv_triple[0]) = <class 'numpy.int64'>
    interval_tiv_cover  = 2274355195838209137, <class 'numpy.int64'>

    2020-01-16 00:00:00
        TimeStamps    (2020-01-15 23:45:00.001000 <-> 2020-01-16 00:00:00.000000 <-> 2020-01-16 00:15:00.000000)
        TIV triple    (2274354625681316977        <->    2274355195838209137     <->        2274355211944336497)
        TIV cover     (                                  2274355195838209137                                   )
        TIV cover hex (                                   0x1f90220025001c71                                   )
        TIV cover bin (          0b0001111110010000001000100000000000100101000000000001110001110001            )

Reading /Users/mbauer/tmp/data/POMD/discover/202001/DYAMONDv2_PE3600x1800-DE.prectot.20200116_0030z.nc4
    t_val               = 0.0
    t_units             = minutes since 2020-01-16 00:30:00
    t_begin_date        = 20200116
    t_begin_time        = 3000
    t_begin_time_str    = 3000
    t_time_increment    = 164000

    cr_ts_str           = 2020-01-16 00:30:00, <class 'str'>
    cr_ts               = 2020-01-16 00:30:00, <class 'datetime.datetime'>
    cr_dt               = 2020-01-16 00:30:00, <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    lb_ts               = 2020-01-16 00:15:00.001000, <class 'datetime.datetime'>
    lb_dt               = 2020-01-16 00:15:00.001000, <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    ub_ts               = 2020-01-16 00:45:00, <class 'datetime.datetime'>
    ub_dt               = 2020-01-16 00:45:00, <class 'pandas._libs.tslibs.timestamps.Timestamp'>

    interval_npdt       = [datetime.datetime(2020, 1, 16, 0, 15, 0, 1000), datetime.datetime(2020, 1, 16, 0, 30), datetime.datetime(2020, 1, 16, 0, 45)], type(interval_npdt[0]) = <class 'numpy.datetime64'>
    interval_npdt_int   = [1579133700001, 1579134600000, 1579135500000], type(interval_npdt_int[0]) = <class 'numpy.int64'>
    interval_tiv_triple = [2274355211944352881, 2274355228050463857, 2274355244156591217], type(interval_tiv_triple[0]) = <class 'numpy.int64'>
    interval_tiv_cover  = 2274355228050463857, <class 'numpy.int64'>

    2020-01-16 00:30:00
        TimeStamps    (2020-01-16 00:15:00.001000 <-> 2020-01-16 00:30:00.000000 <-> 2020-01-16 00:45:00.000000)
        TIV triple    (2274355211944352881        <->    2274355228050463857     <->        2274355244156591217)
        TIV cover     (                                  2274355228050463857                                   )
        TIV cover hex (                                   0x1f902207a5001c71                                   )
        TIV cover bin (          0b0001111110010000001000100000011110100101000000000001110001110001            )

Reading /Users/mbauer/tmp/data/POMD/discover/202001/DYAMONDv2_PE3600x1800-DE.prectot.20200116_0100z.nc4
    t_val               = 0.0
    t_units             = minutes since 2020-01-16 01:00:00
    t_begin_date        = 20200116
    t_begin_time        = 10000
    t_begin_time_str    = 10000
    t_time_increment    = 164000

    cr_ts_str           = 2020-01-16 01:00:00, <class 'str'>
    cr_ts               = 2020-01-16 01:00:00, <class 'datetime.datetime'>
    cr_dt               = 2020-01-16 01:00:00, <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    lb_ts               = 2020-01-16 00:45:00.001000, <class 'datetime.datetime'>
    lb_dt               = 2020-01-16 00:45:00.001000, <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    ub_ts               = 2020-01-16 01:15:00, <class 'datetime.datetime'>
    ub_dt               = 2020-01-16 01:15:00, <class 'pandas._libs.tslibs.timestamps.Timestamp'>

    interval_npdt       = [datetime.datetime(2020, 1, 16, 0, 45, 0, 1000), datetime.datetime(2020, 1, 16, 1, 0), datetime.datetime(2020, 1, 16, 1, 15)], type(interval_npdt[0]) = <class 'numpy.datetime64'>
    interval_npdt_int   = [1579135500001, 1579136400000, 1579137300000], type(interval_npdt_int[0]) = <class 'numpy.int64'>
    interval_tiv_triple = [2274355244156607601, 2274355264557685873, 2274355280663813233], type(interval_tiv_triple[0]) = <class 'numpy.int64'>
    interval_tiv_cover  = 2274355264557685873, <class 'numpy.int64'>

    2020-01-16 01:00:00
        TimeStamps    (2020-01-16 00:45:00.001000 <-> 2020-01-16 01:00:00.000000 <-> 2020-01-16 01:15:00.000000)
        TIV triple    (2274355244156607601        <->    2274355264557685873     <->        2274355280663813233)
        TIV cover     (                                  2274355264557685873                                   )
        TIV cover hex (                                   0x1f90221025001c71                                   )
        TIV cover bin (          0b0001111110010000001000100001000000100101000000000001110001110001            )

Temporal Overlap between 2020-01-16 00:00:00 and 2020-01-16 00:30:00 = True
Temporal Overlap between 2020-01-16 00:30:00 and 2020-01-16 01:00:00 = True
Temporal Overlap between 2020-01-16 00:00:00 and 2020-01-16 01:00:00 = False

Example code

#! /usr/bin/env python -tt
# -*- coding: utf-8; mode: python -*-
r"""

imergL3
~~~~~~~
"""
# Standard Imports
from datetime import datetime, timezone, timedelta

# Third-Party Imports
import numpy
import pandas
import netCDF4

# STARE Imports
import pystare
import starepandas

# Local Imports

##
# Markup Language Specification (see Google Python Style Guide https://google.github.io/styleguide/pyguide.html)
__docformat__ = "Google en"
# ---

##
# IMEG time interval info
imerg_tstep = 1800000       # 30 minutes in milliseconds
imerg_halfstep = 9e+11      # 15 minutes in nanoseconds
imerg_toffset = 1e+6        # 1 millisecond in nanoseconds

halfstep = pandas.Timedelta(imerg_halfstep, unit ='ns')
offset = pandas.Timedelta(imerg_toffset, unit ='ns')
print(f"halfstep                = {halfstep}, {type(halfstep)}")
print(f"offset                  = {offset}, {type(offset)}")

##
# Determine the STARE time resolution
times = numpy.array([imerg_tstep], dtype=numpy.int64)
time_res_array = pystare.coarsest_resolution_finer_or_equal_ms(times)
imerg_time_res = int(time_res_array[0])
imerg_time_res_arr = numpy.array([imerg_time_res, imerg_time_res, imerg_time_res], dtype=numpy.int64)
print(f"imerg_time_res          = {imerg_time_res}, {type(imerg_time_res)}")
print(f"imerg_time_res_arr      = {imerg_time_res_arr}, {type(imerg_time_res_arr[0])}")

pod_cover_tivs = []
pod_ts_str = []
for idx, ifile in enumerate(["/Users/mbauer/tmp/data/POMD/discover/202001/DYAMONDv2_PE3600x1800-DE.prectot.20200116_0000z.nc4",
                             "/Users/mbauer/tmp/data/POMD/discover/202001/DYAMONDv2_PE3600x1800-DE.prectot.20200116_0030z.nc4",
                             "/Users/mbauer/tmp/data/POMD/discover/202001/DYAMONDv2_PE3600x1800-DE.prectot.20200116_0100z.nc4"]):
    ##
    # Grab time data
    print(f"\nReading {ifile}")
    nc_file = netCDF4.Dataset(ifile, mode='r', format='NETCDF4')

    ##
    # Pull from netCDF4 into numpy variables
    time_var = nc_file.variables['time']
    t_val = time_var[:][0]
    t_units = time_var.units
    t_begin_date = time_var.begin_date
    t_begin_time = time_var.begin_time
    t_begin_time_str = str(t_begin_time)
    len_t_begin_time_str = len(t_begin_time_str)
    t_time_increment = time_var.time_increment
    print(f"\tt_val               = {t_val}")
    print(f"\tt_units             = {t_units}")
    print(f"\tt_begin_date        = {t_begin_date}")
    print(f"\tt_begin_time        = {t_begin_time}")
    print(f"\tt_begin_time_str    = {t_begin_time_str}")
    print(f"\tt_time_increment    = {t_time_increment}\n")

    ##
    # Use time units string as the CR datetime.
    cr_ts_str = t_units[14:]
    print(f"\tcr_ts_str           = {cr_ts_str}, {type(cr_ts_str)}")

    ##
    # Form the CR datetime
    cr_ts = datetime.strptime(cr_ts_str, '%Y-%m-%d %H:%M:%S')
    print(f"\tcr_ts               = {cr_ts}, {type(cr_ts)}")

    ##
    # Make it a Pandas TimeStamp
    cr_dt = pandas.to_datetime(cr_ts, utc=False, unit='ns')
    print(f"\tcr_dt               = {cr_dt}, {type(cr_dt)}")

    ##
    # Find the half-open Lower Bound (LB)
    lb_ts = cr_ts - halfstep + offset
    print(f"\tlb_ts               = {lb_ts}, {type(lb_ts)}")
    lb_dt = pandas.to_datetime(lb_ts, utc=False, unit='ns')
    print(f"\tlb_dt               = {lb_dt}, {type(lb_dt)}")

    ##
    # Find the closed Upper Bound (UB)
    ub_ts = cr_ts + halfstep
    print(f"\tub_ts               = {ub_ts}, {type(ub_ts)}")
    ub_dt = pandas.to_datetime(ub_ts, utc=False, unit='ns')
    print(f"\tub_dt               = {ub_dt}, {type(ub_dt)}\n")

    ##
    # Create an interval
    interval = [lb_dt, cr_dt, ub_dt]

    ##
    # Convert interval to numpy datetimes
    interval_npdt = numpy.array(interval, dtype='datetime64[ms]')
    print(f"\tinterval_npdt       = {interval_npdt.tolist()}, {type(interval_npdt[0]) = }")

    ##
    # Convert to 64-bit integer representation (based on milliseconds & UTC)
    interval_npdt_int = numpy.array(interval, dtype='datetime64[ms]').astype(numpy.int64)
    print(f"\tinterval_npdt_int   = {interval_npdt_int.tolist()}, {type(interval_npdt_int[0]) = }")

    ##
    # Convert integer representation to TIV triplet
    interval_tiv_triple = pystare.from_utc_variable(interval_npdt_int, imerg_time_res_arr, imerg_time_res_arr)
    print(f"\tinterval_tiv_triple = {interval_tiv_triple.tolist()}, {type(interval_tiv_triple[0]) = }")

    ##
    # Calculate an interval TIV (cover).
    interval_tiv_cover = pystare.from_temporal_triple(interval_tiv_triple)[0]
    print(f"\tinterval_tiv_cover  = {interval_tiv_cover}, {type(interval_tiv_cover)}")

    ##
    # Summary
    print("\n")
    print(f"\t{cr_ts_str}")
    print(f"\t\tTimeStamps    ({interval[0].strftime('%Y-%m-%d %H:%M:%S.%f'):<26} <-> {interval[1].strftime('%Y-%m-%d %H:%M:%S.%f'):^26} <-> {interval[2].strftime('%Y-%m-%d %H:%M:%S.%f'):>26})")
    print(f"\t\tTIV triple    ({interval_tiv_triple[0]:<26d} <-> {interval_tiv_triple[1]:^26d} <-> {interval_tiv_triple[2]:>26d})")
    print(f"\t\tTIV cover     ({'':<26s}     {interval_tiv_cover:^26d}     {'':>26s})")
    print(f"\t\tTIV cover hex ({'':<26s}     {hex(interval_tiv_cover):^26}     {'':>26s})")
    print(f"\t\tTIV cover bin ({'':<10s}0b{numpy.binary_repr(interval_tiv_cover, width=64):64s}{'':>12s})")

    pod_ts_str.append(cr_ts_str)
    pod_cover_tivs.append(interval_tiv_cover)

##
# Compare if overlapping
overlap = pystare.temporal_overlap(numpy.array([pod_cover_tivs[0]]), numpy.array([pod_cover_tivs[1]]))
print(f"\nTemporal Overlap between {pod_ts_str[0]} and {pod_ts_str[1]} = {bool(overlap[0])}")

overlap = pystare.temporal_overlap(numpy.array([pod_cover_tivs[1]]), numpy.array([pod_cover_tivs[2]]))
print(f"Temporal Overlap between {pod_ts_str[1]} and {pod_ts_str[2]} = {bool(overlap[0])}")

overlap = pystare.temporal_overlap(numpy.array([pod_cover_tivs[0]]), numpy.array([pod_cover_tivs[2]]))
print(f"Temporal Overlap between {pod_ts_str[0]} and {pod_ts_str[2]} = {bool(overlap[0])}")
michaelleerilee commented 1 year ago

This is likely because there is a finite, irregular resolution with which intervals are covered with STARE tids. The forward and reverse resolutions each point to a bit in the representation. When a tid is constructed from a triple (t0,t1,t2), t1 is used to set the temporal location bits. Then the reverse resolution is set so that a decrement associated with that resolution's bit position is less than t0 (for the case where a cover is desired). Similarly, the forward resolution is set so that an increment at the associated bit position is greater than t1. Therefore, the actual lower and upper bounds that one can calculate from the above tid (a single 64-bit integer) are generally not t0 and t2. Generally, we've set the tid to cover the time interval of the granule, which leads to "overestimates" in joins or search queries.

michaelleerilee commented 1 year ago

One can check this by "round-tripping" from a temporal triple to a tid/tiv and back again.

This also means that when doing searches, the results using only the tid will be approximate, and there will need to be another step or pass for a more accurate comparison.

Part of this is due to putting three temporal instants into a single 64-bit integer. (If I'm gauging your issue correctly.)

michaelleerilee commented 1 year ago

We could have a different name schema, which includes more precise interval endpoint info... If warranted, but you start to move away from the tiv-cover idea.

michaelleerilee commented 1 year ago

You probably want to take a look at write_pods_granule in staredataframe.py. It assumes ts_start and ts_end columns in the granule data from from which it calculates a tid cover for the chunk name. This is aligned with the way starepandas searches for chunks.