Open lbooker42 opened 4 months ago
Other count operations like count_null, count_nan, etc. would be useful.
As we have done in other cases, null values should be ignored, and NaN values are included -- typically resulting in poisoning.
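A minimal plain-Python sketch of these semantics, not Deephaven code. It assumes `None` stands in for a null, that infinities count as positive or negative, and that a null row in a cumulative operator repeats the previous result; the function names `count_ops` and `cum_avg` are used here for illustration only.

```python
import math
from typing import List, Optional, Sequence

Value = Optional[float]  # None stands in for a null (assumption)


def count_ops(values: Sequence[Value]) -> dict:
    """Tally each value into the proposed count_* categories."""
    names = ("count_null", "count_nan", "count_inf", "count_finite",
             "count_pos", "count_neg", "count_zero")
    counts = {name: 0 for name in names}
    for v in values:
        if v is None:
            counts["count_null"] += 1  # nulls are ignored by every other count
            continue
        if math.isnan(v):
            counts["count_nan"] += 1   # NaN is neither pos, neg, zero, inf, nor finite
            continue
        if math.isinf(v):
            counts["count_inf"] += 1
        else:
            counts["count_finite"] += 1
        if v > 0:
            counts["count_pos"] += 1
        elif v < 0:
            counts["count_neg"] += 1
        else:
            counts["count_zero"] += 1  # +0.0 and -0.0 both land here
    return counts


def cum_avg(values: Sequence[Value]) -> List[Value]:
    """Running average: nulls are skipped, NaN poisons every later result."""
    out: List[Value] = []
    total, n = 0.0, 0
    for v in values:
        if v is not None:
            total += v  # once a NaN is added, total stays NaN
            n += 1
        out.append(total / n if n else None)
    return out
```

For example, `cum_avg([1.0, None, 3.0, float("nan"), 5.0])` yields `1.0, 1.0, 2.0` and then NaN for every remaining row.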
Below is an attempt at a more comprehensive and carefully curated list.
As has been the case for other operations:
null values are ignored in calculations.
NaN values are included in calculations. Typically, this means that NaN poisons results, so the operator will return NaN after seeing a NaN.
+0.0 and -0.0 are considered to be the same and equivalent.
Operators have a few different contexts:
cum_avg
cum_wavg
cum_std
cum_count https://github.com/deephaven/deephaven-core/pull/6270
cum_formula ?
cum_group
delta_pct (Naming seems more consistent with the existing delta than the originally proposed pct_change)
count_neg + https://github.com/deephaven/deephaven-core/pull/6270
count_pos + https://github.com/deephaven/deephaven-core/pull/6270
count_zero + https://github.com/deephaven/deephaven-core/pull/6270
count_null + https://github.com/deephaven/deephaven-core/pull/6270
count_nan + https://github.com/deephaven/deephaven-core/pull/6270
count_inf + https://github.com/deephaven/deephaven-core/pull/6270
count_finite + https://github.com/deephaven/deephaven-core/pull/6270
first !
last !
offset !
median
rank
percentile (pct may be a name more consistent with agg)
abs_sum
abs_avg
abs_wavg +
wstd
ste
wste
var
wvar
tstat
wtstat
skew *+
kurtosis *+
cov *
cor
The following are present in agg, but they may not be worth adding to the other cases until there is demand. They need some discussion.
distinct
unique
sorted_first
sorted_last
(?) There will be some debate about whether this method should be implemented, because of efficiency concerns.
(*) May involve some tricky, careful numerics to compute good values. Need to be careful in defining the calculation.
(+) Not yet implemented in Numerics.ftl
(!) There has been some discussion around these operations with @rcaudy and @chipkent. cum_first/cum_last are the same as first_by/last_by, so there is an argument to not include them. offset is proposed as a way to get a value at a specific index or time offset instead of having a first/last operator. For time offsets, there needs to be a way to disambiguate if there are multiple values with the same time offset. offset would not be supported by agg, but first and last would.
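To make the disambiguation question concrete, here is a rough plain-Python sketch of what an index-based and a time-based offset could look like. Everything in it is an assumption for discussion: the function names, the "latest row at or before the target time" rule, and the tie-break (last matching row, never looking past the current row) are not taken from any existing API.

```python
from bisect import bisect_right
from typing import Any, Optional, Sequence


def offset_by_index(values: Sequence[Any], row: int, k: int) -> Optional[Any]:
    """Value k rows before `row` (k=0 is the current row); None if out of range."""
    i = row - k
    return values[i] if 0 <= i < len(values) else None


def offset_by_time(times: Sequence[int], values: Sequence[Any],
                   row: int, dt: int) -> Optional[Any]:
    """Value at the latest row whose timestamp is <= times[row] - dt.

    Assumes `times` is sorted ascending and dt >= 0. Only rows up to and
    including `row` are searched. When several rows share the matching
    timestamp, the last of them is returned (an arbitrary tie-break).
    """
    target = times[row] - dt
    i = bisect_right(times, target, 0, row + 1) - 1
    return values[i] if i >= 0 else None
```

With `times = [10, 20, 20, 30, 40]`, an offset of 20 from the last row matches both rows stamped 20, and this sketch returns the second of them. A different tie-break (first match, or an error) would be equally defensible, which is the ambiguity that needs a decision.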
Details on computing skewness and excess kurtosis can be found at:
We want the sample skewness and sample excess kurtosis. The formulae used by Excel, SAS, etc. have probably been well vetted.
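For reference, a two-pass plain-Python sketch of the bias-adjusted sample skewness and sample excess kurtosis, using the formulas documented for Excel's SKEW and KURT. The function names are illustrative, `None` stands in for null, and returning NaN for zero variance is a choice made here (Excel reports an error instead). A streaming implementation would need more careful numerics than this.

```python
import math
from typing import Optional, Sequence


def _mean_and_std(xs):
    """Mean and sample standard deviation (n - 1 denominator)."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return mean, s


def sample_skew(values: Sequence[Optional[float]]) -> Optional[float]:
    """n / ((n-1)(n-2)) * sum(((x - mean) / s)^3); nulls ignored, NaN poisons."""
    xs = [v for v in values if v is not None]
    n = len(xs)
    if n < 3:
        return None  # not enough data
    mean, s = _mean_and_std(xs)
    if s == 0:
        return float("nan")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in xs)


def sample_excess_kurtosis(values: Sequence[Optional[float]]) -> Optional[float]:
    """Bias-adjusted excess kurtosis; nulls ignored, NaN poisons."""
    xs = [v for v in values if v is not None]
    n = len(xs)
    if n < 4:
        return None  # not enough data
    mean, s = _mean_and_std(xs)
    if s == 0:
        return float("nan")
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * sum(((x - mean) / s) ** 4 for x in xs) - correction
```

As a sanity check, the values 1 through 5 give a skewness of 0 and an excess kurtosis of -1.2.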
Details on computing the sample covariance can be found at:
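Independently of that reference, one commonly used numerically stable approach is a single-pass, Welford-style update of the co-moment. The sketch below is plain Python under stated assumptions: `sample_cov` is an illustrative name, `None` stands in for null, and a pair is skipped when either side is null (how nulls in only one column should be treated is a decision that still needs to be made).

```python
from typing import Optional, Sequence


def sample_cov(xs: Sequence[Optional[float]],
               ys: Sequence[Optional[float]]) -> Optional[float]:
    """Single-pass sample covariance with an n - 1 denominator."""
    n = 0
    mean_x = 0.0
    mean_y = 0.0
    co_moment = 0.0  # running sum of (x - mean_x) * (y - mean_y)
    for x, y in zip(xs, ys):
        if x is None or y is None:
            continue  # skip the pair if either value is null (assumption)
        n += 1
        dx = x - mean_x            # deviation from the OLD mean of x
        mean_x += dx / n
        mean_y += (y - mean_y) / n
        co_moment += dx * (y - mean_y)  # deviation from the NEW mean of y
    # A NaN in either column propagates through the means and poisons the result.
    return co_moment / (n - 1) if n > 1 else None
```

For `xs = [1, 2, 3, 4]` and `ys = [2, 4, 6, 8]` this returns 10/3, which matches the two-pass definition.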
To support production use cases, we need the following operators (also found in #4424):
But also needed are the following (supported by pandas / Polars):