awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Arrays With Named Axes #1864

Open jaheba opened 2 years ago

jaheba commented 2 years ago

WIP


Tensor

The core idea is to have a wrapper around nd-arrays where dimensions are referenced not by integers but by single-letter names:

TimeSeries = Tensor["T"]
Categories = Tensor["C"]

Features = Tensor["CT"]

Here TimeSeries and Categories are both 1-D arrays, but their dimensions are incompatible with each other. Features is a 2-D array:

>>> TimeSeries(np.array([1, 2, 3]))
Tensor<T=3>

>>> Categories(np.array([1, 2, 3]))
Tensor<C=3>

>>> Features(np.array([[4, 5, 6], [7, 8, 9]]))
Tensor<C=2, T=3>

When we operate on these tensors, we need to specify which axes we want to operate on:

>>> feat = Features(np.array([[4, 5, 6], [7, 8, 9]]))

>>> feat.C[1]
Tensor<T=3>

>>> feat.T[:2]
Tensor<C=2, T=2>

>>> np.sum(feat, axis="T")
Tensor<C=2>

>>> np.sum(feat, axis="C")
Tensor<T=3>
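The wrapper idea above can be sketched in a few lines. This is a hypothetical minimal implementation (not the actual gluonts code): axis names are baked in via `__class_getitem__`, and a `sum` method resolves a name to a positional axis. Hooking `np.sum(feat, axis="T")` itself would additionally require the `__array_function__` protocol, which this sketch skips in favor of a plain method.

```python
import numpy as np

class NamedTensor:
    """Minimal sketch of an nd-array wrapper whose axes carry
    single-letter names (hypothetical, for illustration only)."""

    def __init__(self, values, dims):
        values = np.asarray(values)
        assert values.ndim == len(dims), "one name per axis"
        self.values = values
        self.dims = tuple(dims)

    def __class_getitem__(cls, dims):
        # NamedTensor["CT"] yields a constructor with the names baked in
        return lambda values: cls(values, dims)

    def sum(self, axis):
        # resolve the axis name to a position, then drop it from dims
        i = self.dims.index(axis)
        return NamedTensor(
            self.values.sum(axis=i), self.dims[:i] + self.dims[i + 1:]
        )

    def __repr__(self):
        inner = ", ".join(
            f"{d}={n}" for d, n in zip(self.dims, self.values.shape)
        )
        return f"Tensor<{inner}>"

Features = NamedTensor["CT"]
feat = Features(np.array([[4, 5, 6], [7, 8, 9]]))
print(feat)            # Tensor<C=2, T=3>
print(feat.sum("T"))   # Tensor<C=2>
```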

Indexing

Further, it's possible to add an index:

>>> ts = TimeSeries(
    np.array([1, 2, 3]),
    index={"T": np.array(["a", "b", "c"])}
)

>>> ts.Ti[:"b"]
Tensor<T=2>

>>> ts.Ti[:"b"].index["T"]
array(['a', 'b'], dtype='<U1')
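The label-based slicing shown above could work roughly like this hypothetical helper, which assumes a sorted index and makes the stop label inclusive, as in the `ts.Ti[:"b"]` example:

```python
import numpy as np

def label_slice(values, index, stop_label):
    # inclusive stop: slicing up to "b" keeps "b" itself,
    # unlike ordinary half-open positional slicing
    stop = int(np.searchsorted(index, stop_label, side="left")) + 1
    return values[:stop], index[:stop]

values = np.array([1, 2, 3])
index = np.array(["a", "b", "c"])
sliced, sliced_index = label_slice(values, index, "b")
# sliced -> array([1, 2]), sliced_index -> array(['a', 'b'], dtype='<U1')
```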

TensorFrame

A TensorFrame is a collection of Tensors, which can share dimensions:

>>> tf = TensorFrame(
    {
        "target": TimeSeries(np.array([1, 2, 3])),
        "feat": Features(np.array([[4, 5, 6], [7, 8, 9]])),
    },
    shared_dims="T",
)

>>> tf.shape
{'T': 3}

>>> tf.T[:2].shape
{'T': 2}
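A toy version of this shared-dimension slicing could look like the following (illustrative names, not the proposed API). Each member stores its own axis names, so the shared axis can sit at a different position in each array, and members lacking the axis would simply pass through unchanged:

```python
import numpy as np

class ToyFrame:
    """Toy TensorFrame sketch: maps names to (dims, array) pairs and
    slices every member along a shared axis, wherever that axis sits."""

    def __init__(self, tensors):
        self.tensors = tensors  # name -> (dims, np.ndarray)

    def slice_axis(self, axis, sl):
        out = {}
        for name, (dims, arr) in self.tensors.items():
            if axis in dims:
                indexer = [slice(None)] * arr.ndim
                indexer[dims.index(axis)] = sl
                arr = arr[tuple(indexer)]
            out[name] = (dims, arr)
        return ToyFrame(out)

    def shape(self, axis):
        for dims, arr in self.tensors.values():
            if axis in dims:
                return arr.shape[dims.index(axis)]

tf_toy = ToyFrame({
    "target": ("T", np.array([1, 2, 3])),
    "feat": ("CT", np.array([[4, 5, 6], [7, 8, 9]])),
})
tf_toy2 = tf_toy.slice_axis("T", slice(None, 2))
```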

lostella commented 2 years ago

Some comments:

jaheba commented 2 years ago
> • A better title would be “arrays with named axes for storing data” or something similar, since “representation” may be ambiguous I think (could be interpreted as “mapping to an embedding space”, which is totally unrelated to this)

Updated the title.

> • It would be good to frame this proposal within the user experience (either the “model user” or the “model developer”, or both), to understand what issue this solves or what it enables: as I understand it, this would make for a more structured and descriptive type for “data entries” (currently dictionaries), so I get that a lot of data manipulation would become much clearer, but in any case it should be articulated in the RFC description I think, maybe with examples from which to work backwards

Agreed, let me work on this next.

> • np.ndarray already has a T attribute, see here, so we should be careful there. Maybe using extended names for the axes would be an option (“time”, “feature”, or any meaningful word for what an axis spans), but I’m not sure whether this would complicate things, for example the indexing story

More meaningful descriptors would probably help avoid confusion; maybe we could have something like this:

data.ax.time[...]
# vs
data.T[...]
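A sketch of how such a `data.ax.time` accessor namespace could avoid clashing with existing ndarray attributes like `.T` (all names here are illustrative, not a proposed implementation; for simplicity it only supports slices, since an integer index would also need to drop the axis name):

```python
import numpy as np

class _AxisView:
    # data.ax.time returns one of these; indexing it slices that axis
    def __init__(self, tensor, name):
        self._tensor, self._name = tensor, name

    def __getitem__(self, sl):
        i = self._tensor.dims.index(self._name)
        indexer = [slice(None)] * self._tensor.values.ndim
        indexer[i] = sl
        return Data(self._tensor.values[tuple(indexer)], self._tensor.dims)

class _Ax:
    # attribute access maps an axis name to a sliceable view
    def __init__(self, tensor):
        self._tensor = tensor

    def __getattr__(self, name):
        return _AxisView(self._tensor, name)

class Data:
    """Array wrapper with word-length axis names behind an .ax namespace."""
    def __init__(self, values, dims):
        self.values = np.asarray(values)
        self.dims = tuple(dims)

    @property
    def ax(self):
        return _Ax(self)

data = Data([[4, 5, 6], [7, 8, 9]], ("feature", "time"))
first_two = data.ax.time[:2]   # slice the axis named "time"
```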

> • I didn’t completely get the TensorFrame example: what does the slicing operation return there? I thought it would return some slice of the data, but a dictionary is displayed

I think this is a typo, let me fix it.

jaheba commented 2 years ago

To expand on the TensorFrame: the idea is to be able to handle multiple arrays which share a common axis. So if I select the first n elements along the T axis, all underlying arrays are sliced accordingly:

TimeSeries = Tensor["T"]

A = Tensor["TC"]
B = Tensor["CT"]

target = TimeSeries(np.array([1, 2, 3, 4]))

a = A(
    np.array(
        [
            [5, 6],
            [7, 8],
            [9, 10],
            [11, 12],
        ]
    )
)

b = B(np.array([["a", "b", "c", "d"], ["e", "f", "g", "h"]]))

tf = TensorFrame({"target": target, "a": a, "b": b})

tf2 = tf.T[:3]

tf2.get("target").values == [1, 2, 3]
tf2.get("a").values == [[5, 6], [7, 8], [9, 10]]
tf2.get("b").values == [["a", "b", "c"], ["e", "f", "g"]]

lostella commented 2 years ago

Is this relevant? https://xarray.pydata.org/en/stable/index.html

lostella commented 2 years ago

> To expand on the TensorFrame: the idea is to be able to handle multiple arrays which share a common axis. So if I select the first n elements along the T axis, all underlying arrays are sliced accordingly:

One may need to also bundle arrays that do not share any axis. For example, static features do not have a time axis: in this case, I guess, slicing the TensorFrame along the time dimension should yield something that has the same “features” field as the original object.

jaheba commented 2 years ago

> Is this relevant? https://xarray.pydata.org/en/stable/index.html

Might be, but I didn't find it intuitive to use. Something I found with pandas is that it is incredibly slow compared to numpy -- xarray might share the same fate.

jaheba commented 2 years ago

> > To expand on the TensorFrame: the idea is to be able to handle multiple arrays which share a common axis. So if I select the first n elements along the T axis, all underlying arrays are sliced accordingly:
>
> One may need to also bundle arrays that do not share any axis. For example, static features do not have a time axis: in this case, I guess, slicing the TensorFrame along the time dimension should yield something that has the same “features” field as the original object.

Yes, but I haven't implemented these static fields yet. I think pandas supports something similar where you have properties.

jaheba commented 2 years ago

I've added code for Tensor here: #1877

kashif commented 2 years ago

My feeling is that instead of named axes, the DL community has moved to adopt einsum/einops-style notations for tensor operations.
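For comparison, the einsum style attaches axis names to each operation rather than to the array itself. A quick illustration in plain numpy (shown only to contrast the two notations):

```python
import numpy as np

feat = np.array([[4, 5, 6], [7, 8, 9]])  # axes: (c, t)

# the axis names live in the subscript string, not on the array
per_category = np.einsum("ct->c", feat)  # sum over t -> [15, 24]
per_time = np.einsum("ct->t", feat)      # sum over c -> [11, 13, 15]
```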

jaheba commented 2 years ago

I've done some further work on this.

My rework of the evaluation "just works" with Tensors in place of plain arrays, without any changes to the evaluation code:

TimeSeriesBatch = Tensor["n", "time"]
TimeSeriesSampleBatch = Tensor["n", "sample", "time"]

actual = TimeSeriesBatch([[1, 2, 3, 4], [5, 6, 7, 8]])

forecast = TimeSeriesSampleBatch(
    [
        [
            [1, 1, 1, 1],
            [2, 2, 2, 2],
        ],
        [
            [5, 5, 5, 5],
            [6, 6, 6, 6],
        ],
    ]
)

ev = Evaluator([AbsTargetSum(), ND()])
result = ev.apply({"target": actual}, forecast)

print(result.aggregate("time").select())
print(result.aggregate("n").select())

prints:

{'abs_target_sum': Tensor<n=2>, 'ND': Tensor<sample=2, n=2>}
{'abs_target_sum': Tensor<time=4>, 'ND': Tensor<sample=2, time=4>}

--

Further, I think this should simplify our transformation code.

For example, we can go from this:


@dataclass
class AddAgeFeature(MapTransformation):
    target_field: str
    output_field: str
    pred_length: int
    log_scale: bool = True
    dtype: DType = np.float32

    def map_transform(self, data: DataEntry, is_train: bool) -> DataEntry:
        length = target_transformation_length(
            data[self.target_field], self.pred_length, is_train=is_train
        )

        age = np.arange(length, dtype=self.dtype)

        if self.log_scale:
            age = np.log10(2.0 + age)

        data[self.output_field] = age.reshape((1, length))

        return data

to

@dataclass
class AddAgeFeature(MapTransformation):
    output_field: str
    log_scale: bool = True
    dtype: DType = np.float32

    def map_transform(self, frame: TensorFrame) -> TensorFrame:
        age = np.arange(frame.shape["time"], dtype=self.dtype)

        if self.log_scale:
            age = np.log10(2.0 + age)

        frame[self.output_field] = TimeSeries(age)

        return frame

reducing the complexity significantly.