Open jaheba opened 2 years ago
Some comments:
- A better title would be “arrays with named axes for storing data” or something similar, since “representation” may be ambiguous I think (could be interpreted as “mapping to an embedding space” which is totally unrelated to this)
Updated the title.
- It would be good to frame this proposal within the user experience (either the “model user” or the “model developer”, or both), to understand what issue this solves, or what this enables: as I understand it, this would make for a more structured and descriptive type for “data entries” (currently dictionaries), so I get that a lot of data manipulation would become much clearer, but in any case it should be articulated in the RFC description I think, maybe with examples from which to work backwards
Agreed, let me work on this next.
- np.ndarray already has a T attribute, see here, so we should be careful there. Maybe using extended names for the axes would be an option (“time”, “feature”, or any meaningful word for what an axis spans) but I’m not sure whether this would complicate things, for example about the indexing story
More meaningful descriptors would probably help avoid confusion; maybe we could have something like this:

```python
data.ax.time[...]
# vs
data.T[...]
```
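A minimal sketch of how such a named-axis accessor could be implemented on top of numpy — all class names here (`NamedArray`, `_AxisAccessor`) are hypothetical, for illustration only:

```python
import numpy as np

class _AxisIndexer:
    """Indexes a single named axis of the wrapped array."""
    def __init__(self, data, axis):
        self.data = data
        self.axis = axis

    def __getitem__(self, key):
        # Build a full index tuple that slices only the target axis.
        index = [slice(None)] * self.data.values.ndim
        index[self.axis] = key
        return self.data.values[tuple(index)]

class _AxisAccessor:
    """Resolves axis names (e.g. `data.ax.time`) to axis positions."""
    def __init__(self, data):
        self.data = data

    def __getattr__(self, name):
        return _AxisIndexer(self.data, self.data.dims.index(name))

class NamedArray:
    """Hypothetical wrapper: an ndarray plus a list of axis names."""
    def __init__(self, values, dims):
        self.values = np.asarray(values)
        self.dims = list(dims)

    @property
    def ax(self):
        return _AxisAccessor(self)

data = NamedArray([[1, 2, 3], [4, 5, 6]], dims=["feature", "time"])
print(data.ax.time[:2])    # slices along axis 1 ("time")
print(data.ax.feature[0])  # slices along axis 0 ("feature")
```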
- I didn’t completely get the TensorFrame example: what does the slicing operation return there? I thought it would return some slice of the data, but a dictionary is displayed
I think this is a typo, let me fix it.
To expand on the `TensorFrame`: the idea is to be able to handle multiple arrays which share a common axis. So if I select the first `n` elements on the `T` axis, I slice all underlying arrays accordingly:
```python
TimeSeries = Tensor["T"]
A = Tensor["TC"]
B = Tensor["CT"]

target = TimeSeries(np.array([1, 2, 3, 4]))
a = A(
    np.array(
        [
            [5, 6],
            [7, 8],
            [9, 10],
            [11, 12],
        ]
    )
)
b = B(np.array([["a", "b", "c", "d"], ["e", "f", "g", "h"]]))

tf = TensorFrame({"target": target, "a": a, "b": b})
tf2 = tf.T[:3]

tf2.get("target").values == [1, 2, 3]
tf2.get("a").values == [[5, 6], [7, 8], [9, 10]]
tf2.get("b").values == [["a", "b", "c"], ["e", "f", "g"]]
```
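To make the slicing semantics concrete, here is a minimal runnable sketch that reproduces the behaviour above; the `Tensor` and `TensorFrame` classes below are illustrative stand-ins, not the actual RFC code:

```python
import numpy as np

class Tensor:
    """nd-array with single-letter axis names, e.g. dims='TC'."""
    def __init__(self, values, dims):
        self.values = np.asarray(values)
        self.dims = dims

    def take(self, dim, key):
        """Slice along the named axis, keeping other axes intact."""
        index = [slice(None)] * self.values.ndim
        index[self.dims.index(dim)] = key
        return Tensor(self.values[tuple(index)], self.dims)

class _FrameIndexer:
    def __init__(self, frame, dim):
        self.frame = frame
        self.dim = dim

    def __getitem__(self, key):
        # Slice every member tensor along the shared axis.
        return TensorFrame(
            {name: t.take(self.dim, key) for name, t in self.frame.tensors.items()}
        )

class TensorFrame:
    def __init__(self, tensors):
        self.tensors = tensors

    def get(self, name):
        return self.tensors[name]

    @property
    def T(self):
        # Hypothetical accessor for the shared "T" axis.
        return _FrameIndexer(self, "T")

target = Tensor(np.array([1, 2, 3, 4]), "T")
a = Tensor(np.array([[5, 6], [7, 8], [9, 10], [11, 12]]), "TC")
b = Tensor(np.array([["a", "b", "c", "d"], ["e", "f", "g", "h"]]), "CT")

tf2 = TensorFrame({"target": target, "a": a, "b": b}).T[:3]
print(tf2.get("b").values)  # [['a' 'b' 'c'] ['e' 'f' 'g']]
```

Note that `b` has dims `"CT"`, so the `T` slice is applied to its second axis, which is what makes the shared-axis bookkeeping worthwhile.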
Is this relevant? https://xarray.pydata.org/en/stable/index.html
> To expand on the `TensorFrame`: the idea is to be able to handle multiple arrays which share a common axis. So if I select the first `n` elements on the `T` axis, I slice all underlying arrays accordingly:

One may need to also bundle arrays that do not share any axis. For example, static features do not have a time axis: in this case, I guess, slicing the TensorFrame along the time dimension should yield something that has the same “features” field as the original object.
> Is this relevant? https://xarray.pydata.org/en/stable/index.html

Might be, but I didn't find it intuitive to use. Something I found with pandas is that it is incredibly slow compared to numpy -- xarray might share the same fate.
> One may need to also bundle arrays that do not share any axis. For example, static features do not have a time axis: in this case, I guess, slicing the TensorFrame along the time dimension should yield something that has the same “features” field as the original object.

Yes, but I haven't implemented these static fields yet. I think pandas supports something similar where you have properties.
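A sketch of how that pass-through behaviour could work, assuming a plain dict-of-arrays layout; the `slice_axis` helper and field names are made up for illustration:

```python
import numpy as np

def slice_axis(tensors, dims, axis, key):
    """Slice every array that has the named axis; pass others through.

    `tensors` maps field name -> ndarray, `dims` maps field name ->
    axis-name string (hypothetical layout, just for illustration).
    """
    out = {}
    for name, values in tensors.items():
        if axis in dims[name]:
            index = [slice(None)] * values.ndim
            index[dims[name].index(axis)] = key
            out[name] = values[tuple(index)]
        else:
            out[name] = values  # static field: no time axis, left untouched
    return out

frame = {
    "target": np.array([1, 2, 3, 4]),
    "static_cat": np.array([7]),  # static feature, no "T" axis
}
dims = {"target": "T", "static_cat": "C"}

sliced = slice_axis(frame, dims, "T", slice(0, 3))
print(sliced["target"])      # [1 2 3]
print(sliced["static_cat"])  # [7]  (unchanged)
```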
I've added code for `Tensor` here: #1877
My feeling is that, instead of named axes, the DL community has moved toward adopting einsum/einops-style notation for tensor operations
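For context, this is the kind of notation meant: each operand's axes are named positionally inside the subscript string, so the axis bookkeeping lives in the expression rather than on the array. A plain-numpy example:

```python
import numpy as np

# A batched matrix product over (n, t, c) x (n, c, k): the shared
# batch axis "n" is kept, the contracted axis "c" disappears.
x = np.ones((2, 4, 3))  # axes: n, t, c
w = np.ones((2, 3, 5))  # axes: n, c, k
y = np.einsum("ntc,nck->ntk", x, w)
print(y.shape)  # (2, 4, 5)
```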
I've done some further work on this.
My rework of the evaluation "just works" using `Tensor`s instead of arrays, without any code changes to the evaluation code:
```python
TimeSeriesBatch = Tensor["n", "time"]
TimeSeriesSampleBatch = Tensor["n", "sample", "time"]

actual = TimeSeriesBatch([[1, 2, 3, 4], [5, 6, 7, 8]])
forecast = TimeSeriesSampleBatch(
    [
        [
            [1, 1, 1, 1],
            [2, 2, 2, 2],
        ],
        [
            [5, 5, 5, 5],
            [6, 6, 6, 6],
        ],
    ]
)

ev = Evaluator([AbsTargetSum(), ND()])
result = ev.apply({"target": actual}, forecast)

print(result.aggregate("time").select())
print(result.aggregate("n").select())
```

prints:

```
{'abs_target_sum': Tensor<n=2>, 'ND': Tensor<sample=2, n=2>}
{'abs_target_sum': Tensor<time=4>, 'ND': Tensor<sample=2, time=4>}
```
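For illustration, a name-based aggregation like `result.aggregate("time")` could reduce over the axis whose name matches and drop it from the dims; the `aggregate` helper below is a hypothetical sketch, not the actual implementation:

```python
import numpy as np

def aggregate(values, dims, name, reducer=np.sum):
    """Reduce over the axis whose name matches, dropping it from dims."""
    axis = dims.index(name)
    reduced = reducer(values, axis=axis)
    remaining = [d for d in dims if d != name]
    return reduced, remaining

actual = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])  # dims: n, time
summed, dims = aggregate(actual, ["n", "time"], "time")
print(summed, dims)  # [10 26] ['n']
```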
--
Further, I think this should simplify our transformation code.
For example, we can go from this:
```python
@dataclass
class AddAgeFeature(MapTransformation):
    target_field: str
    output_field: str
    pred_length: int
    log_scale: bool = True
    dtype: DType = np.float32

    def map_transform(self, data: DataEntry, is_train: bool) -> DataEntry:
        length = target_transformation_length(
            data[self.target_field], self.pred_length, is_train=is_train
        )

        age = np.arange(length, dtype=self.dtype)
        if self.log_scale:
            age = np.log10(2.0 + age)

        data[self.output_field] = age.reshape((1, length))
        return data
```
to
```python
@dataclass
class AddAgeFeature(MapTransformation):
    output_field: str
    log_scale: bool = True
    dtype: DType = np.float32

    def map_transform(self, frame: TensorFrame) -> TensorFrame:
        age = np.arange(frame.shape["time"], dtype=self.dtype)
        if self.log_scale:
            age = np.log10(2.0 + age)

        frame[self.output_field] = TimeSeries(age)
        return frame
```
reducing the complexity significantly.
WIP
Tensor
The core idea is to have a wrapper around nd-arrays where dimensions are not referenced by integers but by single-letter names instead. Here `TimeSeries` and `Categories` are both 1-D arrays, however with incompatible dimensions; `Features` is a 2-D array. When we operate on these tensors, we need to specify which axes we want to operate on.
Indexing
Further, it's possible to add an index.
TensorFrame
A `TensorFrame` is a collection of `Tensor`s, which can share dimensions.