hdinia opened 3 months ago
Thanks for the description! As discussed a few days ago, the proposed data model seems very good to me as the front-side data model.

Then, in order to advance the development step by step, we would:

As a first step, re-create this data model from the current backend endpoints, and use data glide from it. This will allow us to validate the use of data glide without having to do heavy refactoring on the backend.

As a second step, improve the way this data model is populated from the backend:
About aggregates computation on the backend side, this would be added in yet another step, I think. In particular, it will probably not be possible to have the aggregates inside the same Arrow structure as the matrix content. So we need to think about how they could be transferred: either through another endpoint (but then we should take care of caching the values on the backend side, so as not to reload the file just to compute them), or by finding a way to return the Arrow table and those values in the same body.
About matrix updates from front to back
Currently, those updates are specified through operations, which have the advantage of not sending the whole matrix to the backend. I think in a first step we should keep this mechanism. Then we should assess whether it's OK to send the whole matrix every time as an Arrow table or not. Or maybe we should have several endpoints for different kinds of operations (updating only a few cells vs. updating a whole subpart of the table).
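To make the operations mechanism concrete, a cell-level update could look roughly like this (a sketch only; the actual operation schema used by the backend may differ):

```python
from dataclasses import dataclass, asdict

@dataclass
class CellUpdateOperation:
    # One cell-level edit: the whole matrix never travels over the wire.
    row: int
    column: int
    value: float

# A request body updating two cells is then a small list of operations.
operations = [CellUpdateOperation(0, 2, 42.0), CellUpdateOperation(5, 0, 0.0)]
payload = [asdict(op) for op in operations]
print(payload)
# [{'row': 0, 'column': 2, 'value': 42.0}, {'row': 5, 'column': 0, 'value': 0.0}]
```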
Data Format Changes
Introduction:
The current data format used in our application has some limitations, such as a lack of clarity and the need for expensive calculated columns to be generated in the browser. To address these issues, we are proposing a new data format that follows best practices such as having a single source of truth and separating the UI presentation layer from the logic layer. The new format will also improve performance by reducing the need for calculated columns, and it provides a clearer data structure. In this specification, we will outline the proposed changes to the data format and provide examples of how it will be implemented.
Currently, the `timestamps` or `aggregation` columns are generated by the frontend on the fly, at each request. And here's an example of the new format:
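A minimal payload in the new format could look like this (all values are illustrative, not taken from a real matrix):

```python
# Illustrative payload following the properties described below.
matrix_payload = {
    "dateTime": ["2024-07-01T00:00:00Z", "2024-07-01T01:00:00Z"],
    "data": [[100.0, 200.0], [150.0, 250.0]],
    "aggregates": {
        "min": [100.0, 200.0],
        "max": [150.0, 250.0],
        "avg": [125.0, 225.0],
    },
    "columns": [
        {"id": "dateTime", "title": "Date", "type": "datetime",
         "width": 150, "editable": False, "style": "normal"},
        {"id": "ts0", "title": "TS 1", "type": "number",
         "width": 100, "editable": True, "style": "normal"},
        {"id": "ts1", "title": "TS 2", "type": "number",
         "width": 100, "editable": True, "style": "normal"},
    ],
    "rows": 2,
    "metadata": {"defaultValue": 0, "kind": "hydroStorage", "title": "Hydro Storage"},
}
```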
In the new format, the `dateTime` array contains the timestamps for each row of data, the `data` array contains the actual data values, the `aggregates` object contains the minimum, maximum, and average values for each column, the `columns` array contains metadata for each column (such as its title, data type, and width), and the `metadata` object contains additional metadata for the entire dataset (such as its default value, kind, and title).

The new format is more explicit and easier to understand, as it separates the data, metadata, and UI presentation logic into distinct sections. It also allows for more efficient data processing, as the aggregates and column metadata can be calculated once on the server and then reused by the client, rather than being recalculated every time the data is rendered.
Format changes:
The new format includes the following properties:
- `dateTime`: an array of strings in ISO 8601 date format (e.g. `2024-07-01T00:00:00Z`) representing the date and time for each row of data.
- `data`: a 2D array of numbers representing the data for each row and column.
- `aggregates`: an object that includes the following properties:
  - `min`: an array of numbers representing the minimum value for each column.
  - `max`: an array of numbers representing the maximum value for each column.
  - `avg`: an array of numbers representing the average value for each column.
- `columns`: an array of objects that includes the following properties for each column:
  - `id`: a string representing the unique identifier for the column.
  - `title`: a string representing the display name for the column.
  - `type`: a string representing the type of the column. The purpose of this type is to differentiate UI columns that are read-only and have special styling from pure data columns that are editable. Enum (str): `"datetime" | "number" | "aggregate"` (may change).
  - `width`: a number representing the width of the column in pixels.
  - `editable`: a boolean indicating whether the column is editable.
  - `style`: an optional string representing the CSS style for the column. Enum (str): `"normal" | "highlight"`; defaults to `"normal"` if not provided.
- `rows`: a number representing the number of rows in the data.
- `metadata`: an object that includes additional metadata for the data:
  - `defaultValue`: a number indicating the default value used to fill the matrix when a resize is performed (this value may change depending on the kind of matrix).
  - `kind`: the kind of matrix, e.g. `"hydroStorage"`, `"waterValues"`, `"allocation"`. It makes it possible to switch some features depending on the kind of matrix, or simply to identify it.
  - `title`: the displayed title of the matrix.

Benefits of changing the format:
Respect of good practices: The new format follows the principle of a single source of truth, as the data is stored and managed in a central location (the backend) and is presented to the user in a consistent and accurate way.
Performance: By including the aggregates and column metadata in the data payload, the frontend can present the data more efficiently and accurately, without having to perform expensive calculations on the fly.
Clarity: The new format provides a straightforward data structure that is easy to understand and work with, both for the frontend and backend developers.
Separation of concerns: The new format separates the data from the UI presentation logic, which improves the separation of concerns and makes the codebase more maintainable.
Here's an example of Pydantic classes describing the new format:
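A possible shape for these classes, assuming the property names and enums from the list above (a sketch, not the final schema):

```python
from enum import Enum
from pydantic import BaseModel

class ColumnType(str, Enum):
    DATETIME = "datetime"
    NUMBER = "number"
    AGGREGATE = "aggregate"

class ColumnStyle(str, Enum):
    NORMAL = "normal"
    HIGHLIGHT = "highlight"

class MatrixColumn(BaseModel):
    id: str
    title: str
    type: ColumnType
    width: int
    editable: bool
    style: ColumnStyle = ColumnStyle.NORMAL  # optional, defaults to "normal"

class MatrixAggregates(BaseModel):
    min: list[float]
    max: list[float]
    avg: list[float]

class MatrixMetadata(BaseModel):
    defaultValue: float
    kind: str  # e.g. "hydroStorage", "waterValues", "allocation"
    title: str

class MatrixData(BaseModel):
    dateTime: list[str]  # ISO 8601 timestamps, e.g. "2024-07-01T00:00:00Z"
    data: list[list[float]]
    aggregates: MatrixAggregates
    columns: list[MatrixColumn]
    rows: int
    metadata: MatrixMetadata
```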
About the Apache Arrow format

Arrow is a columnar memory format designed for efficient in-memory data processing and interchange. It is developed by the Apache Arrow open-source project, which aims to improve the performance of big data processing applications.
Data flow Frontend -> Backend: (frontend makes a PUT request to update a matrix) the table is serialized with the `tableToIPC` method and sent to the backend via a POST request, where it can be read with `pyarrow` and `pandas`.
Data flow Backend -> Frontend: (frontend makes a GET request to read a matrix content) on the backend, the table and its aggregates can be prepared with `pyarrow.compute` and `pandas`; on the frontend, the response is deserialized with the `tableFromIPC` method from the `apache-arrow` library.

Resources:
Apache Arrow in JS
Apache Arrow Usage with Pandas
Python Cookbook