hdinia opened 3 months ago
Thanks for the description! As discussed a few days ago, the proposed data model seems very good to me as the front-side data model.

Then, in order to advance the development step by step, we would:

As a first step, re-create this data model from the current backend endpoints, and use data glide from it. This will allow us to validate the use of data glide without having to do heavy refactoring on the backend.

As a second step, improve the way this data model is populated from the backend:
About aggregates computation on the backend side, this would be added in yet another step, I think. In particular, it will probably not be possible to have the aggregates inside the same Arrow structure as the matrix content. So we need to think about how they could be transferred: either through another endpoint (but then we should take care of caching the values on the backend side, so as not to reload the file just to compute them), or by finding a way to return the Arrow table and those values in the same body.
About matrix updates from front to back
Currently, those updates are specified through operations, which have the advantage of not sending the whole matrix to the backend. I think in a first step we should keep this mechanism. Then we should assess whether it's OK to send the whole matrix every time as an Arrow table or not. Or maybe we should have several endpoints for different kinds of operations (updating only a few cells vs. updating a whole subpart of the table).
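To make the operations mechanism concrete, a cell-level update could look roughly like this (a sketch only; the actual operation schema used by the backend may differ):

```python
from dataclasses import dataclass, asdict

@dataclass
class CellUpdateOperation:
    # One cell-level edit: the whole matrix never travels over the wire.
    row: int
    column: int
    value: float

# A request body updating two cells is then a small list of operations.
operations = [CellUpdateOperation(0, 2, 42.0), CellUpdateOperation(5, 0, 0.0)]
payload = [asdict(op) for op in operations]
print(payload)
# [{'row': 0, 'column': 2, 'value': 42.0}, {'row': 5, 'column': 0, 'value': 0.0}]
```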
Data Format Changes
Introduction:
The current data format used in our application has some limitations, such as a lack of clarity and the need for expensive calculated columns to be generated in the browser. To address these issues, we are proposing a new data format that follows best practices such as having a single source of truth and separating the UI presentation layer from the logic layer. The new format will also improve performance by reducing the need for calculated columns, and it provides a clearer data structure. In this specification, we will outline the proposed changes to the data format and provide examples of how it will be implemented.
Currently, the `timestamps` or `aggregation` columns are generated by the frontend on the fly, at each request. And here's an example of the new format:
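A minimal payload in the new format could look like this (all values are illustrative, not taken from a real matrix):

```python
# Illustrative payload following the properties described below.
matrix_payload = {
    "dateTime": ["2024-07-01T00:00:00Z", "2024-07-01T01:00:00Z"],
    "data": [[100.0, 200.0], [150.0, 250.0]],
    "aggregates": {
        "min": [100.0, 200.0],
        "max": [150.0, 250.0],
        "avg": [125.0, 225.0],
    },
    "columns": [
        {"id": "dateTime", "title": "Date", "type": "datetime",
         "width": 150, "editable": False, "style": "normal"},
        {"id": "ts0", "title": "TS 1", "type": "number",
         "width": 100, "editable": True, "style": "normal"},
        {"id": "ts1", "title": "TS 2", "type": "number",
         "width": 100, "editable": True, "style": "normal"},
    ],
    "rows": 2,
    "metadata": {"defaultValue": 0, "kind": "hydroStorage", "title": "Hydro Storage"},
}
```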
In the new format, the `dateTime` array contains the timestamps for each row of data, the `data` array contains the actual data values, the `aggregates` object contains the minimum, maximum, and average values for each column, the `columns` array contains metadata for each column (such as its title, data type, and width), and the `metadata` object contains additional metadata for the entire dataset (such as its default value, kind, and title).

The new format is more explicit and easier to understand, as it separates the data, metadata, and UI presentation logic into distinct sections. It also allows for more efficient data processing, as the aggregates and column metadata can be calculated once on the server and then reused by the client, rather than being recalculated every time the data is rendered.
Format changes:
The new format includes the following properties:
- `dateTime`: an array of strings in ISO 8601 date format (e.g. `2024-07-01T00:00:00Z`) representing the date and time for each row of data.
- `data`: a 2D array of numbers representing the data for each row and column.
- `aggregates`: an object that includes the following properties:
  - `min`: an array of numbers representing the minimum value for each column.
  - `max`: an array of numbers representing the maximum value for each column.
  - `avg`: an array of numbers representing the average value for each column.
- `columns`: an array of objects that includes the following properties for each column:
  - `id`: a string representing the unique identifier for the column.
  - `title`: a string representing the display name for the column.
  - `type`: a string representing the type of the column. The purpose of this type is to differentiate UI columns that are read-only and have special styling from pure data columns that are editable. Enum (str): `"datetime" | "number" | "aggregate"` (may change).
  - `width`: a number representing the width of the column in pixels.
  - `editable`: a boolean indicating whether the column is editable.
  - `style`: an optional string representing the CSS style for the column. Enum (str): `"normal" | "highlight"`; defaults to `"normal"` if not provided.
- `rows`: a number representing the number of rows in the data.
- `metadata`: an object that includes additional metadata for the data:
  - `defaultValue`: a number indicating the default value used to fill the matrix when a resize is performed (this value may change depending on the kind of matrix).
  - `kind`: the kind of matrix, e.g. `"hydroStorage"`, `"waterValues"`, `"allocation"`. It makes it possible to switch some features depending on the kind of matrix, or simply to identify it.
  - `title`: the displayed title of the matrix.

Benefits of changing the format:
Respect of good practices: The new format follows the principle of a single source of truth, as the data is stored and managed in a central location (the backend) and is presented to the user in a consistent and accurate way.
Performance: By including the aggregates and column metadata in the data payload, the frontend can present the data more efficiently and accurately, without having to perform expensive calculations on the fly.
Clarity: The new format provides a straightforward data structure that is easy to understand and work with, both for the frontend and backend developers.
Separation of concerns: The new format separates the data from the UI presentation logic, which improves the separation of concerns and makes the codebase more maintainable.
Here's an example of Pydantic classes describing the new format:
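A possible shape for these classes, assuming the property names and enums from the list above (a sketch, not the final schema):

```python
from enum import Enum
from pydantic import BaseModel

class ColumnType(str, Enum):
    DATETIME = "datetime"
    NUMBER = "number"
    AGGREGATE = "aggregate"

class ColumnStyle(str, Enum):
    NORMAL = "normal"
    HIGHLIGHT = "highlight"

class MatrixColumn(BaseModel):
    id: str
    title: str
    type: ColumnType
    width: int
    editable: bool
    style: ColumnStyle = ColumnStyle.NORMAL  # optional, defaults to "normal"

class MatrixAggregates(BaseModel):
    min: list[float]
    max: list[float]
    avg: list[float]

class MatrixMetadata(BaseModel):
    defaultValue: float
    kind: str  # e.g. "hydroStorage", "waterValues", "allocation"
    title: str

class MatrixData(BaseModel):
    dateTime: list[str]  # ISO 8601 timestamps, e.g. "2024-07-01T00:00:00Z"
    data: list[list[float]]
    aggregates: MatrixAggregates
    columns: list[MatrixColumn]
    rows: int
    metadata: MatrixMetadata
```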
About the Apache Arrow format

Arrow is a columnar memory format designed for efficient in-memory data processing and interchange. It is developed by the Apache Arrow open-source project, which aims to improve the performance of big data processing applications.
Data flow Frontend -> Backend: (frontend makes a PUT request to update a matrix) the table is serialized with the `tableToIPC` method and sent to the backend via a POST request, where it can be read with `pyarrow` and `pandas`.
Data flow Backend -> Frontend: (frontend makes a GET request to read a matrix content) on the backend, the table and its aggregates can be prepared with `pyarrow.compute` and `pandas`; on the frontend, the response is deserialized with the `tableFromIPC` method from the `apache-arrow` library.

Resources:
Apache Arrow in JS
Apache Arrow Usage with Pandas
Python Cookbook