This PR adds a feature that calculates and displays dataset statistics when the storage mode is set to MongoDB. The backend computes the statistics for each dataset and sends them to the frontend, which displays them in the stats row of the result table (the table's first row).
Introduction
The primary goal of this feature is to provide detailed insights into the dataset directly from the MongoDB storage. The computations are optimized to handle data in batches, reflecting the batch-wise ingestion of data into MongoDB. This means that each update in MongoDB triggers calculations only on the newly added documents, rather than recalculating for all existing documents.
The calculated statistics include:
Numerical Columns: Mean, Maximum, Minimum
Date Columns: Maximum Date, Minimum Date
Categorical Columns: The most frequent category, the second most frequent category, their proportions, and the proportion of all other categories combined (Others).
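The batch-wise update for a numerical column can be sketched as follows. This is a minimal illustration, not the code in this PR: the `NumericStats` class and its field names are hypothetical. It assumes the backend keeps a running count and sum per column so that the mean can be refreshed from the new documents alone. Date columns would follow the same pattern, keeping only the minimum and maximum.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class NumericStats:
    """Running statistics for one numerical column (hypothetical structure)."""

    count: int = 0
    total: float = 0.0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def update(self, batch: Iterable[float]) -> None:
        """Fold one newly ingested batch into the running statistics."""
        for value in batch:
            self.count += 1
            self.total += value
            self.minimum = value if self.minimum is None else min(self.minimum, value)
            self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def mean(self) -> Optional[float]:
        """Mean over every document seen so far, or None before any data arrives."""
        return self.total / self.count if self.count else None


stats = NumericStats()
stats.update([2.0, 4.0])  # first batch
stats.update([6.0])       # second batch: only the new document is scanned
```

After the second call, earlier documents have not been re-read; the stored count and sum are enough to produce the updated mean.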
To avoid memory issues, if the number of categories in a categorical column exceeds 1000 in a batch, only the top 1000 categories (sorted by frequency) are considered for calculation. This information, including the approximated nature of the statistics, is communicated to the frontend.
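The per-batch cap can be sketched like this. It is an illustration under stated assumptions rather than the backend implementation: `count_batch` and `summarize` are hypothetical helpers, and the choice to compute proportions over the retained counts only is an assumption, since the PR does not spell out how truncated categories affect the totals.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MAX_CATEGORIES = 1000  # per-batch cap described in this PR


def count_batch(values: Iterable[str], limit: int = MAX_CATEGORIES) -> Tuple[Counter, bool]:
    """Count categories in one batch, keeping at most `limit` of the most frequent.

    Returns the (possibly truncated) counts and a flag telling whether
    truncation happened, i.e. whether the statistics are approximate.
    """
    counts = Counter(values)
    if len(counts) <= limit:
        return counts, False
    return Counter(dict(counts.most_common(limit))), True


def summarize(counts: Counter) -> List[Tuple[str, float]]:
    """Return the top two categories and "Others" with their proportions.

    Assumption: proportions are relative to the retained counts.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    top_two = counts.most_common(2)
    summary = [(label, n / total) for label, n in top_two]
    others = total - sum(n for _, n in top_two)
    if others > 0:
        summary.append(("Others", others / total))
    return summary


# Usage with a small cap of 3 so that truncation is visible.
running = Counter()
approximate = False
for batch in [["a", "a", "a", "b", "b", "c"], ["a", "b", "d", "e"]]:
    counts, truncated = count_batch(batch, limit=3)
    running.update(counts)
    approximate = approximate or truncated

summary = summarize(running)
```

In this example the second batch has four distinct categories, so one is dropped and `approximate` becomes true; that flag is the kind of information the frontend would need in order to mark the statistics as approximate.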
Detailed Changes
Backend Changes:
Computation of statistics for numerical, date, and categorical columns.
Optimization of calculations to handle data in batches.
Handling large categorical columns by limiting to the top 1000 categories per batch.
Frontend Changes:
Display of computed statistics in the stats row of the table.
Indication of approximate statistics when applicable.
The displayed stats will update with each new batch of data entering MongoDB, and changes will be highlighted in red.
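The logic for deciding which cells to highlight after a new batch can be sketched independently of the UI framework. The snippet below is a language-neutral illustration written in Python, not the frontend code in this PR. The `changed_cells` function and the nested-dictionary layout are hypothetical, and treating a statistic that appears for the first time as changed is an assumption.

```python
from typing import Any, Dict, Set, Tuple

# Column name -> {statistic name -> value}; a hypothetical layout.
Stats = Dict[str, Dict[str, Any]]


def changed_cells(previous: Stats, current: Stats) -> Set[Tuple[str, str]]:
    """Return the (column, statistic) pairs whose value differs from the last update."""
    changed = set()
    for column, column_stats in current.items():
        old = previous.get(column, {})
        for name, value in column_stats.items():
            # A statistic seen for the first time is treated as changed (assumption).
            if name not in old or old[name] != value:
                changed.add((column, name))
    return changed


previous = {"price": {"min": 2.0, "max": 4.0, "mean": 3.0}}
current = {"price": {"min": 2.0, "max": 6.0, "mean": 4.0}}
to_highlight = changed_cells(previous, current)
```

Here only the maximum and the mean of `price` differ between the two snapshots, so only those two cells would be marked.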
This PR contains the frontend code; the backend code is in PR #2693.
By implementing this feature, users can now get comprehensive statistical insights into datasets, enabling better data analysis and decision-making.
Demo
https://github.com/Texera/texera/assets/86388854/05cdebb4-c7c2-4e64-bd69-53a174a00ca3