Texera / texera

Collaborative Machine-Learning-Centric Data Analytics Using Workflows
https://texera.github.io
Apache License 2.0

Enhance MongoDB Storage Mode with Dataset Statistics Calculation #2694

Closed mengw15 closed 2 months ago

mengw15 commented 2 months ago

Introduction


This PR adds a feature that calculates and displays dataset statistics when the storage mode is set to MongoDB. The backend now computes statistics for each dataset and sends them to the frontend, which displays them in the stats row of the result table (the first row of the result table).

The primary goal of this feature is to provide detailed insights into the dataset directly from the MongoDB storage. The computations are optimized to handle data in batches, reflecting the batch-wise ingestion of data into MongoDB. This means that each update in MongoDB triggers calculations only on the newly added documents, rather than recalculating for all existing documents.
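The batch-wise update described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `RunningStats` class and the particular statistics it tracks (count, min, max, mean) are assumptions chosen for the example, and the batches are plain Python lists rather than MongoDB documents.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RunningStats:
    """Running summary of one numeric field (hypothetical structure)."""

    count: int = 0
    total: float = 0.0
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def update(self, batch: Iterable[float]) -> None:
        """Fold only the newly added values into the existing summary."""
        for value in batch:
            self.count += 1
            self.total += value
            self.minimum = value if self.minimum is None else min(self.minimum, value)
            self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def mean(self) -> Optional[float]:
        """Mean of all values seen so far, or None if nothing was seen."""
        return self.total / self.count if self.count else None


stats = RunningStats()
stats.update([3.0, 1.0, 4.0])  # first batch written to storage
stats.update([1.0, 5.0])       # second batch: earlier values are not revisited
```

Because the summary keeps only a few accumulators per field, the cost of each update depends on the size of the new batch, not on the number of documents already stored.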

The calculated statistics include:

To avoid memory issues, if a categorical column has more than 1000 distinct categories in a batch, only the 1000 most frequent categories are used in the calculation. The frontend is informed of this, including the fact that the resulting statistics are approximate.
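A rough sketch of the category cap, under stated assumptions: the function name `summarize_categories`, its return shape, and the use of an in-memory `Counter` are all hypothetical and are not taken from this PR. Only the limit of 1000 and the idea of flagging the result as approximate come from the description above.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

MAX_CATEGORIES = 1000  # cap stated in the PR description


def summarize_categories(
    batch: Iterable[Hashable], limit: int = MAX_CATEGORIES
) -> Tuple[Dict[Hashable, int], bool]:
    """Count categories in one batch, keeping at most `limit` of them.

    Returns the retained counts and a flag that is True when some
    categories were dropped, so the result is only approximate.
    """
    counts = Counter(batch)
    approximate = len(counts) > limit
    # most_common(limit) returns the `limit` highest-frequency entries.
    top = dict(counts.most_common(limit)) if approximate else dict(counts)
    return top, approximate


# Small limit used here only to make the truncation visible.
top, approximate = summarize_categories(["a", "b", "a", "c", "a", "b"], limit=2)
```

In this sketch the flag travels with the counts, so whatever consumes the result can show that the statistics are approximate without recomputing anything.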

Detailed Changes


Backend Changes:

Frontend Changes:

This PR contains the frontend code; the backend code is in PR #2693.

With this feature, users can see statistical summaries of their datasets directly in the result table, which supports better data analysis and decision-making.

Demo


https://github.com/Texera/texera/assets/86388854/05cdebb4-c7c2-4e64-bd69-53a174a00ca3