Texera / texera

Collaborative Machine-Learning-Centric Data Analytics Using Workflows
https://texera.github.io
Apache License 2.0

Remove Environment and associate Dataset with Operator directly #2719

Closed bobbai00 closed 1 month ago

bobbai00 commented 1 month ago

This PR removes the concept of Environment. The latest ER diagram of our system is:

Screenshot 2024-07-06 at 2 40 33 PM

The following diagrams show the system before and after the change, as well as the delta:

Texera-ER-before-after

texera-architecture-compare-delta

To realize this ER diagram, this PR implements several design changes.

0. DDL Changes

Tables and columns related to Environment are now removed:

USE `texera_db`;

-- Drop the environment-related foreign keys first
ALTER TABLE `environment_of_workflow`
    DROP FOREIGN KEY `environment_of_workflow_ibfk_2`;

ALTER TABLE `dataset_of_environment`
    DROP FOREIGN KEY `dataset_of_environment_ibfk_1`;

ALTER TABLE `dataset_of_environment`
    DROP FOREIGN KEY `dataset_of_environment_ibfk_2`;

ALTER TABLE `workflow_executions`
    DROP FOREIGN KEY `workflow_executions_ibfk_3`;

-- Drop the dependent tables
DROP TABLE IF EXISTS `environment_of_workflow`;
DROP TABLE IF EXISTS `dataset_of_environment`;

-- Drop the environment table
DROP TABLE IF EXISTS `environment`;

-- Drop the environment column from workflow executions
ALTER TABLE `workflow_executions`
    DROP COLUMN `environment_eid`;

1. File URI

To identify a file uniquely across the system, the file URI now has the format:

/${ownerEmail}/${datasetName}/${versionName}/${fileRelativePath}

For example, for the file tweets/california.csv in version v3 of the dataset Twitter, owned by user texera@uci.edu, the URI is:

/texera@uci.edu/Twitter/v3/tweets/california.csv
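The format above can be illustrated with a small Python sketch that builds and parses such a URI. This is not the code in this PR: the names FileUri, build_file_uri and parse_file_uri are hypothetical, and the sketch assumes that the owner email, dataset name, and version name never contain a "/" (only the file's relative path may).

```python
from typing import NamedTuple


class FileUri(NamedTuple):
    """Hypothetical container for the four URI components."""
    owner_email: str
    dataset_name: str
    version_name: str
    file_relative_path: str


def build_file_uri(owner_email: str, dataset_name: str,
                   version_name: str, file_relative_path: str) -> str:
    """Join the components into /owner/dataset/version/relative/path."""
    return f"/{owner_email}/{dataset_name}/{version_name}/{file_relative_path}"


def parse_file_uri(uri: str) -> FileUri:
    """Split a file URI into its components.

    Assumption: the first three components contain no "/", so splitting at
    most three times leaves the relative path (which may contain "/") intact.
    """
    if not uri.startswith("/"):
        raise ValueError(f"file URI must start with '/': {uri!r}")
    parts = uri[1:].split("/", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"file URI must have four non-empty components: {uri!r}")
    return FileUri(*parts)


uri = build_file_uri("texera@uci.edu", "Twitter", "v3", "tweets/california.csv")
parsed = parse_file_uri(uri)
print(uri)
print(parsed.file_relative_path)
```

Splitting with a maximum of three splits is what keeps nested paths such as tweets/california.csv in one piece.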

2. Workflow Sharing

When a workflow is shared with another user, only the files scanned by this workflow become visible to that user. Access to the underlying datasets is NOT propagated through the sharing.
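The sharing rule can be sketched as a simple visibility check in Python. This is only an illustration under assumed data structures, not the access-control code of this PR: can_read_file, dataset_access, workflow_access, files_scanned_by_workflow, the workflow id 42, and the user alice@example.com are all hypothetical.

```python
from typing import Dict, Set, Tuple


def can_read_file(user: str,
                  file_uri: str,
                  dataset_access: Dict[str, Set[Tuple[str, str]]],
                  workflow_access: Dict[str, Set[int]],
                  files_scanned_by_workflow: Dict[int, Set[str]]) -> bool:
    """Return True if `user` may see the file behind `file_uri`.

    dataset_access: user -> {(owner_email, dataset_name)} the user can access
    workflow_access: user -> ids of workflows shared with (or owned by) the user
    files_scanned_by_workflow: workflow id -> file URIs scanned by that workflow
    """
    parts = file_uri.lstrip("/").split("/", 3)
    if len(parts) != 4:
        return False
    owner_email, dataset_name = parts[0], parts[1]

    # Direct dataset access makes every file in the dataset visible.
    if (owner_email, dataset_name) in dataset_access.get(user, set()):
        return True

    # Otherwise only files scanned by a workflow shared with the user are
    # visible; the rest of the dataset stays hidden.
    return any(
        file_uri in files_scanned_by_workflow.get(wid, set())
        for wid in workflow_access.get(user, set())
    )


scanned = "/texera@uci.edu/Twitter/v3/tweets/california.csv"
not_scanned = "/texera@uci.edu/Twitter/v3/tweets/texas.csv"

dataset_access = {"texera@uci.edu": {("texera@uci.edu", "Twitter")}}
workflow_access = {"alice@example.com": {42}}  # workflow 42 shared with alice
files_scanned_by_workflow = {42: {scanned}}

print(can_read_file("alice@example.com", scanned, dataset_access,
                    workflow_access, files_scanned_by_workflow))
print(can_read_file("alice@example.com", not_scanned, dataset_access,
                    workflow_access, files_scanned_by_workflow))
```

In this sketch the collaborator can see the scanned file but not the other file in the same dataset version, which mirrors the statement that dataset access is not propagated.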

Demo

First of all, the environment tab is removed:

Screenshot 2024-07-06 at 2 45 19 PM

To scan a source file, users no longer need to add datasets explicitly anywhere; they can select files from all accessible datasets and versions.

Screenshot 2024-07-06 at 3 00 24 PM

2024-07-06 14 46 49

Future Work

Currently, datasets are loaded in full and frequently, which can make the GUI less responsive and fluid. An ongoing effort led by one of our team members aims to improve the file-selection interaction logic. That work can make the loading happen lazily and make searching datasets and files much easier.
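As a rough illustration of what lazy loading could mean here, the Python sketch below fetches the file listing of a dataset version only when it is first requested and caches the result. It is an assumption about one possible design, not the planned implementation: LazyFileListing, fake_fetch, and the sample file names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class LazyFileListing:
    """Loads the file list of a (dataset, version) pair on first use only."""

    def __init__(self, fetch_files: Callable[[str, str], List[str]]):
        # fetch_files stands in for whatever backend call lists the files.
        self._fetch_files = fetch_files
        self._cache: Dict[Tuple[str, str], List[str]] = {}

    def files(self, dataset_name: str, version_name: str) -> List[str]:
        key = (dataset_name, version_name)
        if key not in self._cache:
            self._cache[key] = self._fetch_files(dataset_name, version_name)
        return self._cache[key]

    def search(self, keyword: str) -> List[Tuple[str, str, str]]:
        """Search only the listings that have already been loaded."""
        return [
            (dataset, version, path)
            for (dataset, version), paths in self._cache.items()
            for path in paths
            if keyword in path
        ]


calls: List[Tuple[str, str]] = []


def fake_fetch(dataset_name: str, version_name: str) -> List[str]:
    # Records each backend call so the caching effect is observable.
    calls.append((dataset_name, version_name))
    return ["tweets/california.csv", "tweets/texas.csv"]


listing = LazyFileListing(fake_fetch)
listing.files("Twitter", "v3")
listing.files("Twitter", "v3")  # served from the cache, no second fetch
print(len(calls))               # prints 1
print(listing.search("texas"))
```

The trade-off of this design is that search only covers listings already loaded; a full search across all datasets would still need backend support.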

chenlica commented 1 month ago

@bobbai00 : the diagrams are very nice. Per our discussion, please add comments about how to improve the interface to allow a user to select a file (as a future PR).