Based on the working group discussion in https://github.com/ITISFoundation/osparc-ops-environments/issues/672, we decided to investigate these three options:

(1) Importing from target deployment

Using an ad-hoc GUI, the user can import their projects from another deployment.

Prerequisites:

user must have an account in both source and destination deployments
user must authenticate with their source credentials inside the destination deployment (this generates tokens for the purpose of importing projects)

Changes to oSPARC:

create endpoint for authenticating the user in another deployment
create endpoint for listing projects available to the user (maybe we can reuse something?)
create endpoint to start a copy (locks project): provides "project data" + "tokens to copy data from S3"
submit a job that "imports" the project: first sync the data, then insert the project in the DB; if it fails, remove the data
create endpoint to signal that the copy operation is done (unlocks project)
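
The import job described above (sync the data first, then insert the project, and remove the data if anything fails) could be sketched roughly as follows. This is a minimal illustration and not oSPARC code: `FakeStorage`, `FakeDatabase`, `import_project` and the `unlock_source` callback are hypothetical names, and the in-memory dictionaries only stand in for the destination S3 bucket and projects table.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeStorage:
    """In-memory stand-in for the destination S3 bucket (hypothetical)."""
    objects: dict[str, bytes] = field(default_factory=dict)


@dataclass
class FakeDatabase:
    """In-memory stand-in for the destination projects table (hypothetical)."""
    projects: dict[str, dict] = field(default_factory=dict)


def import_project(
    project: dict,
    source_files: dict[str, bytes],
    storage: FakeStorage,
    db: FakeDatabase,
    unlock_source: Callable[[], None],
) -> bool:
    """Sync data first, then insert the project; remove copied data on failure."""
    copied: list[str] = []
    try:
        for key, payload in source_files.items():
            storage.objects[key] = payload  # in practice: an S3-to-S3 copy
            copied.append(key)
        if project["uuid"] in db.projects:
            raise ValueError(f"project {project['uuid']} already exists")
        db.projects[project["uuid"]] = project
        return True
    except Exception:
        # Roll back: drop every object copied by this job.
        # Assumes keys are unique to the imported project.
        for key in copied:
            storage.objects.pop(key, None)
        return False
    finally:
        # Corresponds to the "copy operation is done" signal (unlocks the source project).
        unlock_source()
```

The source project is unlocked in both the success and the failure path, so a failed import does not leave it locked.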

PROS:

not very complex: we rely on already existing tools and just add a few new API endpoints
can potentially be used internally to make a copy of an existing project (by targeting the same deployment)
avoids creating a "data model" for exporting and importing user data, because rclone copies the data directly from S3 to S3

CONS:

user does not get access to their data (they can only move it from deployment A to deployment B)

(2) Archiving

Generate an archive containing project data and data stored in all nodes.

Prerequisites:

user must have an account in both source and destination deployments
user must have enough disk space to download the archive to their computer

Changes to oSPARC:

create endpoint for starting the export procedure
background job that creates the archive:
- download files and put them in an archive, possibly compressing them + package the data model for the project
- upload the archive to S3 (with an expiration)
- notify user (via email?) that the archive is available for download
robust upload process that is able to resume (requires backend/frontend coordination)
- split file into chunks
- retry if chunk fails to upload
- reassemble the chunks into a single file
import process (once file is available start import)
- check archive validity (nobody tampered with it)
- extract data from the archive and upload to S3 (rollback on error)
- insert project in DB
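
The resumable upload and the validity check listed above could look roughly like the sketch below. It is only an illustration under stated assumptions: `split_into_chunks`, `upload_with_retry`, `upload_archive` and `reassemble` are hypothetical helpers, `send` stands in for whatever call actually transfers a chunk, and the tiny `CHUNK_SIZE` is for demonstration only. Note that a plain SHA-256 digest detects corruption; protecting against deliberate tampering would additionally require the digest to come from a trusted channel (or a signature), which is not shown here.

```python
import hashlib
from typing import Callable, Iterator

CHUNK_SIZE = 4  # deliberately tiny for the example; real chunks would be megabytes


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> Iterator[tuple[int, bytes]]:
    """Yield (index, chunk) pairs covering the whole payload."""
    for index, start in enumerate(range(0, len(data), size)):
        yield index, data[start:start + size]


def upload_with_retry(
    send: Callable[[int, bytes], None], index: int, chunk: bytes, max_attempts: int = 3
) -> None:
    """Retry a single chunk; re-raise once the attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            send(index, chunk)
            return
        except ConnectionError:
            if attempt == max_attempts:
                raise


def upload_archive(
    data: bytes, send: Callable[[int, bytes], None], already_uploaded: frozenset = frozenset()
) -> str:
    """Upload missing chunks only (resume) and return the archive's SHA-256."""
    for index, chunk in split_into_chunks(data):
        if index in already_uploaded:
            continue  # chunk survived a previous, interrupted upload
        upload_with_retry(send, index, chunk)
    return hashlib.sha256(data).hexdigest()


def reassemble(chunks: dict[int, bytes], expected_sha256: str) -> bytes:
    """Join chunks in index order and verify the result against the digest."""
    data = b"".join(chunks[i] for i in sorted(chunks))
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("archive checksum mismatch")
    return data
```

A missing or altered chunk changes the digest of the reassembled file, so the import can refuse the archive before anything is written to S3 or the DB.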

PROS:

user has a physical copy of the data; by opening the archive they could extract a single file

CONS:

requires a third-party computer (the user's) to download the archive and upload it again
adds two extra steps compared to solution (1): archive creation and archive extraction
requires more moving parts:
- links that expire
- archive management: import + export
- there is one extra job queue (for exporting)

(3) Migration

The idea here is to migrate one deployment to another.

migrate S3 data
Database Migration (issues with autogenerated integer primary/foreign keys) - Potential solutions:
- Change the primary keys to randomly generated string IDs.
- Retain integer keys but artificially increase the integers by a large number.
- Change int to string and add some prefix (different prefix in different deployment)
- Almost all tables:
  - clusters, cluster_to_groups
  - comp_runs
  - comp_tasks
  - folders, workspaces
  - groups + all resources access rights
  - payments
  - resource tracker
  - pricing plans / units / costs
  - users
  - ...
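
The three key strategies listed above can be compared with a short sketch. The function names, the `_` separator and the default offset are assumptions made for illustration; none of them come from the oSPARC code base.

```python
import uuid


def random_string_id() -> str:
    """Option 1: replace the integer key with a randomly generated string ID."""
    return uuid.uuid4().hex


def shifted_id(integer_id: int, offset: int = 1_000_000_000) -> int:
    """Option 2: keep integer keys but shift them by a large number."""
    return integer_id + offset


def prefixed_id(deployment_prefix: str, integer_id: int) -> str:
    """Option 3: turn the integer into a string with a per-deployment prefix."""
    return f"{deployment_prefix}_{integer_id}"
```

With option 3, the same integer key from two deployments maps to two distinct strings (for example `prefixed_id("src", 42)` and `prefixed_id("dst", 42)`), whereas option 2 only stays collision-free as long as the destination keys remain below the chosen offset.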

PROS:

We will not face issues with migration between deployments in the future.

CONS:

It is a one-time effort to fully migrate one deployment into another (not a feature for users as in the previous cases)

### Tasks

Brainstorming on Sep.27, 2024

@giancarloromeo, @GitHK, @pcrespov, @matusdrobuliak66

There was no consensus on a clear preference for any of the proposed solutions above. Below are some notes from the discussion.

Data Migration from Source to Destination Database

When migrating data between databases, especially PostgreSQL tables with identifiers and relationships, it’s important to go beyond just viewing it as a transfer of data rows. The semantics of the data (i.e., the meaning of the entities and their relationships) must also be considered. Still, some of the key challenges can already be identified, particularly around merging data that exists in both the source and destination databases:

Key Challenges:

Integer Identifiers:
- Apply an offset to the source table IDs by adding the maximum ID value from the destination table to avoid conflicts.
- While it’s not mandatory, switching to unique, descriptive identifiers (Stripe-like IDs such as name_1456123456asdfa45) would be preferable.
Merging Existing Resources (e.g., Users, Products):
- Users: Handle records where users have the same email address in both source and destination databases.
- Products: Manage cases where products share the same product name across both databases.
- Group 1: Identify and handle additional resource overlaps.
Maintaining Dependencies (e.g., Groups):
- To preserve data integrity, ensure that related records (e.g., groups) are inserted in the correct order during migration. This guarantees that dependencies are maintained.
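
The three challenges above can be made concrete with a small sketch: source users are merged by email, new users receive their ID shifted by the maximum destination ID, foreign keys are rewritten with the resulting mapping, and the table insertion order is derived from the dependencies. Everything here is an assumption for illustration: the row shapes, the column names (`id`, `email`, `owner_id`) and the table dependencies are hypothetical, not the actual oSPARC schema.

```python
from graphlib import TopologicalSorter


def merge_users(src_users: list[dict], dst_users: list[dict]) -> dict[int, int]:
    """Map each source user ID to its ID in the destination.

    Users whose email already exists in the destination reuse that record;
    all others get their source ID shifted by the maximum destination ID.
    """
    offset = max((u["id"] for u in dst_users), default=0)
    dst_by_email = {u["email"].lower(): u["id"] for u in dst_users}
    mapping: dict[int, int] = {}
    for user in src_users:
        existing = dst_by_email.get(user["email"].lower())
        mapping[user["id"]] = existing if existing is not None else user["id"] + offset
    return mapping


def remap_foreign_keys(rows: list[dict], column: str, mapping: dict[int, int]) -> list[dict]:
    """Rewrite a foreign-key column so it points at the destination IDs."""
    return [{**row, column: mapping[row[column]]} for row in rows]


def insertion_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return tables ordered so that every table comes after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())
```

For example, `insertion_order({"users": {"groups"}, "projects": {"users"}, "groups": set()})` places `groups` before `users` and `users` before `projects`, so no row is inserted before the record it references.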

A Semantic Approach to Migration

Considering the database's structure and meaning, a more strategic approach is to break the migration into stages based on different contexts. This allows for grouping related tables and migrating them together, either manually or automatically.

Identified Contexts:

Platform Configurations:
- Clusters
- Products
- Product Prices
- (...)
Users:
- Users
- Wallets
- User Preferences (Frontend)
- (Additional user-related tables)
Services:
- Service Metadata
- Service Access Rights
- (Additional service-related tables)
Studies (Projects + Data):
- Projects
- Folders
- File Metadata
- (...)

Migration Process Requirements

Data Integrity Checks:
- Every step of the migration process must include validation checks to ensure data integrity, preventing corruption or data loss.
Checkpoints for Rollback:
- Implement checkpoints at various stages of the migration to allow for reversion in case a data integrity check fails, ensuring a safe fallback.
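
One possible way to combine integrity checks with rollback checkpoints is to run every migration stage inside a database savepoint. The sketch below uses SQLite from the Python standard library purely so that it is self-contained (PostgreSQL also supports `SAVEPOINT`); `run_stages`, the `Stage` tuple and the `sp_` savepoint prefix are hypothetical, and the stage names are assumed to be trusted identifiers.

```python
import sqlite3
from typing import Callable

# (stage name, migrate function, integrity check returning True when data is valid)
Stage = tuple[str, Callable[[sqlite3.Connection], None], Callable[[sqlite3.Connection], bool]]


def run_stages(conn: sqlite3.Connection, stages: list[Stage]) -> list[str]:
    """Run each stage inside a savepoint; roll back and stop at the first failure.

    Expects a connection opened with isolation_level=None so that the
    sqlite3 module does not start or commit transactions on its own.
    """
    completed: list[str] = []
    for name, migrate, check in stages:
        conn.execute(f"SAVEPOINT sp_{name}")  # checkpoint before the stage
        try:
            migrate(conn)
            ok = check(conn)
        except sqlite3.Error:
            ok = False
        if not ok:
            conn.execute(f"ROLLBACK TO SAVEPOINT sp_{name}")  # undo this stage only
            conn.execute(f"RELEASE SAVEPOINT sp_{name}")
            break
        conn.execute(f"RELEASE SAVEPOINT sp_{name}")
        completed.append(name)
    return completed
```

Stages that passed their check stay in place, while the failing stage is reverted to its checkpoint and later stages are not attempted.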

Features

Even though this process will mostly be carried out once and in the backend, it could be of great value to make the ability to import/export studies available as a standalone feature for users.

ITISFoundation / osparc-simcore

Migration between deployments/Export project functionality #5824
