Computational-Plant-Science / plantit

https://plantit.cyverse.org
BSD 3-Clause "New" or "Revised" License

DIRT migration #301

Open wpbonelli opened 2 years ago

wpbonelli commented 2 years ago

Each DIRT user may have any number of image sets. We need to prompt them to transfer their image sets to correspondingly named folders in the data store, so that they can then use plantit to run DIRT. We can prompt the user via:

We should also have a dedicated page for this in the documentation (walkthrough + screenshots).

When the user begins, we first detect whether they have any DIRT image sets. If so, for each set we transfer its files from tucco's attached NFS to a smaller temporary staging area (also an NFS) on portnoy, then transfer them to their own folder in the user's home directory in the data store, using the DIRT image set name as the folder name and preserving file names. We should also tag each dataset with metadata indicating its DIRT origin, as well as any attached metadata. We should show some kind of progress monitor in the UI, then send the user an email notification when the transfer completes.
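
Something like the following per-set Celery task could drive this (a rough sketch only; stage_file_from_tucco, upload_file_to_data_store, report_progress, tag_collection_metadata, and send_migration_email are hypothetical placeholders for whatever helpers we end up with, and the staging path is illustrative):

# Hypothetical sketch of the per-image-set migration task; helper functions
# and the staging path are placeholders, not existing plantit code.
from pathlib import Path
from celery import shared_task

@shared_task
def migrate_image_set(username: str, set_name: str, file_paths: list):
    staging = Path(f"/scratch/dirt_migration/{username}/{set_name}")  # assumed staging area on portnoy
    staging.mkdir(parents=True, exist_ok=True)
    target = f"/iplant/home/{username}/dirt_migration/{set_name}"     # data store folder named after the image set

    for i, remote_path in enumerate(file_paths):
        local_path = staging / Path(remote_path).name
        stage_file_from_tucco(remote_path, local_path)                # SFTP download from tucco's NFS
        upload_file_to_data_store(local_path, target)                 # push to the data store, file name preserved
        report_progress(username, set_name, i + 1, len(file_paths))   # drive the UI progress monitor
        local_path.unlink()                                           # keep the staging area small

    tag_collection_metadata(target, {"origin": "DIRT"})               # mark DIRT origin plus any attached metadata
    send_migration_email(username, set_name)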

Data transfer via:

IMPT: make a final backup of all the DIRT datasets before the migration period ends

wpbonelli commented 2 years ago

A minimal version of this is working now. Triggered from the top-right dropdown in the UI.

Remaining tasks:

It may also be worthwhile to let users select a folder to transfer their data to, or create a new one, rather than creating and using the hard-coded path /iplant/home/{username}/dirt_migration as we currently do. It seems unlikely anyone will already have a folder with that name, so we are probably safe from collisions, but it would be better not to rely on that.
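
If we do keep a default location, a cheap safeguard would be to probe for an existing folder and fall back to an unused name (sketch only; data_store_path_exists and create_data_store_folder are hypothetical wrappers around the data store API):

# Sketch of a collision-safe default folder; the two helpers are hypothetical.
def ensure_migration_folder(username: str) -> str:
    base = f"/iplant/home/{username}/dirt_migration"
    path, suffix = base, 1
    while data_store_path_exists(path):   # user already has a folder with this name
        path = f"{base}_{suffix}"         # fall back to dirt_migration_1, _2, ...
        suffix += 1
    create_data_store_folder(path)
    return path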

wpbonelli commented 2 years ago

In as of 92d06ed

wpbonelli commented 2 years ago

Reopening because we should preserve collection names and collection & image metadata. This can be pulled from the DIRT database given an image path. Because files are stored by date on the tucco NFS, rather than by collection membership, we need to do a lookup for each file separately. We may also need to lookup usernames, since some users' data seems to be associated with their full name rather than their CyVerse username.

SQL queries

The Drupal CMS produces a fairly unwieldy database schema in which every object is a node (i.e. an entity), and each node's associated information and metadata are scattered across various tables linked by foreign keys. We will need a number of queries to extract the relevant information:

select fid from file_managed where uri like 'public://{path}%';

select entity_id from field_data_field_root_image where field_root_image_fid = {field_root_image_fid};

select * from field_data_field_root_image_metadata where entity_id = {entity_id};
select * from field_data_field_root_image_resolution where entity_id = {entity_id};
select * from field_data_field_root_img_age where entity_id = {entity_id};
select * from field_data_field_root_img_dry_biomass where entity_id = {entity_id};
select * from field_data_field_root_img_family where entity_id = {entity_id};
select * from field_data_field_root_img_fresh_biomass where entity_id = {entity_id};
select * from field_data_field_root_img_genus where entity_id = {entity_id};
select * from field_data_field_root_img_spad where entity_id = {entity_id};
select * from field_data_field_root_img_species where entity_id = {entity_id};

select * from field_data_field_marked_coll_root_img_ref where field_marked_coll_root_img_ref_target_id = {entity_id};

select * from node where nid = {entity_id};

select * from field_data_field_collection_metadata where entity_id = {entity_id};
select * from field_data_field_collection_location where entity_id = {entity_id};
select * from field_data_field_collection_plantation where entity_id = {entity_id};
select * from field_data_field_collection_harvest where entity_id = {entity_id};
select * from field_data_field_collection_soil_group where entity_id = {entity_id};
select * from field_data_field_collection_soil_moisture where entity_id = {entity_id};
select * from field_data_field_collection_soil_nitrogen where entity_id = {entity_id};
select * from field_data_field_collection_soil_phosphorus where entity_id = {entity_id};
select * from field_data_field_collection_soil_potassium where entity_id = {entity_id};
select * from field_data_field_collection_pesticides where entity_id = {entity_id};
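
For reference, the lookups chain together roughly like this (a sketch assuming the Drupal database is MySQL and reachable via pymysql; it also assumes the entity_id on the marked-collection reference row is the collection node's nid and that node.title holds the collection name):

# Sketch of chaining the lookups for one image path, assuming a pymysql
# connection to the Drupal database; assumes each lookup returns a row.
import pymysql

def lookup_image_metadata(conn, path: str) -> dict:
    with conn.cursor() as cur:
        # managed file record for the image
        cur.execute("select fid from file_managed where uri like %s", (f"public://{path}%",))
        fid = cur.fetchone()[0]

        # root image entity referencing the file
        cur.execute("select entity_id from field_data_field_root_image where field_root_image_fid = %s", (fid,))
        image_id = cur.fetchone()[0]

        # per-image fields (metadata, resolution, age, biomass, taxonomy, ...)
        image_fields = {}
        for table in ("field_data_field_root_image_metadata",
                      "field_data_field_root_image_resolution",
                      "field_data_field_root_img_age",
                      "field_data_field_root_img_species"):
            cur.execute(f"select * from {table} where entity_id = %s", (image_id,))
            image_fields[table] = cur.fetchall()

        # marked collection referencing this root image, then its node (assumed
        # to carry the collection name in node.title)
        cur.execute("select entity_id from field_data_field_marked_coll_root_img_ref "
                    "where field_marked_coll_root_img_ref_target_id = %s", (image_id,))
        collection_id = cur.fetchone()[0]
        cur.execute("select title from node where nid = %s", (collection_id,))
        collection_name = cur.fetchone()[0]

    return {"fid": fid, "image": image_fields, "collection": collection_name}
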
wpbonelli commented 2 years ago

Depends on #312

wpbonelli commented 2 years ago

Occasionally the Celery process running the migration gets killed for excess memory usage, e.g.: Process 'ForkPoolWorker-2' pid:17 exited with 'signal 9 (SIGKILL)'

Might need to give the Celery container more memory

Update: this could be a Paramiko memory leak where the client and transport don't clean up after themselves properly. Currently we keep a single client open for the entire migration and reuse it for every SFTP download. We might be able to resolve this by opening and closing a new client for each file transferred, at the risk of slowing things down a bit due to connection overhead.
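
A sketch of the per-file approach (host and credential parameters are illustrative):

# Sketch: open and close a fresh SSH/SFTP client per file so the transport
# can't accumulate memory across the whole migration.
import paramiko

def download_file(host: str, username: str, key_path: str, remote_path: str, local_path: str):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username=username, key_filename=key_path)
        sftp = client.open_sftp()
        try:
            sftp.get(remote_path, local_path)
        finally:
            sftp.close()
    finally:
        client.close()  # ensure the transport is torn down even on failure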

wpbonelli commented 2 years ago

Refactor in progress to use Celery's eventlet execution pool for non-blocking IO. This should dramatically speed up data transfer, since we can perform many file downloads/uploads concurrently instead of serially.
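
For reference, the fan-out could look roughly like this (transfer_file is a hypothetical per-file task; the worker itself would run with an eventlet pool, e.g. celery -A <app> worker -P eventlet -c 100):

# Sketch: fan out per-file transfers as individual tasks so the eventlet
# pool can run many blocking-IO downloads/uploads concurrently.
from celery import group

def migrate_image_set_concurrently(username: str, set_name: str, file_paths: list):
    job = group(transfer_file.s(username, set_name, p) for p in file_paths)  # transfer_file is hypothetical
    return job.apply_async()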