wpbonelli opened 2 years ago
A minimal version of this is working now. Triggered from the top-right dropdown in the UI.
Remaining tasks:
It may also be worthwhile to let users select a folder to transfer their data to, or create a new one, rather than creating and using a hard-coded path `/iplant/home/{username}/dirt_migration` as we currently do. It seems unlikely that anybody already has a folder with that name, so we are probably safe from collisions, but still.
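To stay safe from collisions, the migration task could check whether the target collection already exists and pick an unused name before creating it. A minimal sketch using python-irodsclient, assuming an authenticated `iRODSSession` (the helper name and numeric-suffix scheme are illustrative, not what we currently do):

```python
from irods.session import iRODSSession  # python-irodsclient


def unique_migration_collection(session: iRODSSession, username: str) -> str:
    """Return a collection path for the migration, avoiding collisions
    with any folder the user may already have."""
    base = f"/iplant/home/{username}/dirt_migration"
    path, suffix = base, 1
    # append a numeric suffix until we find a name that is not taken
    while session.collections.exists(path):
        path = f"{base}_{suffix}"
        suffix += 1
    session.collections.create(path)
    return path
```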
Reopening because we should preserve collection names and collection & image metadata. These can be pulled from the DIRT database given an image path. Because files are stored by date on the tucco NFS, rather than by collection membership, we need to do a lookup for each file separately. We may also need to look up usernames, since some users' data seems to be associated with their full name rather than their CyVerse username.
The Drupal CMS produces a fairly unwieldy database schema where every object is a node, i.e. an entity, and each node's associated information and metadata are scattered around various tables linked via foreign key. We will need a number of queries to extract the relevant information (a rough sketch of chaining these lookups follows the list below):
- `fid` given image path:
  ```sql
  select fid from file_managed where uri like 'public://{path}%';
  ```
- `entity_id` given `field_root_image_fid`:
  ```sql
  select entity_id from field_data_field_root_image where field_root_image_fid = {field_root_image_fid};
  ```
- image metadata given `entity_id`:
  ```sql
  select * from field_data_field_root_image_metadata where entity_id = {entity_id};
  select * from field_data_field_root_image_resolution where entity_id = {entity_id};
  select * from field_data_field_root_img_age where entity_id = {entity_id};
  select * from field_data_field_root_img_dry_biomass where entity_id = {entity_id};
  select * from field_data_field_root_img_family where entity_id = {entity_id};
  select * from field_data_field_root_img_fresh_biomass where entity_id = {entity_id};
  select * from field_data_field_root_img_genus where entity_id = {entity_id};
  select * from field_data_field_root_img_spad where entity_id = {entity_id};
  select * from field_data_field_root_img_species where entity_id = {entity_id};
  ```
- collection `entity_id` given image `field_marked_coll_root_img_ref_target_id` (the image's `entity_id`):
  ```sql
  select * from field_data_field_marked_coll_root_img_ref where field_marked_coll_root_img_ref_target_id = {entity_id};
  ```
- collection node given `entity_id` (`nid`, node ID):
  ```sql
  select * from node where nid = {entity_id};
  ```
- collection metadata given `entity_id`:
  ```sql
  select * from field_data_field_collection_metadata where entity_id = {entity_id};
  select * from field_data_field_collection_location where entity_id = {entity_id};
  select * from field_data_field_collection_plantation where entity_id = {entity_id};
  select * from field_data_field_collection_harvest where entity_id = {entity_id};
  select * from field_data_field_collection_soil_group where entity_id = {entity_id};
  select * from field_data_field_collection_soil_moisture where entity_id = {entity_id};
  select * from field_data_field_collection_soil_nitrogen where entity_id = {entity_id};
  select * from field_data_field_collection_soil_phosphorus where entity_id = {entity_id};
  select * from field_data_field_collection_soil_potassium where entity_id = {entity_id};
  select * from field_data_field_collection_pesticides where entity_id = {entity_id};
  ```
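As a rough sketch of how the image-side lookups might be chained per file, assuming the DIRT Drupal database is MySQL and reachable via a driver like `pymysql` (the helper name, connection handling, and error handling are illustrative, not the current implementation):

```python
import pymysql  # assuming the Drupal database is MySQL; swap the driver if not

IMAGE_METADATA_TABLES = [
    'field_data_field_root_image_metadata',
    'field_data_field_root_image_resolution',
    'field_data_field_root_img_age',
    'field_data_field_root_img_dry_biomass',
    'field_data_field_root_img_family',
    'field_data_field_root_img_fresh_biomass',
    'field_data_field_root_img_genus',
    'field_data_field_root_img_spad',
    'field_data_field_root_img_species',
]


def image_metadata(cursor, path: str) -> dict:
    """Chain the lookups above: image path -> fid -> entity_id -> metadata rows.

    Omits handling for missing rows; a real implementation would check fetchone()
    results before indexing into them.
    """
    cursor.execute(
        "select fid from file_managed where uri like %s", (f"public://{path}%",))
    fid = cursor.fetchone()[0]
    cursor.execute(
        "select entity_id from field_data_field_root_image where field_root_image_fid = %s",
        (fid,))
    entity_id = cursor.fetchone()[0]
    metadata = {}
    for table in IMAGE_METADATA_TABLES:
        cursor.execute(f"select * from {table} where entity_id = %s", (entity_id,))
        metadata[table] = cursor.fetchall()
    return metadata
```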
Depends on #312
Occasionally the Celery process running the migration gets killed for excess memory usage, e.g. `Process 'ForkPoolWorker-2' pid:17 exited with 'signal 9 (SIGKILL)'`.
Might need to give the Celery container more memory.
Update: could be a Paramiko memory leak where the client and transport don't clean up after themselves properly. Currently we keep a client open for the entire migration and reuse it for every SFTP download. Might be able to resolve this by opening/closing a new client for each file transferred, at risk of slowing things down a bit due to overhead.
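If the leak really is in Paramiko's client/transport lifecycle, one low-effort mitigation is to scope a fresh client to each file transfer so resources are released deterministically. A rough sketch (host, credentials, and helper name are placeholders, not the current implementation):

```python
import paramiko


def download_file(host: str, username: str, password: str, remote_path: str, local_path: str):
    """Open a short-lived SSH/SFTP client for a single download, then close it.

    Trades some per-file connection overhead for deterministic cleanup of the
    client and transport, which should bound memory use if Paramiko is leaking.
    """
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username=username, password=password)
        sftp = client.open_sftp()
        try:
            sftp.get(remote_path, local_path)
        finally:
            sftp.close()
    finally:
        client.close()
```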
Refactor in progress to use Celery's eventlet execution pool for non-blocking I/O. This should dramatically speed up data transfer, as we can perform a large number of file downloads/uploads concurrently instead of serially.
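For reference, a minimal sketch of the shape this could take: run the worker with the eventlet pool and fan transfers out as one task per file (the app name, broker URL, and concurrency value are placeholders):

```python
# start the worker with, e.g.: celery -A plantit worker --pool eventlet --concurrency 500
from celery import Celery, group

app = Celery('plantit', broker='redis://localhost:6379/0')


@app.task
def transfer_file(remote_path: str, local_path: str):
    # blocking network I/O here is green-threaded under the eventlet pool,
    # so many of these tasks can run concurrently in a single worker
    ...


def transfer_all(paths):
    # fan out one task per (remote, local) pair and return the batch handle
    job = group(transfer_file.s(remote, local) for remote, local in paths)
    return job.apply_async()
```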
Each DIRT user may have any number of image sets. We need to prompt them to transfer their image sets to correspondingly named folders in the data store, so that they can then use plantit to run DIRT. We can prompt the user via:

We should also have a dedicated page for this in the documentation (walkthrough + screenshots).
When the user begins, we first detect whether they have any DIRT image sets. If so, for each one, transfer files from tucco's attached NFS to a smaller temporary staging area (also an NFS) on portnoy, then transfer to its own folder in the user's home directory in the data store, with the folder named after the DIRT image set and file names preserved. We should also decorate datasets with a metadata tag indicating DIRT origin, as well as any attached metadata. We should show some kind of progress monitor in the UI, then send an email notification to the user.

Data transfer via:
- iCommands container? (see the sketch below)

IMPT: make a final backup of all the DIRT datasets before the migration period ends
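If we do go with iCommands, a rough sketch of the upload-and-tag step, assuming the staged image set is visible to the container and the user is already authenticated with `iinit` (the paths, AVU, and helper name are illustrative):

```python
import subprocess


def push_image_set(staging_dir: str, username: str, image_set: str):
    """Upload one staged DIRT image set to the user's home collection
    and tag it so its origin is discoverable later."""
    collection = f"/iplant/home/{username}/{image_set}"
    # recursively upload the staged folder as its own collection
    subprocess.run(["iput", "-r", "-f", staging_dir, collection], check=True)
    # attach an AVU marking the collection as migrated from DIRT
    subprocess.run(["imeta", "add", "-C", collection, "origin", "DIRT"], check=True)
```

The same two commands would work equally well from a host install of iCommands or inside a container.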