Create tool to archive datasets in data jail

coleslaw481 commented 7 years ago

Not sure on the name to use, but this tool should list all datasets in data jail and let the user optionally archive those datasets.

Behind the scenes this tool needs to do the following steps which are done via the portal:

go to the portal
press the archive button
calls ccdbprod_archive (one of Sean’s scripts)
ccdbprod_archive drops the job in a beanstalk queue

coleslaw481 commented 7 years ago

Digging in code found this sql statement to determine what datasets have NOT been archived:

select mpid, archived_date from microscopy_products where archived_date is not null

Above extracted from Old_CCDB_Projects/Data_browser_servlet/src/java/edu/ucsd/ccdb/ncmir/data_browser/servlet/DBUtil.java getArchivedMPIDs() method.

coleslaw481 commented 7 years ago

To archive looks like the following command needs to be run on a specific server, not sure what user:

ccdb_archivedata <MPID> /telescience/home/CCDB_DATA_USER.portal/CCDB_DATA_USER/acquisition/project_"+projectID+"/microscopy_"+mpid+"/rawdata

SyBernot commented 6 years ago

Mostly putting this in here so I don't have to remember it all later.

That select looks wrong, wouldn't you want the results where archived_date is NULL?
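For reference, a minimal sketch of the query with the condition flipped to IS NULL. Only the table and column names come from the thread. The in-memory SQLite database, the sample rows, and the helper names are assumptions made so the example runs anywhere; the production database is not SQLite as far as this thread shows.

```python
import sqlite3

# Datasets that have NOT been archived have no archived_date.
UNARCHIVED_SQL = (
    "SELECT mpid FROM microscopy_products "
    "WHERE archived_date IS NULL ORDER BY mpid"
)


def list_unarchived_mpids(conn):
    """Return the MPIDs whose archived_date is NULL."""
    return [row[0] for row in conn.execute(UNARCHIVED_SQL)]


def make_demo_db():
    """Build a throwaway table (hypothetical data) with the two columns used above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE microscopy_products (mpid INTEGER, archived_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO microscopy_products VALUES (?, ?)",
        [(1001, "2017-01-15"), (1002, None), (1003, None)],
    )
    return conn
```

With the demo rows, only the two MPIDs without an archived_date come back.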

That command (ccdb_archivedata) takes 2 paths: a physical_source_path and a destination_irods_path. If you have an mpid, the portal can tell you which project it belongs to via http://ccdb-portal.crbs.ucsd.edu:8081/Data_browser_servlet/getProjectID.jsp?mpid=<$mpid> (you have to scrape the contents of , it returns html)
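A rough sketch of the scraping step. The exact shape of the HTML that getProjectID.jsp returns is not shown in this thread, so this assumes the project ID is the first run of digits in the page text. The function name and the sample page in the usage note are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_project_id(html):
    """Return the first integer found in the page text, or None.

    Assumption: the page body contains the project ID as plain digits.
    """
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    match = re.search(r"\d+", " ".join(parser.chunks))
    return int(match.group()) if match else None
```

For a made-up response such as `<html><body> 20123 </body></html>` this returns the integer 20123; a page with no digits returns None, which the caller should treat as "project not found".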

Since I just wrote a script for Willy to find data in the archive I decided I might as well complete the suite, so I have 3 goodies for you.

https://irods-api.crbs.ucsd.edu/find_dj?scope=fv1000_2 (returns all unarchived things for a scope)

https://irods-api.crbs.ucsd.edu/find_dj?user=keunyoung (returns all unarchived things for a user)

[These first 2 will return some things that have already been archived. That's because there is about 7 TB of cruft in datajail. You likely won't need them anyway, I only added them because it was copy pasta and change a var or two]

https://irods-api.crbs.ucsd.edu/find_dj?mpid=76154 (returns the path in datajail if you have its mpid)
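A small sketch for building the three find_dj URLs. The base URL and the scope, user, and mpid parameters are the ones listed above. The helper name and the "exactly one parameter" rule are assumptions, since the examples only ever show one parameter at a time. The response format is not documented in this thread, so fetching and parsing is left out.

```python
from urllib.parse import urlencode

FIND_DJ_URL = "https://irods-api.crbs.ucsd.edu/find_dj"


def find_dj_url(*, scope=None, user=None, mpid=None):
    """Build a find_dj query URL for exactly one of scope, user, or mpid."""
    params = {"scope": scope, "user": user, "mpid": mpid}
    given = {key: value for key, value in params.items() if value is not None}
    if len(given) != 1:
        # Assumption: the endpoint is queried with one filter at a time.
        raise ValueError("pass exactly one of scope, user, or mpid")
    return f"{FIND_DJ_URL}?{urlencode(given)}"
```

Calling `find_dj_url(mpid=76154)` reproduces the third URL above.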

So your script should only need an mpid to work, i.e. archive 1234:

script finds 1234 in datajail using above method so it has the source path
script asks portal for project so it can construct the irods destination
script drops the job in beanstalk
profit!
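The flow described here could be sketched roughly as below. Every collaborator (the datajail lookup, the portal lookup, the destination builder, the queue) is passed in as a plain callable, because their real interfaces are not spelled out in this thread. All names and the job dictionary layout are hypothetical.

```python
def archive_mpid(mpid, find_source, get_project_id, make_destination, enqueue):
    """Find the source path, build the destination, and queue an archive job.

    find_source(mpid)              -> datajail path or None (e.g. via find_dj)
    get_project_id(mpid)           -> project ID (e.g. scraped from the portal)
    make_destination(project, mpid)-> iRODS destination path (layout is an assumption)
    enqueue(job)                   -> drops the job in the queue (e.g. beanstalk)
    """
    source = find_source(mpid)
    if source is None:
        raise LookupError(f"MPID {mpid} not found in data jail")
    project_id = get_project_id(mpid)
    destination = make_destination(project_id, mpid)
    # Hypothetical job layout; the real queue payload is not documented here.
    job = {"mpid": mpid, "source": source, "destination": destination}
    enqueue(job)
    return job
```

Injecting the callables keeps the sketch testable with fakes and avoids guessing at the real portal or beanstalk client code.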

One thing of note: all three work off /ccdbprod/ccdbproddj0/$scope/$user/CCDBID_$mpid/CCDB_MetaData/. If that CCDB_MetaData dir doesn't exist (I think it's created by the id generator) or the pathing is different, it won't find it. Since everything recent got to DJ by script that should never be a problem, but now you know.
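To check up front whether a dataset would be found, something like the following could work. The path layout is the one quoted above. The function names and the overridable root argument (handy for testing outside the real mount) are assumptions.

```python
from pathlib import Path

DJ_ROOT = Path("/ccdbprod/ccdbproddj0")


def metadata_dir(scope, user, mpid, root=DJ_ROOT):
    """Return the expected CCDB_MetaData directory for a dataset."""
    return Path(root) / scope / user / f"CCDBID_{mpid}" / "CCDB_MetaData"


def is_findable(scope, user, mpid, root=DJ_ROOT):
    """True if the CCDB_MetaData directory exists where the lookups expect it."""
    return metadata_dir(scope, user, mpid, root).is_dir()
```

If `is_findable` returns False, the dataset either lacks the CCDB_MetaData directory or sits under a different path, which are the two failure cases mentioned above.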

In the very near future users won't need to "archive", but they will need to tell us when they are done processing. I already wrote that bit here: https://jill.crbs.ucsd.edu/mpid/finish. It just changes the status in MQTT to PROCESSED and gets the processed data moved into the archive, all in the background. If they want to be able to do that from the command line, I think I wrote it so you can just throw an mpid at it, but I'd have to peek under the hood to verify that. It does do some checks to ensure the MPID is already in an acceptable state to be archived (like it's not still on the scope or already archived).

Also now that I think about it in the future all this info will be in MQTT i.e. ccdb/56314/state SYNCED:win-ds-test_ccdbuser_CCDBID_56314

That gives us the state and the path in datajail, which will be renamed "receiving" and will be read only, but mounted and shared via samba/nfs.
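A sketch of pulling the MPID, state, and location out of such a message. It assumes the example above reads as topic `ccdb/<mpid>/state` followed by payload `<STATE>:<location>`, and that the payload may lack the location part. The type and function names are hypothetical, and no MQTT client library is involved here.

```python
from typing import NamedTuple


class MpidState(NamedTuple):
    mpid: int
    state: str
    location: str


def parse_state_message(topic, payload):
    """Split a ccdb/<mpid>/state topic and a STATE:location payload."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "ccdb" or parts[2] != "state":
        raise ValueError(f"unexpected topic: {topic!r}")
    # partition leaves location empty when the payload has no colon.
    state, _, location = payload.partition(":")
    return MpidState(int(parts[1]), state, location)
```

For the example message, this yields mpid 56314, state SYNCED, and location win-ds-test_ccdbuser_CCDBID_56314.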

CRBS / ncmirtools

Create tool to archive datasets in data jail #12