Open coleslaw481 opened 7 years ago
Digging in code found this sql statement to determine what datasets have NOT been archived:
select mpid, archived_date from microscopy_products where archived_date is not null
Above extracted from Old_CCDB_Projects/Data_browser_servlet/src/java/edu/ucsd/ccdb/ncmir/data_browser/servlet/DBUtil.java getArchivedMPIDs() method.
To archive, it looks like the following command needs to be run on a specific server (not sure as what user):
ccdb_archivedata <MPID> /telescience/home/CCDB_DATA_USER.portal/CCDB_DATA_USER/acquisition/project_"+projectID+"/microscopy_"+mpid+"/rawdata
Mostly putting this in here so I don't have to remember it all later.
That select looks wrong, wouldn't you want the results where archived_date is NULL?
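To make the difference concrete, here is a small sketch of the two variants of that select. It uses an in-memory SQLite table as a stand-in (the real CCDB database engine and the rest of the microscopy_products schema are not shown in this thread, so the table definition and the rows below are made up). The point is only that "is not null" gives the archived MPIDs, which matches the getArchivedMPIDs() name, while "is null" gives the ones still waiting.

```python
import sqlite3


def split_by_archive_status(conn):
    """Return (archived_mpids, unarchived_mpids) from microscopy_products."""
    archived = [row[0] for row in conn.execute(
        "select mpid from microscopy_products "
        "where archived_date is not null order by mpid")]
    unarchived = [row[0] for row in conn.execute(
        "select mpid from microscopy_products "
        "where archived_date is null order by mpid")]
    return archived, unarchived


# Throwaway table with invented rows, only to exercise the two queries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table microscopy_products (mpid integer, archived_date text)")
conn.executemany(
    "insert into microscopy_products values (?, ?)",
    [(1, "2017-01-01"), (2, None), (3, None)],
)
archived, unarchived = split_by_archive_status(conn)
```

So if the goal is "what has NOT been archived", the second query is the one to use.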
That command (ccdb_archivedata) takes 2 paths: a physical_source_path and a destination_irods_path. The portal can tell you the project for a given mpid via http://ccdb-portal.crbs.ucsd.edu:8081/Data_browser_servlet/getProjectID.jsp?mpid=<$mpid> (you have to scrape the contents; it returns HTML).
Since I just wrote a script for Willy to find data in the archive, I decided I might as well complete the suite, so I have 3 goodies for you:
https://irods-api.crbs.ucsd.edu/find_dj?scope=fv1000_2 (returns all unarchived things for a scope)
https://irods-api.crbs.ucsd.edu/find_dj?user=keunyoung (returns all unarchived things for a user)
[These first 2 will return some things that have already been archived. That's because there is about 7 TB of cruft in datajail. You likely won't need them anyway; I only added them because it was copy pasta and change a var or two.]
https://irods-api.crbs.ucsd.edu/find_dj?mpid=76154 (returns the path in datajail if you have its mpid)
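If it helps, a script could build those three URLs with something like the sketch below. The helper name find_dj_url is made up, and the "exactly one of scope, user, mpid" check is an assumption on my part (nothing above says whether the endpoint accepts combinations). It only builds the URL; the response format isn't described here, so fetching and parsing are left out.

```python
from urllib.parse import urlencode

FIND_DJ = "https://irods-api.crbs.ucsd.edu/find_dj"


def find_dj_url(*, scope=None, user=None, mpid=None):
    """Build a find_dj query URL for one of: scope, user, mpid."""
    candidates = {"scope": scope, "user": user, "mpid": mpid}
    given = {k: v for k, v in candidates.items() if v is not None}
    # Assumption: one filter per request, as in the three examples above.
    if len(given) != 1:
        raise ValueError("pass exactly one of scope, user, mpid")
    return f"{FIND_DJ}?{urlencode(given)}"
```

For example, find_dj_url(mpid=76154) gives back the third URL above.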
So your script should only need an mpid to work, i.e. archive 1234:
script finds 1234 in datajail using the above method, so it has the source path
script asks the portal for the project, so it can construct the irods destination
script drops the job in beanstalk
profit!
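A rough sketch of that flow, minus the beanstalk part (the job format isn't described in this thread, so I'm not guessing at it). The two lookups are passed in as plain functions so the network pieces stay swappable; plan_archive, irods_destination and the fake lookups are all hypothetical names, and the destination template is the one from the ccdb_archivedata example at the top of this issue, assuming that path really is the iRODS destination.

```python
def irods_destination(project_id, mpid):
    """Destination path, using the template quoted earlier in this issue."""
    return (
        "/telescience/home/CCDB_DATA_USER.portal/CCDB_DATA_USER/acquisition/"
        f"project_{project_id}/microscopy_{mpid}/rawdata"
    )


def plan_archive(mpid, find_source_path, get_project_id):
    """Collect what an archive job needs, given only an mpid.

    find_source_path(mpid) -> datajail path or None (e.g. via find_dj)
    get_project_id(mpid)   -> project id (e.g. scraped from the portal)
    """
    source = find_source_path(mpid)
    if not source:
        raise LookupError(f"mpid {mpid} not found in datajail")
    project_id = get_project_id(mpid)
    return {
        "mpid": mpid,
        "source": source,
        "destination": irods_destination(project_id, mpid),
    }


# Fake lookups with invented values, standing in for find_dj and the portal.
job = plan_archive(
    1234,
    lambda m: f"/ccdbprod/ccdbproddj0/somescope/someuser/CCDBID_{m}",
    lambda m: 42,
)
```

The resulting dict is what would then get handed to whatever drops the job in beanstalk.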
One thing of note: all three work off /ccdbprod/ccdbproddj0/$scope/$user/CCDBID_$mpid/CCDB_MetaData/. If that CCDB_MetaData dir doesn't exist (I think it's created by the id generator) or the pathing is different, it won't find it. Since everything recent got to DJ by script, that should never be a problem, but now you know.
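For anyone scripting against that layout, here is a minimal sketch that pulls scope, user and mpid back out of a datajail path. The regex is just my reading of the pattern above (it assumes the mpid is all digits and that scope and user never contain a slash), and the example path in the usage note is stitched together from values mentioned in this thread, not a real dataset.

```python
import re

# /ccdbprod/ccdbproddj0/$scope/$user/CCDBID_$mpid[/...]
DJ_PATTERN = re.compile(
    r"^/ccdbprod/ccdbproddj0/"
    r"(?P<scope>[^/]+)/(?P<user>[^/]+)/CCDBID_(?P<mpid>\d+)(?:/|$)"
)


def parse_dj_path(path):
    """Return (scope, user, mpid) for a datajail path, or None if it
    does not follow the expected layout."""
    match = DJ_PATTERN.match(path)
    if match is None:
        return None
    return match["scope"], match["user"], int(match["mpid"])
```

Something like parse_dj_path("/ccdbprod/ccdbproddj0/fv1000_2/keunyoung/CCDBID_76154/CCDB_MetaData/") would come back as a (scope, user, mpid) tuple, and anything with different pathing comes back as None, which mirrors the "it won't find it" caveat.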
In the very near future users won't need to "archive", but they will need to tell us when they are done processing. I already wrote that bit at https://jill.crbs.ucsd.edu/mpid/finish and it just changes the status in MQTT to PROCESSED and gets the processed data moved into the archive, all in the background. If they want to be able to do that from the command line, I think I wrote it so you can just throw an mpid at it, but I'd have to peek under the hood to verify that. It does do some checks to ensure the MPID is already in an acceptable state to be archived (like it's not still on the scope or already archived).
Also, now that I think about it, in the future all this info will be in MQTT, i.e. ccdb/56314/state SYNCED:win-ds-test_ccdbuser_CCDBID_56314
That gives us the state and the path in datajail, which will be renamed receiving and will be read only but mounted and shared via samba/nfs.
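A minimal sketch of picking that message apart, assuming the topic is ccdb/<mpid>/state and the payload is STATE or STATE:location (that split is my reading of the one example above, not a documented format; parse_state_message is a made-up name). No MQTT client is involved here; it only handles the topic and payload strings once you have them.

```python
def parse_state_message(topic, payload):
    """Split a ccdb/<mpid>/state message into mpid, state and location."""
    parts = topic.split("/")
    if len(parts) != 3 or parts[0] != "ccdb" or parts[2] != "state":
        raise ValueError(f"unexpected topic: {topic!r}")
    mpid = int(parts[1])
    # Assumption: everything after the first ':' is the datajail location.
    state, sep, location = payload.partition(":")
    return {
        "mpid": mpid,
        "state": state,
        "location": location if sep else None,
    }
```

With the example above, the state comes out as SYNCED and the location as win-ds-test_ccdbuser_CCDBID_56314; a bare payload such as PROCESSED would just have no location.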
Not sure on the name to use, but this tool should list all datasets in data jail and let the user optionally archive those datasets.
Behind the scenes this tool needs to do the following steps which are done via the portal: