Open dkirkby opened 9 years ago
I think https://github.com/dkirkby/bossdata/issues/52 has bearing on this. Reversing part of my comment there, I'm thinking e.g. `bosslocal` is expanded to a general e.g. `bossmgr` util that allows for management of the local 'raw' files, the DB's (addition, removal, listing of indexes), archiving/deletion, etc. from one cmd line util.
I agree that the new command-line tool should know about the raw files that back up each sqlite db and be able to manage them intelligently. For example, a `--prune` option might delete the raw file as long as the db file is present. Perhaps `bosslocalmgr` for the name?
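A minimal sketch of the `--prune` rule described above (delete a raw file only while its db file is present). The `prune_raw_files` helper and the explicit `(raw, db)` pairing are hypothetical; how bossdata actually maps raw files to sqlite files is not assumed here.

```python
from pathlib import Path


def prune_raw_files(pairs, dry_run=True):
    """Delete each raw file whose sqlite db file already exists.

    `pairs` is an iterable of (raw_path, db_path) tuples supplied by the
    caller (a hypothetical mapping, not bossdata's own bookkeeping).
    Returns the raw files that were removed, or would be in a dry run.
    """
    removed = []
    for raw_path, db_path in pairs:
        raw_path, db_path = Path(raw_path), Path(db_path)
        # Only touch the raw file when its db counterpart is present.
        if raw_path.is_file() and db_path.is_file():
            if not dry_run:
                raw_path.unlink()
            removed.append(raw_path)
    return removed
```

Defaulting to `dry_run=True` means nothing is deleted unless the caller asks for it explicitly.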
Waiting on my virtual machine drive to back up so I can resize it is giving me ample free time; here is a rough rundown of the functionality I can think of, part plan, part question, part wishlist:
Features as actually implemented:
- `--total-files`
- `--delete`
- `find`-like options: `-amin`, `-atime`, `-cmin`, `-ctime`
- `-filename`
- `--delete-test` (perhaps change to `--delete-dry-run`?)
- `--delete` will quite happily delete all your many hours/GB's of downloaded data files. Be careful!
- `--total-files` work with same targeting options as `--delete`
- `--db-index-print`
- `--db-index-update FILENAME`, where FILENAME points at a file consisting of lines formatted like: `catalog_name.index_name(column_name_1,column_name_2,...,column_name_n)`
- `--db-index-reset`
- `--db-compact` causes DB files to be resized (useful if indexes have been deleted and no new equivalent ones recreated; this frees space that would otherwise not be automatically freed.)
- `--prune` removes metadata files that have already been imported into DBs

`bosslocalmgr` settings
- `--verbose`, as usual
- `--catalogs` is a comma-delimited list specifying the catalogs (DBs) the DB commands will work on, e.g. `--catalogs FULL,QUASAR`
- [...] `--delete` in future version
- `--env` dumps environment vars related to `bossdata` for those who don't like grep
- `set-env` temporarily overrides environment settings with the defaults for the specified CATALOG name. Originally intended to allow setting of the parent process/shell environment, but that is on hold (maybe indefinitely). Kept the option to allow changing the process environment before any other commands are processed.

Next Iteration
- `--delete` does not remove empty directories; it should
- `--delete` could probably use a little more protection: "Are you sure?"-style checks, double confirms on deleting DB files; or, display a summary (a la `--delete-test` in compact form) and only delete on confirm?
- `--delete` will happily delete everything under a path. This should probably not be the default behavior!!!

Thrown out
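Coming back to the `--db-index-update FILENAME` option in the implemented features: the line format `catalog_name.index_name(column_name_1,...,column_name_n)` could be parsed along these lines. This is a sketch only; the `parse_index_spec` helper and the index and column names in the usage line are made up for illustration, and it assumes names consist of letters, digits and underscores.

```python
import re

# One index per line: catalog_name.index_name(column_1,column_2,...)
_INDEX_SPEC = re.compile(r"^(\w+)\.(\w+)\(([^()]*)\)$")


def parse_index_spec(line):
    """Split one index definition line into (catalog, index, columns)."""
    match = _INDEX_SPEC.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid index definition: {line!r}")
    catalog, index, column_list = match.groups()
    columns = [name.strip() for name in column_list.split(",")]
    # Reject "()" and stray commas such as "(a,,b)".
    if not all(columns):
        raise ValueError(f"empty column name in: {line!r}")
    return catalog, index, columns


# Hypothetical index on a FULL catalog:
print(parse_index_spec("FULL.plate_mjd(PLATE,MJD)"))
# → ('FULL', 'plate_mjd', ['PLATE', 'MJD'])
```

Failing loudly on a malformed line seems safer than skipping it silently, since a typo would otherwise leave an expected index uncreated.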
A few quick comments based on your notes (but I haven't looked at any code yet):
- [...] `--delete-test` to `--dry-run` since a lot of unix commands already use this convention.
- [...] `SpecFile` ctor, for example) actually update the last access time?
- [...] `bossdata` treats all downloaded files as read-only, I assume that modification time is not useful, but we should check this.
- [...] `$BOSS_DATA_URL` and `$BOSS_SAS_PATH`? The obvious answer is yes, but then you might miss a much larger data volume associated with different variables and delete the wrong files.
- [...] `--catalogs` arg, we already have another one (PLATELIST) and will probably be adding more, so you want an API that scales well to more types of catalog.

- [...] `$BOSS_LOCAL_ROOT`. I went back and forth on this, whether it should always start at root, further down under some 'catalog' specific directory, including REDUX or not; settled on the most global option. I wanted to add in restricting to some catalog in a future revision.
- [...] `bosslocalmgr`, but it shouldn't affect any other code.
- [...] `bossdata.conf` and associated config class that everything uses. Right now I'm not doing anything particularly graceful or future-proof.

Creation date is a filesystem dependent feature; which makes sense, but isn't something I'd considered, always assuming it to be there.
In any case: on the ext4 filesystem this is available, but not on ext2 or ext3; for these cases, modification date is the closest we can get (the date from the server is not preserved, at least as far as HTTP downloads go; not sure about globus). FAT?? and NTFS support this as well. The short answer is that -mtime and -mmin options are going to be needed.
Points at which a file has its dates set:
event | ctime* | mtime | atime | Note |
---|---|---|---|---|
Download | 1 | 1 | 1 | |
Access | 0 | 0 | 1 | Only with latest db_manage branch |
Index Update | 0 | 1 | 1 | If we keep this functionality |
ctime*: If supported, obviously
The table makes a good argument for just throwing out `-ctime` and `-cmin` and replacing them with `-mtime` etc. None of this takes into account how OSX works, however.
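One way to check the table on a particular filesystem is to read the timestamps directly with `os.stat`. The `file_times` helper below is a hypothetical sketch, not bossdata code. One caveat worth keeping in mind: on Linux, `st_ctime` is the time of the last inode (metadata) change rather than a creation time, and a separate `st_birthtime` field is only exposed on some platforms.

```python
import os
from datetime import datetime


def file_times(path):
    """Return the timestamps the OS reports for `path` as datetimes."""
    st = os.stat(path)
    times = {
        "atime": datetime.fromtimestamp(st.st_atime),  # last access
        "mtime": datetime.fromtimestamp(st.st_mtime),  # last content change
        # On Linux this is the last inode change, not the creation time.
        "ctime": datetime.fromtimestamp(st.st_ctime),
    }
    # st_birthtime only exists on some platforms (e.g. macOS, BSD).
    birth = getattr(st, "st_birthtime", None)
    if birth is not None:
        times["birthtime"] = datetime.fromtimestamp(birth)
    return times
```

Calling this on a file right after download, after a read, and after an index update would show which of the columns in the table actually change on the filesystem in use.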
This issue is to discuss and then implement a new script (called `bosslocal`?) that will scan your `$DATA_LOCAL_ROOT` and report things like:

[...]

Ideally, there would be options to only include files that have (or have not) been accessed in X days, etc, similar to the unix find command.
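As a rough illustration of such a scan, the sketch below walks a directory tree and totals the files, optionally keeping only those not accessed for a given number of days. The `scan_local_root` name and its parameters are hypothetical, and the filter is only similar in spirit to `find -atime`, not an exact match for its rounding rules.

```python
import os
import time


def scan_local_root(root, min_days_since_access=None, now=None):
    """Walk `root` and return (file_count, total_bytes).

    If `min_days_since_access` is given, only files whose last access
    is at least that many days old are counted.
    """
    now = time.time() if now is None else now
    count = total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.stat(os.path.join(dirpath, name))
            age_days = (now - st.st_atime) / 86400.0
            if min_days_since_access is not None and age_days < min_days_since_access:
                continue
            count += 1
            total += st.st_size
    return count, total
```

The result depends on access times being recorded at all; a filesystem mounted with `noatime` would make every file look stale.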
The script should also have options to do some cleanup of the largest files that have not been accessed recently, etc.
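A cleanup pass along these lines might first pick the largest files that have not been accessed recently and only then delete them. The sketch below is one possible shape under that assumption; `cleanup_candidates` and `delete_files` are hypothetical names, and deletion is a dry run unless explicitly requested.

```python
import os
import time


def cleanup_candidates(root, min_days_since_access, max_files=10, now=None):
    """Return up to `max_files` (size, path) tuples, largest first, for
    files under `root` not accessed in `min_days_since_access` days."""
    now = time.time() if now is None else now
    cutoff = now - min_days_since_access * 86400.0
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if st.st_atime <= cutoff:
                stale.append((st.st_size, path))
    stale.sort(reverse=True)  # biggest files first
    return stale[:max_files]


def delete_files(candidates, dry_run=True):
    """Delete the candidate files, or only list them when dry_run is True."""
    for size, path in candidates:
        action = "would delete" if dry_run else "deleting"
        print(f"{action} {path} ({size} bytes)")
        if not dry_run:
            os.remove(path)
```

Separating selection from deletion lets the same candidate list be printed for review before anything is removed.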