DigitalSlideArchive / digital_slide_archive

The official deployment of the Digital Slide Archive and HistomicsTK.
https://digitalslidearchive.github.io
Apache License 2.0

Assetstore Import Tracker / Repeater #197

Closed manthey closed 2 years ago

manthey commented 2 years ago

This is a summary of a long-desired feature. Once a repo is created for such a feature, any issues related to it should be moved there (e.g., #193).

We'd like a Girder plugin that records when any import action is performed on an assetstore. It would record all of the options (path, destination, etc.) for arbitrary assetstore types, probably by hooking the import endpoint event, plus the time the import started.
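
As a very rough sketch, assuming Girder's `rest.<method>.<route>.after` event naming and that the event payload carries the request parameters (both of which should be verified against the Girder version in use), such a hook might look like:

```python
# Hypothetical sketch of a plugin hook that records assetstore imports.
# Event name and event.info layout are assumptions to verify against Girder.
import datetime

from girder import events, logprint
from girder.plugin import GirderPlugin


def _record_import(event):
    # For REST events, event.info is expected to carry the request parameters
    # (path, destinationId, destinationType, ...); verify the exact layout.
    params = event.info.get('params', {}) if isinstance(event.info, dict) else {}
    record = {
        'params': dict(params),
        'started': datetime.datetime.utcnow(),
    }
    # A real plugin would persist this in a dedicated collection or model so
    # it can be listed and replayed later; here we just log it.
    logprint.info('Assetstore import: %r' % record)


class ImportTrackerPlugin(GirderPlugin):
    DISPLAY_NAME = 'Import Tracker'

    def load(self, info):
        events.bind('rest.post.assetstore/:id/import.after',
                    'import_tracker', _record_import)
```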

We want to show a list of import actions, sorted most-recent first, with appropriate details and a button to repeat the import exactly as it was done before. This list would be accessible from a button somewhere on the assetstore list page and would probably need to be paged. For repeated imports with exactly the same options and assetstore, maybe instead of showing each import as a separate line, it would show a "number of times" and the most recent time? In the list, we want to show sensible names, not just Girder ids, for collections and folders.
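
One way to collapse identical repeats, assuming the import records live in a Mongo collection named `import_record` (a hypothetical name and schema, not anything that exists today), is a small aggregation keyed on assetstore plus parameters:

```python
# Hypothetical aggregation over an "import_record" collection: collapse
# identical imports into one row with a count and the most recent start time.
from pymongo import MongoClient


def list_imports(db, offset=0, limit=50):
    pipeline = [
        {'$group': {
            '_id': {'assetstoreId': '$assetstoreId', 'params': '$params'},
            'count': {'$sum': 1},
            'lastStarted': {'$max': '$started'},
        }},
        {'$sort': {'lastStarted': -1}},   # most recent first
        {'$skip': offset},                # simple paging
        {'$limit': limit},
    ]
    return list(db.import_record.aggregate(pipeline))


if __name__ == '__main__':
    db = MongoClient()['girder']          # assumes the default Girder database
    for row in list_imports(db):
        print(row['count'], row['lastStarted'], row['_id']['params'])
```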

As a bonus, it would be great if, when we went to an assetstore's import page, we showed the last few (10?) imports that were done for that assetstore, so that the user could redo them or see how they wanted to do something differently.

A further feature would be optionally modifying how repeated imports are done: currently, if a file doesn't exist in the expected target directory, it is created. We frequently import a directory tree of files, then organize them in Girder so they are no longer conceptually in the original directory tree. Reimporting makes duplicates of all of these files. It would be great if there were an option in import to "skip if the file is already in Girder somewhere" -- this can be done by matching the import path. If the file size has changed, we would update the existing file. A more sophisticated method would be to compute a hash and match on that -- the file might have been renamed either on the assetstore OR in Girder, and, if the hash matches, it would be nice to not create a duplicate. This would be slower, as the hash has to be computed.
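
A rough sketch of that skip/update decision, assuming we can look a file up by its recorded import path (the query fields and model calls below are guesses at the shape, not a finished API):

```python
# Hypothetical skip/update logic for re-importing a file that may already be
# indexed. Field names ("path", "imported", "size", "sha512") are assumptions
# about what the file document records, not a confirmed schema.
import hashlib
import os

from girder.models.file import File


def reimport_decision(abspath, compare_hash=False):
    existing = File().findOne({'path': abspath, 'imported': True})
    if existing is None and compare_hash:
        # Slower path: hash the file on disk and look for a match, so a file
        # renamed on the assetstore or in Girder is still recognized.
        existing = File().findOne({'sha512': _sha512(abspath), 'imported': True})
    if existing is None:
        return 'import'                      # not indexed anywhere: import it
    if existing['size'] != os.path.getsize(abspath):
        return 'update'                      # size changed: refresh the record
    return 'skip'                            # already indexed: leave it alone


def _sha512(abspath, chunk=1024 * 1024):
    h = hashlib.sha512()
    with open(abspath, 'rb') as fh:
        for block in iter(lambda: fh.read(chunk), b''):
            h.update(block)
    return h.hexdigest()
```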

It would be nice to have a feature to flag any file in Girder that is no longer available on an assetstore. For filesystem assetstores, this would confirm the path is reachable. For S3 assetstores, this would have to confirm the asset is still in the bucket (so it would probably be slow). If we did this, we would probably want to show a list of such files (or only such files on a specific assetstore, or only such files from a specific import path) and then have an option to delete the associated Girder items (and probably prune empty Girder folders, too).
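
The availability check itself might look roughly like this (the S3 portion assumes boto3 and assumes the file document records its bucket key; both are assumptions, not the existing implementation):

```python
# Hypothetical reachability check for an imported file, by assetstore type.
# The "s3Key" / "bucket" field names are assumptions about the document shape.
import os

import boto3
from botocore.exceptions import ClientError
from girder.constants import AssetstoreType


def file_is_reachable(file_doc, assetstore):
    if assetstore['type'] == AssetstoreType.FILESYSTEM:
        return os.path.exists(file_doc.get('path', ''))
    if assetstore['type'] == AssetstoreType.S3:
        s3 = boto3.client('s3')
        try:
            # One HEAD request per file, so scanning a large bucket is slow.
            s3.head_object(Bucket=assetstore['bucket'], Key=file_doc['s3Key'])
            return True
        except ClientError:
            return False
    return True  # other assetstore types: assume present for this sketch
```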

manthey commented 2 years ago

@dgutman Did I miss anything in our desired feature list here? I recognize that you would like a cron-like task to repeat imports at some point. I think we need hash-matching for that to actually do what we want, and I think it is too risky to ever automate deleting missing items. If we ever cron imports, then we should probably cron checking for missing files and report that somewhere (next to the imports list, maybe?) so that the admin can decide what to do.

Ages ago I was involved in a project where we automatically added and removed files from a database as they came and went on NAS-like devices. Devices with intermittent availability (for instance, across any network) made auto removal very risky.

dgutman commented 2 years ago

This is obviously complicated and potentially expensive in terms of walking gigantic filesystems....

I think an option to "hide" images based on inaccessibility may be reasonable. These images usually still have cached thumbnails, and NFS and other remote assetstores can become disconnected for many reasons, so we obviously don't want to just delete these links. In many cases I still have metadata associated with an item that I may want to retrieve, even if the image is not currently online.

Perhaps the first thing to do is clean up how the DSA responds when it tries (and fails) to access an image; it currently throws errors and/or the server becomes generally unhappy. Similarly, it would be good to use some sort of badge/decorator to mark images that appear to be "disconnected". There are also likely two big cases to differentiate: a single file going missing may merit a "badge" on that image, since it may suggest a single file was moved or deleted, whereas an entire directory going "dark" may need to be handled separately.

Finally, we may want the option to "hide" missing images depending on user class. I imagine in production it may be useful for the admin and/or collection owner to see files that have gone missing, but we may want to hide those files from other classes of users.


manthey commented 2 years ago

The import endpoint supports include/exclude regexes. We don't expose that in the UI (we probably should).
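
For reference, those parameters are already usable from a script even though the UI doesn't expose them. A hedged example using girder_client (parameter names follow the filesystem import endpoint and should be double-checked against the deployed Girder's API docs):

```python
# Hypothetical example of triggering an import with regex filters.
import girder_client

gc = girder_client.GirderClient(apiUrl='https://dsa.example.com/api/v1')
gc.authenticate(apiKey='<api key>')            # placeholder credentials

assetstore_id = '<assetstore id>'              # placeholder ids for illustration
destination_folder_id = '<folder id>'

gc.post(f'assetstore/{assetstore_id}/import', parameters={
    'importPath': '/mnt/slides/batch-42',
    'destinationId': destination_folder_id,
    'destinationType': 'folder',
    'fileIncludeRegex': r'.*\.svs$',           # only index .svs files
    'fileExcludeRegex': r'.*thumbnails.*',     # skip anything under thumbnails
})
```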

manthey commented 2 years ago

It sounds like when we check for missing files, we would just add some chunk of metadata to the file (and possibly to its parent item) that we could remove again if the file comes back. Showing the missing ones could then trivially be done by a virtual folder that matches on that metadata. Since a presence/missing check is likely to be stale by the time we actually try to access something, any action that expects the flag to be one way or the other would have to check again.
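
A sketch of that flag-and-list idea, assuming we tag the parent item's metadata and point a virtual folder at it (the virtual folder field names follow Girder's virtual-folders feature, but should be verified):

```python
# Hypothetical sketch: tag items whose file is missing, and use a virtual
# folder whose query matches that metadata so they can be listed in one place.
import json

from girder.models.folder import Folder
from girder.models.item import Item


def mark_missing(item, missing=True):
    # setMetadata merges keys into item['meta']; passing None for a key
    # deletes it, which makes the flag easy to clear when the file returns.
    Item().setMetadata(item, {'importMissing': True if missing else None})


def make_missing_folder(parent, creator):
    folder = Folder().createFolder(parent, 'Missing imports', creator=creator)
    folder['isVirtual'] = True
    folder['virtualItemsQuery'] = json.dumps({'meta.importMissing': True})
    return Folder().save(folder)
```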

Throwing errors when a file is missing is outside the scope of this plugin (and probably differs in the Girder interface versus the HistomicsUI interface). Let's address what we want to do about that in a different issue.

Leengit commented 2 years ago

I don't know enough about the Girder implementation to know whether this is sufficiently relevant, but just in case it is ... rsync handles both checksum and file-size checking, and it can avoid re-transferring something that has been temporarily absent via its --link-dest flag. The command line to copy a source directory to its newest snapshot, consulting several previous snapshots, looks something like:

```
rsync -a farway:MySource/ 2022-03-01/ --link-dest=2022-02-28/ --link-dest=2022-02-27/ --link-dest=2022-02-26/
```

Although 2022-03-01/ in this example should start as an empty directory, only the files that have changed will be copied there. The rest of the files will be there too, but as hard links to the correspondingly located files in the directories listed via --link-dest, assuming they match on checksum and timestamp. Additionally, because these are hard links, we can delete old dates without losing a file that is also present in a more recent date's directory.

N.B. the last I checked, which was about 10 years ago, there was a limit of maybe 20 --link-dest directories. Also, I don't recall what the defaults are for rsync checking both the timestamp and the checksum; it may be necessary to turn on those checks explicitly.

manthey commented 2 years ago

@Leengit We aren't copying anything in this -- we are just indexing files that exist somewhere -- it could be a filesystem or an S3 bucket or a GridFS server, etc. "Import" is an indexing operation, not a copy operation.

jjnesbitt commented 2 years ago

I've begun work on this here: https://github.com/DigitalSlideArchive/import-tracker

manthey commented 2 years ago

@AlmightyYakob We should move the individual parts of this task to issues on https://github.com/DigitalSlideArchive/import-tracker.

manthey commented 2 years ago

I've moved all the details from this issue to separate issues in https://github.com/DigitalSlideArchive/import-tracker, so I'm closing this issue.