Track how often files are accessed, and by whom

aekiss commented 3 years ago

Just a thought - for the big 0.1deg runs we save a lot of data on request, but it's sometimes unclear how much of it actually gets used or by whom, so it's hard to tell whether some of it could be deleted to save space, or whether some diagnostics could be dropped for future runs.

To assist with managing our storage it could be useful to make querying.getvar log which files are actually getting accessed, e.g. by having the database store the total number of requests for each file, and the date of the most recent request. If the username is accessible, the total count and most recent access could be recorded per-user. The DB could then be queried to find big files that nobody needs anymore.
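A minimal sketch of the per-file, per-user counter described above, using Python's built-in `sqlite3`. The table name `file_access`, its columns, and the helper names are hypothetical and not part of the cookbook; dates are assumed to be ISO-8601 strings so that text comparison matches date order.

```python
import sqlite3

# Hypothetical schema: one row per (file, user) pair.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_access (
    path        TEXT NOT NULL,
    username    TEXT NOT NULL,
    n_requests  INTEGER NOT NULL,
    last_access TEXT NOT NULL,
    PRIMARY KEY (path, username)
)
"""

def record_access(conn, path, username, when):
    """Increment the request count for (path, username) and store the access date."""
    cur = conn.execute(
        "UPDATE file_access SET n_requests = n_requests + 1, last_access = ? "
        "WHERE path = ? AND username = ?",
        (when, path, username),
    )
    if cur.rowcount == 0:
        # First request for this file by this user.
        conn.execute(
            "INSERT INTO file_access (path, username, n_requests, last_access) "
            "VALUES (?, ?, 1, ?)",
            (path, username, when),
        )
    conn.commit()

def unused_since(conn, cutoff):
    """Paths whose most recent recorded access (by any user) is before cutoff."""
    rows = conn.execute(
        "SELECT path FROM file_access GROUP BY path "
        "HAVING MAX(last_access) < ? ORDER BY path",
        (cutoff,),
    )
    return [r[0] for r in rows]
```

Note that `unused_since` can only report files that have at least one recorded access; files nobody has ever requested would need to be found by comparing against the full file index.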

This data could also be useful for documenting the research impact of the cosima datasets, e.g. for grant applications.

angus-g commented 3 years ago

I like this idea! I think maybe it would be better to have the logged accesses stored in a separate database, so we wouldn't have to worry about accidentally nuking the shared (big) database of all experiments. From a technical standpoint, I think sqlite should handle multiple transactional accesses to the same database pretty smoothly, given that the request volumes are pretty small in the grand scheme of things.

aidanheerdegen commented 3 years ago

Permissions is the major issue.

angus-g commented 3 years ago

The stats database could be group-writable, but I guess the worry is people fiddling with it?

aekiss commented 3 years ago

if we want to track this per-user we could create a separate stats db for each user as needed, and just give them write permissions for their own one?

aidanheerdegen commented 3 years ago

If it can be done I agree it would be great to have.

It does assume that everyone uses the cookbook to access the data. I'd like to think that's the case, but old habits die hard and some people prefer to do it the way they're used to. It would be good to know if that wasn't the case.

Permissions issues can be around people fiddling, or corrupting the database in some way. Paola has tried something similar with a basic log file for CLeF queries and it has worked, but it can also create issues when a new log file is created. That is probably not an issue here if the DB persists at the same path.

aekiss commented 3 years ago

yes there's always the possibility that users will access data by some other method, but at least it will show us what data should definitely be retained...

angus-g commented 3 years ago

I'll note that since we'd be logging to a sqlite DB, it would be quite robust -- write-ahead transactions and such make it atomically handle concurrent writes without corruption, etc. But it is true that it's only useful if people are using the cookbook in the first place!
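For reference, write-ahead logging is not SQLite's default journal mode and has to be switched on per database. A minimal sketch, assuming a hypothetical stats DB path and helper name; the `timeout` makes a writer wait for a lock instead of failing immediately:

```python
import sqlite3

def open_stats_db(path, timeout=10.0):
    """Open a (hypothetical) stats database and request write-ahead logging.

    Returns the connection and the journal mode SQLite actually selected,
    since the PRAGMA reports the mode in effect rather than raising on failure.
    """
    conn = sqlite3.connect(path, timeout=timeout)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode
```

One thing to check before relying on this: the SQLite documentation states that WAL mode does not work over a network filesystem, because all processes using the database must share memory on the same host. Whether that applies to wherever the stats DB would live is an open question here.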

> if we want to track this per-user we could create a separate stats db for each user as needed, and just give them write permissions for their own one?

If we need to go this route, maybe /tmp-like permissions (u+w g+r) would be enough to stop people accidentally breaking things.

aidanheerdegen commented 3 years ago

If you think it will work with sqlite @angus-g then give it a burl and see. Definitely wrap any DB access in a try/except block though, so failure to log doesn't prevent users from accessing the Cookbook.
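A sketch of what that try/except wrapping could look like. The function names, the `access_log` table, and the `loader` argument are all hypothetical stand-ins, not the cookbook's actual internals:

```python
import logging
import sqlite3

logger = logging.getLogger("access_stats")  # hypothetical logger name

def log_access_safely(db_path, path, variable, username):
    """Try to record one access. Never raises; returns True only if a row was written."""
    try:
        conn = sqlite3.connect(db_path, timeout=1.0)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS access_log ("
                    "ts TEXT DEFAULT CURRENT_TIMESTAMP, "
                    "username TEXT, path TEXT, variable TEXT)"
                )
                conn.execute(
                    "INSERT INTO access_log (username, path, variable) VALUES (?, ?, ?)",
                    (username, path, variable),
                )
        finally:
            conn.close()
        return True
    except Exception as exc:
        # Missing directory, no write permission, locked DB, etc.
        logger.debug("access logging skipped: %s", exc)
        return False

def load_with_logging(loader, db_path, path, variable, username):
    """Log the request if possible, then load the data regardless of the outcome."""
    log_access_safely(db_path, path, variable, username)
    return loader(path, variable)
```

The short `timeout` is a design choice for this sketch: a user waiting on data should not be held up for long by a busy stats DB.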

access-hive-bot commented 2 years ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/priorities-for-large-msu-experiments/123/18

access-hive-bot commented 1 year ago

This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:

https://forum.access-hive.org.au/t/track-how-recently-files-were-accessed-via-cookbook/391/1

aekiss commented 1 year ago

This came up again at the COSIMA meeting yesterday. It would be a good capability to have. Maybe it's something ACCESS-NRI could help with?

aekiss commented 6 months ago

Having this capability is becoming ever more pressing, so we can better manage our growing pile of data. Any thoughts on how we can have it implemented?

Ideally we'd have something that logs all data access via the Cookbook or Intake, storing something like

- file path 1
   - variable 1
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
   - variable 2
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
- file path 2
   - variable 1
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
   - variable 2
      - username 1
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      - username 2
         - total number of requests for this variable in this file by this user
         - date of most recent access of this variable in this file by this user
      ...
...

aidanheerdegen commented 6 months ago

I'd say you need less structure, otherwise you have to query the DB to find existing information, and then update the record.

So something much more like an activity log, and pull out the structured information through queries.
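To illustrate the activity-log approach: writes are plain appends with no lookup, and the per-file, per-variable, per-user totals come out of a `GROUP BY` at query time. The table layout, usernames, and paths below are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Append-only log: one row per request, nothing is ever updated in place.
conn.execute("CREATE TABLE access_log (ts TEXT, username TEXT, path TEXT, variable TEXT)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?, ?, ?)",
    [
        ("2024-01-05", "alice", "/data/run1/ocean.nc", "temp"),
        ("2024-02-10", "alice", "/data/run1/ocean.nc", "temp"),
        ("2024-01-20", "bob", "/data/run1/ocean.nc", "temp"),
        ("2024-03-01", "bob", "/data/run1/ocean.nc", "salt"),
    ],
)

# Structured view derived on demand: request count and latest access
# for each (file, variable, user) combination.
summary = conn.execute(
    """
    SELECT path, variable, username, COUNT(*) AS n_requests, MAX(ts) AS last_access
    FROM access_log
    GROUP BY path, variable, username
    ORDER BY path, variable, username
    """
).fetchall()

for row in summary:
    print(row)
```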

Then the question is what actions do you want to log, and what information from that action?

aidanheerdegen commented 6 months ago

Yes this is something ACCESS-NRI wants to do, but you know, time and resources ....

aekiss commented 6 months ago

It would also be helpful if it was possible for cookbook users to query the DB to find out who is using a given dataset. For example:

- to find out who else is working on related problems (to facilitate collaboration and reduce the chances of project duplication)
- so a contributor of shared data can find out who is using their contribution (this could encourage data sharing by being motivating in itself and also relieving fears of having their contribution exploited without their knowledge or suitable credit)
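A "who is using this dataset" lookup could be a small query over the logged accesses. This sketch assumes a hypothetical `access_log` table with `ts`, `username`, `path`, and `variable` columns, and treats a dataset as everything under a path prefix:

```python
import sqlite3

def users_of(conn, path_prefix):
    """Return (username, n_requests, last_access) for files under path_prefix."""
    return conn.execute(
        "SELECT username, COUNT(*), MAX(ts) FROM access_log "
        # substr comparison avoids LIKE wildcards (% and _) in real paths
        "WHERE substr(path, 1, length(?)) = ? "
        "GROUP BY username ORDER BY username",
        (path_prefix, path_prefix),
    ).fetchall()

# Made-up rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_log (ts TEXT, username TEXT, path TEXT, variable TEXT)")
conn.executemany(
    "INSERT INTO access_log VALUES (?, ?, ?, ?)",
    [
        ("2024-01-05", "alice", "/data/run1/ocean.nc", "temp"),
        ("2024-02-10", "alice", "/data/run1/ice.nc", "aice"),
        ("2024-01-20", "bob", "/data/run2/ocean.nc", "temp"),
    ],
)
print(users_of(conn, "/data/run1/"))
```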

aekiss commented 6 months ago

> Then the question is what actions do you want to log, and what information from that action?

Log each variable requested in each .nc file accessed via calls to cc.querying.getvar, recording

- username
- date
- .nc file path
- variable
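A sketch of how one such record could be assembled at request time. `AccessRecord` and `make_record` are hypothetical names; `getpass.getuser()` is one standard-library way to get the login name, and it can fail in unusual environments, hence the fallback:

```python
import getpass
import os
from datetime import datetime, timezone
from typing import NamedTuple

class AccessRecord(NamedTuple):
    username: str
    date: str      # ISO-8601 timestamp, UTC
    path: str      # .nc file path
    variable: str

def make_record(path, variable, username=None, now=None):
    """Build one log record for a single variable requested from a single file."""
    if username is None:
        try:
            username = getpass.getuser()
        except Exception:
            username = "unknown"  # no login name available in this environment
    if now is None:
        now = datetime.now(timezone.utc)
    return AccessRecord(username, now.isoformat(), os.fspath(path), variable)
```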

This would make enormous log files, which is why I suggested condensing it by storing only the total number of accesses by each user and the date of most recent access.

It would also be nice to be able to look up username to get real name and email address. Not sure if that's possible.
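On Unix systems the account database sometimes carries a real name in the GECOS field, which Python exposes through the standard `pwd` module. Whether that field is populated on a given system is an assumption, and email addresses are generally not stored there, so this sketch only covers the name:

```python
import pwd  # Unix-only standard-library module

def real_name(username):
    """Return the real name from the account's GECOS field, or None if unavailable."""
    try:
        gecos = pwd.getpwnam(username).pw_gecos
    except KeyError:
        return None  # no such account
    # GECOS is conventionally comma-separated, with the full name first.
    name = gecos.split(",")[0].strip()
    return name or None
```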

rmholmes commented 6 months ago

Just noting that most of the time I don't use cc.querying.getvar to access data from runs on ik11 etc. I usually just use xr.open_mfdataset on the raw files. I'm not sure how many others are in the same boat.

aekiss commented 6 months ago

Thanks @rmholmes, good point. I suspect most usage is via the cookbook, but we don't actually know.

We can't hope to cover every access method, but any info on usage is better than none - e.g. we can be sure we shouldn't delete data if it's actively used via the cookbook, but if there are no cookbook users we'd need to ask around whether anyone is accessing data via some other method before deleting it.

aekiss commented 5 months ago

For the purposes of cleanup, we can identify files that haven't been recently accessed using the -atime option of find, e.g. https://forum.access-hive.org.au/t/g-data-ik11-cleanup/2153/13, but it would be much more useful to know who accessed them.
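For anyone who would rather do the access-time check from Python, here is a rough equivalent that also reports file sizes. The function name is hypothetical, and the cutoff is a simple age in days rather than an exact match for how `find -atime` rounds. Access times are only as reliable as the filesystem's mount options (e.g. `noatime` or `relatime`) allow.

```python
import os
import time

def not_accessed_since(root, days, now=None):
    """Return sorted (path, size_in_bytes) for files under root whose
    last access time is more than `days` days before `now`."""
    if now is None:
        now = time.time()
    cutoff = now - days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # unreadable or vanished file; skip it
            if st.st_atime < cutoff:
                stale.append((full, st.st_size))
    return sorted(stale)
```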

COSIMA / cosima-cookbook

Track how often files are accessed, and by whom #231