aekiss opened this issue 3 years ago
I like this idea! I think maybe it would be better to have the logged accesses stored in a separate database, so we wouldn't have to worry about accidentally nuking the shared (big) database of all experiments. From a technical standpoint, I think sqlite should handle multiple transactional accesses to the same database pretty smoothly, given that the request volumes are pretty small in the grand scheme of things.
Permissions are the major issue.
The stats database could be group-writable, but I guess the worry is people fiddling with it?
If we want to track this per-user we could create a separate stats DB for each user as needed, and just give them write permissions for their own one?
If it can be done I agree it would be great to have.
It does assume that everyone used the cookbook to access the data. I'd like to think that was the case, but sometimes old habits die hard and some people prefer to do it the way they're used to. It would be good to know whether or not that's happening.
Permissions issues can be around people fiddling, or corrupting the database in some way. Paola has tried something similar with a basic log file for CLeF queries and it has worked, but it can also create issues when a new log file is created. That is probably not an issue here if the DB persists at the same path.
Yes, there's always the possibility that users will access data by some other method, but at least it will show us what data should definitely be retained...
I'll note that since we'd be logging to an SQLite DB, it would be quite robust: write-ahead logging and the like let it handle concurrent writes atomically, without corruption. But it is true that it's only useful if people are using the cookbook in the first place!
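For concreteness, a minimal sketch of what that setup could look like with the standard-library `sqlite3` module (the database path here is just a placeholder):

```python
import sqlite3

# Placeholder path for a separate access-log database, distinct from the main cookbook DB.
conn = sqlite3.connect("/g/data/ik11/cookbook_access_log.db", timeout=30)

# Write-ahead logging lets concurrent readers and a writer coexist safely, and the busy
# timeout makes a writer wait briefly rather than erroring out when the DB is locked.
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=30000")
```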
> If we want to track this per-user we could create a separate stats DB for each user as needed, and just give them write permissions for their own one?
If we need to go this route, maybe `/tmp`-like permissions (`u+w g+r`) would be enough to stop people accidentally breaking things.
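If the per-user route were taken, a rough sketch of what it could look like (the directory path is hypothetical; `0o640` corresponds to the `u+w g+r` idea above, i.e. owner read/write, group read-only):

```python
import getpass
import os
import sqlite3

# Hypothetical directory holding one stats database per user.
STATS_DIR = "/g/data/ik11/cookbook_stats"

def open_user_stats_db():
    """Open (creating if necessary) the calling user's own stats database."""
    path = os.path.join(STATS_DIR, f"{getpass.getuser()}.db")
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # also ensures the file exists on disk
    # Owner read/write, group read-only, no access for others.
    os.chmod(path, 0o640)
    return conn
```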
If you think it will work with sqlite @angus-g then give it a burl and see. Definitely wrap any DB access in a `try`/`except` block though, so failure to log doesn't prevent users from accessing the Cookbook.
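Something along these lines, perhaps (the `access_log` table and column names are placeholders, not an agreed schema):

```python
import logging
import sqlite3

def log_access(conn, ncfile, variable, username):
    """Best-effort logging: report failures, but never raise them to the caller."""
    try:
        with conn:  # the connection context manager commits, or rolls back on error
            conn.execute(
                "INSERT INTO access_log (ncfile, variable, username, accessed) "
                "VALUES (?, ?, ?, datetime('now'))",
                (ncfile, variable, username),
            )
    except (sqlite3.Error, OSError) as err:
        # A failure to log must never stop anyone from accessing the Cookbook.
        logging.warning("Could not record cookbook access statistics: %s", err)
```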
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/priorities-for-large-msu-experiments/123/18
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there:
https://forum.access-hive.org.au/t/track-how-recently-files-were-accessed-via-cookbook/391/1
This came up again at the COSIMA meeting yesterday. It would be a good capability to have. Maybe it's something ACCESS-NRI could help with?
Having this capability is becoming ever more pressing, so we can better manage our growing pile of data. Any thoughts on how we can get it implemented?
Ideally we'd have something that logs all data access via the Cookbook or Intake, storing something like the following (see the schema sketch after this list):
- file path 1
  - variable 1
    - username 1
      - total number of requests for this variable in this file by this user
      - date of most recent access of this variable in this file by this user
    - username 2
      - total number of requests for this variable in this file by this user
      - date of most recent access of this variable in this file by this user
    - ...
  - variable 2
    - username 1
      - total number of requests for this variable in this file by this user
      - date of most recent access of this variable in this file by this user
    - username 2
      - total number of requests for this variable in this file by this user
      - date of most recent access of this variable in this file by this user
    - ...
- file path 2
  - variable 1
    - username 1
      - total number of requests for this variable in this file by this user
      - date of most recent access of this variable in this file by this user
    - username 2
      - total number of requests for this variable in this file by this user
      - date of most recent access of this variable in this file by this user
    - ...
  - variable 2
    - username 1
      - total number of requests for this variable in this file by this user
      - date of most recent access of this variable in this file by this user
    - username 2
      - total number of requests for this variable in this file by this user
      - date of most recent access of this variable in this file by this user
    - ...
- ...
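For illustration, here is roughly how that nested structure could collapse into a single table, with the per-user counters updated in place via SQLite's upsert syntax (available in SQLite 3.24+; all names are placeholders):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS access_stats (
    ncfile      TEXT NOT NULL,              -- file path
    variable    TEXT NOT NULL,
    username    TEXT NOT NULL,
    requests    INTEGER NOT NULL DEFAULT 1, -- total requests by this user
    last_access TEXT NOT NULL,              -- date of most recent access
    PRIMARY KEY (ncfile, variable, username)
);
"""

UPSERT = """
INSERT INTO access_stats (ncfile, variable, username, requests, last_access)
VALUES (?, ?, ?, 1, datetime('now'))
ON CONFLICT (ncfile, variable, username)
DO UPDATE SET requests = requests + 1, last_access = datetime('now');
"""

def record_access(conn, ncfile, variable, username):
    """Create the table if needed, then bump this user's counter for this file/variable."""
    conn.execute(SCHEMA)
    with conn:
        conn.execute(UPSERT, (ncfile, variable, username))
```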
I'd say you need less structure, otherwise you have to query the DB to find existing information, and then update the record.
So something much more like an activity log, and pull out the structured information through queries.
Then the question is: what actions do you want to log, and what information from each action?
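For comparison, a flat activity log could be as simple as one row per action, with the structured summary above recovered by a query rather than stored (names again hypothetical):

```python
LOG_SCHEMA = """
CREATE TABLE IF NOT EXISTS access_events (
    ncfile   TEXT NOT NULL,
    variable TEXT NOT NULL,
    username TEXT NOT NULL,
    accessed TEXT NOT NULL DEFAULT (datetime('now'))
);
"""

# Pull the condensed per-file/variable/user summary back out of the raw event log.
SUMMARY_QUERY = """
SELECT ncfile, variable, username,
       COUNT(*)      AS requests,
       MAX(accessed) AS last_access
FROM access_events
GROUP BY ncfile, variable, username;
"""
```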
Yes this is something ACCESS-NRI wants to do, but you know, time and resources ....
It would also be helpful if it were possible for cookbook users to query the DB to find out who is using a given dataset, for example to check with them before anything is moved or deleted.
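With the hypothetical `access_events` table sketched above, a small helper could answer that directly:

```python
def who_uses(conn, path_prefix):
    """List users who (according to the log) have accessed any file under path_prefix."""
    rows = conn.execute(
        "SELECT DISTINCT username FROM access_events WHERE ncfile LIKE ? || '%'",
        (path_prefix,),
    )
    return sorted(username for (username,) in rows)
```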
> Then the question is: what actions do you want to log, and what information from each action?
Log each variable requested in each `.nc` file accessed via calls to `cc.querying.getvar`, recording the file path, variable, username and time of each access.
This would make enormous log files, which is why I suggested condensing it by storing only the total number of accesses by each user and the date of most recent access.
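A thin wrapper would be one low-effort way to capture this without touching the cookbook itself. In this sketch, `record_access` is the hypothetical helper from above, the usual `import cosima_cookbook as cc` is assumed, and I'm assuming (without checking) that the experiment and variable are `getvar`'s first two arguments:

```python
import getpass
import sqlite3

import cosima_cookbook as cc

conn = sqlite3.connect("/g/data/ik11/cookbook_access_log.db")  # placeholder path

def logged_getvar(expt, variable, *args, **kwargs):
    """Call cc.querying.getvar, recording the request first (best effort only)."""
    try:
        # Only the experiment name is visible here, as a stand-in for the file path;
        # the resolved .nc paths are only known inside getvar itself, so logging in
        # there (rather than in a wrapper) would capture the per-file detail.
        record_access(conn, expt, variable, getpass.getuser())
    except Exception:
        pass  # logging problems must never block data access
    return cc.querying.getvar(expt, variable, *args, **kwargs)
```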
It would also be nice to be able to look up username to get real name and email address. Not sure if that's possible.
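Not guaranteed, but on a typical Linux system the real name (though usually not an email address) can often be pulled from the GECOS field of the password database:

```python
import pwd

def real_name(username):
    """Best-effort lookup of a user's real name from the passwd GECOS field."""
    try:
        return pwd.getpwnam(username).pw_gecos.split(",")[0] or None
    except KeyError:
        return None  # unknown (e.g. deleted) user
```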
Just noting that most of the time I don't use `cc.querying.getvar` to access data from runs on `ik11` etc. I usually just use `xr.open_mfdataset` on the raw files. I'm not sure how many others are in the same boat.
Thanks @rmholmes, good point. I suspect most usage is via the cookbook, but we don't actually know.
We can't hope to cover every access method, but any info on usage is better than none - e.g. we can be sure we shouldn't delete data if it's actively used via the cookbook, but if there are no cookbook users we'd need to ask around whether anyone is accessing data via some other method before deleting it.
For the purposes of cleanup, we can identify files that haven't been recently accessed using the `-atime` option of `find`, e.g. https://forum.access-hive.org.au/t/g-data-ik11-cleanup/2153/13, but it would be much more useful to know who accessed them.
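The same sort of atime sweep can also be scripted, e.g. something like this rough sketch (the threshold is arbitrary, and it assumes the filesystem actually updates access times):

```python
import os
import time

def not_recently_read(root, days=365):
    """Yield files under root whose access time is older than `days` days."""
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except OSError:
                pass  # vanished or unreadable file
```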
Just a thought - for the big 0.1deg runs we save a lot of data on request, but it's sometimes unclear how much of it actually gets used or by whom, so it's hard to tell whether some of it could be deleted to save space, or whether some diagnostics could be dropped for future runs.
To assist with managing our storage it could be useful to make `querying.getvar` log which files are actually getting accessed, e.g. by having the database store the total number of requests for each file, and the date of the most recent request. If the username is accessible, the total count and most recent access could be recorded per-user. The DB could then be queried to find big files that nobody needs anymore.

This data could also be useful for documenting the research impact of the COSIMA datasets, e.g. for grant applications.
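With something like the hypothetical `access_stats` table sketched above, the "what can we delete?" question would then reduce to a query plus a size check, roughly:

```python
import os

STALE_QUERY = """
SELECT ncfile, MAX(last_access) AS last_access, SUM(requests) AS total_requests
FROM access_stats
GROUP BY ncfile
HAVING MAX(last_access) < datetime('now', '-1 year');
"""

def stale_big_files(conn, min_size_bytes=10 * 1024**3):
    """Files with no cookbook requests in the last year, above a size threshold."""
    for ncfile, last_access, total_requests in conn.execute(STALE_QUERY):
        try:
            if os.path.getsize(ncfile) >= min_size_bytes:
                yield ncfile, last_access, total_requests
        except OSError:
            continue  # already gone, or unreadable
```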