JiscSD / rdss-archivematica-automation-tools

Archivematica automation tools

synchronisation of automation tools and copying of datasets by NextCloud #5

Open mjaddis opened 6 years ago

mjaddis commented 6 years ago

If a user copies a dataset into the automated transfer folder, i.e. the one that is being 'watched' by the automation tools, and if the copy takes a long time to complete (minutes or more) then the automation tools can end up initiating a Transfer into Archivematica before the dataset is complete. The user has no idea that only part of their dataset has been preserved. This is bad.

It would be much better if, for example, the automation tools checked that there had been no file activity in a candidate transfer folder for, say, the last 10 minutes. The automation tools would start the transfer only if all the files were closed and had not been modified for a while. This isn't a bulletproof solution, but it would be a lot better than how the tools work at the moment.
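The "no recent file activity" check could be sketched roughly as follows. This is only an illustration, not code from the automation tools: the names `latest_mtime` and `is_quiescent` and the 600-second default are assumptions, and it looks only at modification times, so it does not detect files that are still open.

```python
import os
import time


def latest_mtime(folder):
    """Return the newest modification time found anywhere under folder."""
    newest = os.stat(folder).st_mtime
    for root, dirs, files in os.walk(folder):
        for name in dirs + files:
            try:
                entry_mtime = os.lstat(os.path.join(root, name)).st_mtime
            except FileNotFoundError:
                # Entry vanished mid-scan, so the folder is still changing.
                return time.time()
            newest = max(newest, entry_mtime)
    return newest


def is_quiescent(folder, quiet_seconds=600, now=None):
    """True if nothing under folder was modified in the last quiet_seconds."""
    now = time.time() if now is None else now
    return now - latest_mtime(folder) >= quiet_seconds
```

A caller would skip any candidate folder for which `is_quiescent(path)` is false and try again on the next run. Copy tools that preserve original timestamps can make a folder look older than it is, so this remains a heuristic.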

mjaddis commented 6 years ago

Corresponding JIRA issue: https://jiscdev.atlassian.net/browse/RDSSARK-267

jeremysparks86 commented 6 years ago

3 possible solutions:

Add a Pre-Transfer area instead, so that datasets can be moved in all at once, initiating an inode change.

Add some kind of IF statement to the transfer-script.sh file so that it will only run the Python script if a condition is met about the data which has recently been deposited there. This could be doable.

Do some enhancement work on the automation tools libraries themselves to make them smarter. This would need to be re-assigned as I'm not a Python developer.
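The first option, a Pre-Transfer area, relies on a rename within one filesystem being a single step, so the watched folder never sees a half-copied dataset. A minimal sketch, assuming the staging and watched directories sit on the same filesystem (`promote` and its parameters are hypothetical names, not part of the automation tools):

```python
import os


def promote(staging_dir, watched_dir, name):
    """Move a finished dataset from staging into the watched folder in one step."""
    source = os.path.join(staging_dir, name)
    target = os.path.join(watched_dir, name)
    if os.path.exists(target):
        # Refuse to overwrite or merge into an existing dataset.
        raise FileExistsError(target)
    # os.rename is a single operation when both paths are on the same
    # filesystem; across filesystems it raises OSError instead of copying.
    os.rename(source, target)
    return target
```

Something would still have to decide when a dataset in staging is complete, so this option moves the problem rather than removing it unless the copy into staging is itself signalled as finished.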

sevein commented 6 years ago

This is an interesting feature available in Nextcloud: https://docs.nextcloud.com/server/10/developer_manual/app/hooks.html. Presumably hooks could be used to run some code when the copy is complete, using postCopy(\OCP\Files\Node $source, \OCP\Files\Node $target).

mamedin commented 6 years ago

Another possible solution:

We can use a new CIFS share (nfs-automated) that the Nextcloud service should use for the automated directory. This share will force directories to be created with an owner and group other than archivematica (for example maml) and with permissions 0770.

[nfs-automated]
comment = Automated_Data
path = /mnt/nfs/automated
read only = No
guest account = maml
guest ok = Yes
force user = maml
force group = maml
usershare allow guests = yes
writeable = yes
browseable = no
create mask = 0660
directory mask = 0770
force create mode = 0660
force directory mode = 0770

The script below can be run as a cron job on the NFS server. It changes the ownership of a directory to archivematica once that directory has no files in use:

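#!/bin/bash
# Print each command as it runs (useful in cron logs).
set -x
# Visit each top-level dataset directory still owned by maml.
while IFS= read -r -d '' dir; do
    # lsof +D prints nothing when no file under the directory is open.
    if [ "$(lsof +D "${dir}" | wc -l)" -eq 0 ]; then
        sudo chown -R archivematica:archivematica "${dir}"
    fi
done < <(find /mnt/nfs/automated/ -maxdepth 1 -mindepth 1 -type d -user maml -print0)

mjaddis commented 6 years ago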

In the Jisc deployments, I think the NFS is set up by Jisc and is shared with Willow/Haplo or anything else that needs persistent storage. I'm not sure we have the ability to install software on the NFS server, including scripts or cron jobs. We'd need to check with Jisc.

We also need to keep in mind what would happen if Jisc moved to AWS EFS (the only reason we're not using it already is that it's not currently available in AWS London) or moved from AWS to UKCloud (scheduled for this quarter).

Therefore, a better solution in my view is to either modify the automation tools / pre-transfer scripts to check for no recent file activity in a candidate transfer (my original suggestion) or to use hooks in NextCloud or a NextCloud plugin (as discussed on the call yesterday and noted above by @sevein).