catalyst / moodle-tool_objectfs

Object file storage system for Moodle
https://moodle.org/plugins/tool_objectfs

Improve support for "existing Snapshot" use case #443

Open aspark21 opened 2 years ago

aspark21 commented 2 years ago

The use cases defined in the readme include use of old files and re-use in multiple environments. This currently works well for prod, for a snapshot copy of prod taken once all the files are already in S3, and for all sorts of testing instances which can be given read-only access to S3 to get a fully working instance.

We are looking at applying this to a very similar use case, but one which brings some quirks: existing historical snapshots which are 100% on local disk, but whose files are already in S3 for our prod (i.e. snapshots taken before prod went cloud/S3).

That means it doesn't need to upload files, just go looking to see whether they already exist.

What we'd essentially like to do is:

- give the snapshots read-only credentials to the prod S3 bucket
- enable objectfs to find the local files which exist in S3
- delete the local copies
- don't upload files which are only local to the prod S3
- optionally, upload files which don't exist in the prod S3 into a separate S3 bucket (so we can delete those once we do away with those older snapshots)
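
As a side note, a quick way to sanity-check that the credentials handed to a snapshot really are read-only, before pointing objectfs at the prod bucket, could be something like the following with the AWS CLI (the bucket name and profile here are placeholders):

```sh
# Listing with the read-only profile should succeed.
aws s3 ls s3://prod-moodle-objectfs/ --profile snapshot-readonly | head

# A write attempt should be rejected (AccessDenied) if the credentials are truly read-only.
echo test > /tmp/objectfs-write-test.txt
aws s3 cp /tmp/objectfs-write-test.txt s3://prod-moodle-objectfs/objectfs-write-test.txt \
  --profile snapshot-readonly && echo "WARNING: write succeeded" || echo "write blocked as expected"
```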

We have 6 snapshots, with the same files existing multiple times; in total they currently use 25TB of storage, so this would be a massive space/cost saver.

While it feels from the naming of the various tasks that this is not quite how objectfs would behave, it might actually already be there. We can pick up in WRMS what would need developing/funding to get there.

aspark21 commented 2 years ago

Quick update: we had thought about this previously and now I recall what the concern was. The setting enabletasks needs to be enabled for any of the tasks to run, but by default that runs all of the tasks, the upload and pull ones included. That doesn't make sense if we don't want to start uploading files to S3, since they are already either local & external or local-only.

Which is why we hadn't set enabletasks on the "snapshot copy of prod once all the files are already in S3" use case.

Likewise in this use case, we only want to run check_objects_location initially (and generate_status_report to get an overview of the status), and only once everything is fully checked start running delete_local_objects.

After some actual digging this time, what we'd do is:

1) Disable all of the objectfs scheduled tasks, in particular:
   - \tool_objectfs\task\push_objects_to_storage
   - \tool_objectfs\task\pull_objects_from_storage
   - \tool_objectfs\task\delete_local_objects
   - \tool_objectfs\task\delete_local_empty_directories
   - \tool_objectfs\task\recover_error_objects

2) Enable tasks to run: tool_objectfs | enabletasks => Yes

3) Run the \tool_objectfs\task\check_objects_location task, either by having it run with the site cron when that is active, or by running the task on its own cron via php admin/cli/scheduled_task.php --execute='\tool_objectfs\task\check_objects_location' (see the crontab sketch after this list).

And a daily cron entry for php admin/cli/scheduled_task.php --execute='\tool_objectfs\task\generate_status_report'.

4) Start local deletions once the checks complete, either by having the task run with the site cron when that is active, or by running the task on its own cron via php admin/cli/scheduled_task.php --execute='\tool_objectfs\task\delete_local_objects'.

5) Once the shared data is cleared from the local filedir, essentially leave things as they are: the deletion task will ensure we only keep requested files temporarily, and disk usage doesn't grow back.
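
For reference, a rough sketch of what the dedicated cron entries for steps 3 and 4 could look like on one of the snapshot hosts. The cron user, PHP path, Moodle dirroot and timings are placeholders rather than anything objectfs prescribes, and the delete entry stays commented out until the location checks have finished:

```
# /etc/cron.d/moodle-objectfs-snapshot -- sketch only; adjust user, PHP path and
# Moodle dirroot (/var/www/moodle here) for the actual snapshot host.

# Step 3: check object locations hourly while the backfill is in progress.
0 * * * *  www-data /usr/bin/php /var/www/moodle/admin/cli/scheduled_task.php --execute='\tool_objectfs\task\check_objects_location'

# Step 3: refresh the status report once a day for an overview of progress.
30 2 * * * www-data /usr/bin/php /var/www/moodle/admin/cli/scheduled_task.php --execute='\tool_objectfs\task\generate_status_report'

# Step 4: uncomment once the checks are complete to start the local deletions.
#15 3 * * * www-data /usr/bin/php /var/www/moodle/admin/cli/scheduled_task.php --execute='\tool_objectfs\task\delete_local_objects'
```

These are just the same commands from steps 3 and 4 above, scheduled so they keep running on instances where the main site cron is not active.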

So I think our core use case for "existing historical snapshots which are 100% on local disk but for which the files are already in S3 for our prod (i.e. snapshot taken before prod went cloud/S3)" is already covered; it just needs configuring appropriately.

If it would be of use, I could try and write that up into the documentation for others to refer to in future.

The one bit that isn't covered is this:

- optionally, upload files which don't exist in the prod S3 into a separate S3 bucket (so we can delete those once we do away with those older snapshots)

Which we can probably come back to once we've reduced disk usage with the above.

That kind of made me wonder about a future where files are in one storage and there is a desire to move them into a different storage engine, for whatever reason, e.g. moving from one cloud to another. The mechanics are there, but both of those features would be quite big things to implement.

brendanheywood commented 2 years ago

@aspark21 have you actually tested just leaving all of the crons enabled and seeing what happens? The code should (I think!) be robust enough to just heal itself and correct all the metadata:

https://github.com/catalyst/moodle-tool_objectfs/blob/MOODLE_310_STABLE/classes/local/store/object_file_system.php#L276-L282

https://github.com/catalyst/moodle-tool_objectfs/blob/MOODLE_310_STABLE/classes/local/store/object_file_system.php#L214-L222

I would expect this to work already, but I would also expect it to take somewhat longer to sort itself out than with a bit of intervention.

That said, if it doesn't work: as I was reading the original issue I was coming up in my head with more or less the same steps you came up with in the second comment, so that all seems good to me. The main downside I'd say is how long it would take.

Another thing that's worth investigating is a completely separate new process, maybe a task but more likely just a CLI, which, instead of starting at the Moodle DB looking for records it doesn't know about in S3, starts at the source and iterates directly over the S3 bucket listing of file hashes; it could then pick up the same logic, upsert the objectfs metadata and delete the local files in batches. I'd guess this could be a whole ton quicker, as you don't really care about the pre-state of the Moodle DB so much as the end state.

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
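
For what it's worth, that bucket-first listing could be prototyped outside Moodle with the AWS CLI before writing a new task/CLI. A rough sketch, assuming the object keys are (or contain) the Moodle contenthashes, with placeholder bucket and profile names:

```sh
# Dump every object key in the prod bucket to a file; the AWS CLI paginates
# ListObjectsV2 automatically, so this copes with millions of objects.
aws s3api list-objects-v2 \
  --bucket prod-moodle-objectfs \
  --profile snapshot-readonly \
  --query 'Contents[].[Key]' \
  --output text > /tmp/s3-objectfs-keys.txt

wc -l /tmp/s3-objectfs-keys.txt
```

A CLI could then stream that file, upsert the objectfs metadata for each hash and delete the matching files from filedir in batches, which is roughly the process described above.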

From memory we started going down this path for a different use case a while back and then we dropped it.

aspark21 commented 2 years ago

Yes, the \tool_objectfs\task\check_objects_location task essentially just runs that get_object_location_from_hash function.

I increased the batch size to 200 000, which takes about 1h per run to complete. Checking the full site took 26h for 4.8 million files / 5.4TB, which is reasonable (it was run manually through the GUI, as there is no cron running on those instances, so it was spread over a few days). We're currently getting cron entries set up to do the other instances and for the deletion task once each instance has been fully checked.

I'm just being extra cautious, even though I know the S3 creds are read-only.