CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository

Re-queuing and access #1947

Closed dloy closed 3 months ago

dloy commented 3 months ago

ZooKeeper was introduced into Access processing so that large and small content can be created on separate servers. The access consumer analyzes the size of the content to be processed and directs the request to either a large or a small queue for handling.
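As a rough illustration of the size-based routing described above, here is a minimal Python sketch. The threshold value, the `AccessRequest` type, and the queue names `"large"` and `"small"` are assumptions made for this example; the issue does not state the actual cutoff or how the access consumer represents a request.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the real threshold is not stated in this issue.
LARGE_THRESHOLD_BYTES = 1_000_000_000


@dataclass
class AccessRequest:
    token: str        # hypothetical identifier for the request
    total_bytes: int  # total size of the content to be assembled


def choose_queue(request: AccessRequest,
                 threshold: int = LARGE_THRESHOLD_BYTES) -> str:
    """Return the name of the queue that should handle this request."""
    # Content at or above the threshold goes to the large-content server.
    return "large" if request.total_bytes >= threshold else "small"
```

The point of the sketch is only that the queue choice is a pure function of content size, so it can be made before any assembly work starts.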

Startup process:

A separate thread retrieves the queued request and begins the process:

The ZooKeeper queue in this process is used ONLY to direct content to the proper server (small or large). The queue can be locked to prevent content failures during server downtime (e.g. a deploy).

"Re-queuing" was available in Dryad, but ZooKeeper was not involved: the request was simply resubmitted from the beginning.

Re-queuing on ZooKeeper can only work if no processing of the archive file has started, or if the container processing has completed successfully. None of the intermediate states are restartable.
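The restartability rule above can be expressed as a small guard. This is a sketch only: the state names below are hypothetical labels for the three situations described in this comment and are not taken from mrt-zk.

```python
from enum import Enum


class EntryState(Enum):
    # Hypothetical states, named after the situations described above.
    NOT_STARTED = "not_started"                # no archive processing has begun
    IN_PROGRESS = "in_progress"                # any intermediate state
    CONTAINER_COMPLETE = "container_complete"  # container processing succeeded


# Only the two end points are safe to re-queue.
RESTARTABLE = {EntryState.NOT_STARTED, EntryState.CONTAINER_COMPLETE}


def can_requeue(state: EntryState) -> bool:
    """Return True only when re-queuing cannot hit partially written content."""
    return state in RESTARTABLE
```

Under this rule, `can_requeue(EntryState.IN_PROGRESS)` is false, which is the case where resubmitting the initial request is the safer option.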

For the new ZooKeeper handling, all failures on a ZooKeeper entry should be treated as a delete .failure. Alternatively, mark all processing of the entry as .success. No effort should be spent on restarting through ZooKeeper. The Dryad approach works well: allow the initial request to be resubmitted. This avoids possible state problems from collisions on data content, which is a serious issue for S3 content.

dloy commented 3 months ago

Forcing an unlock after both the success() and fail() conditions, using the latest mrt-zk, allowed the admin tool to requeue the ZooKeeper entry.
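The unlock-on-both-paths behavior can be sketched with a `try`/`finally` block. The `QueueEntry` class here is a hypothetical in-memory stand-in written for this example; it is not the mrt-zk API, and the status strings are assumptions.

```python
class QueueEntry:
    """Hypothetical in-memory stand-in for a queue entry (not the mrt-zk API)."""

    def __init__(self, entry_id):
        self.entry_id = entry_id
        self.locked = False
        self.status = "pending"

    def lock(self):
        if self.locked:
            raise RuntimeError("entry is already locked")
        self.locked = True

    def unlock(self):
        self.locked = False


def process_entry(entry, work):
    """Run work(entry) and always release the lock, whatever the outcome."""
    entry.lock()
    try:
        work(entry)
        entry.status = "success"
    except Exception:
        entry.status = "failure"
    finally:
        # Unlock on both paths so an admin tool can later requeue the entry.
        entry.unlock()
    return entry.status
```

Because the unlock sits in `finally`, an entry that failed is left unlocked just like one that succeeded, so a later requeue is not blocked by a stale lock.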

For this to be successful:

This process does not include: