artefactual / archivematica

Free and open-source digital preservation system designed to maintain standards-based, long-term access to collections of digital objects.
http://www.archivematica.org
GNU Affero General Public License v3.0

Problem: start transfer API endpoint is blocking #1139

Closed sevein closed 5 years ago

sevein commented 6 years ago

In RDSS we started developing a new /api/v2beta/package/ API endpoint.

Pull request at JiscRDSS: https://github.com/JiscRDSS/archivematica/pull/66. See also https://github.com/JiscRDSS/archivematica/pull/73, where we reverted the changes in the transfer browser widget due to an incompatibility with Shibboleth that needs further investigation.

joel-simpson commented 6 years ago

I think the new beta package endpoint is a great addition for performance and reliability. Although it had to be reverted in the Jisc production environment due to Shibboleth issues, I tested it quite a bit in the Jisc QA environment.

I also think the asynchronous endpoint to create transfers may be necessary, or at least useful, for the rate-limiting work we are currently looking at for the Jisc project. That work is currently only documented on the wiki, but if we decide to proceed with it, I will create a public issue to document the concerns and rationale.

sevein commented 6 years ago

There was an early attempt in #936 but I'm going to be submitting a new PR.

sromkey commented 6 years ago

@sevein Could you please restate the problem in a way that describes how this affects users? My understanding is that the connected PR removes the user approval step, non-optionally? I think this warrants discussion. (For the record, I'm not opposed to the change, but it is a pretty big one to go through without further user discussion.)

joel-simpson commented 6 years ago

This problem arose during the Jisc project, so the description of the problem and the initial analysis were done elsewhere. I will attempt to describe the problem better:

Problem: Very large datasets can't be successfully transferred into Archivematica. When we selected a very large directory (> 100s of GBs) using the Transfer tab of the Dashboard, the transfers did not start successfully, and the user received no feedback that the operation had failed. We encountered this using a fork of Archivematica 1.6 deployed using Docker containers in an AWS environment.

Analysis of root cause: When a directory is selected in the Transfer tab and the user selects "Start Transfer", the files in the selected Transfer Source Location are copied to the 'currently processing' location. This process relies on calls to the Archivematica and Storage Service APIs, which are synchronous. This means that those APIs "block" further work until their operations are complete.

In the Jisc deployment there is a constraint that all API calls must complete within 1 hour. (In AWS, when an ALB (Application Load Balancer) is used, the maximum timeout setting is 1 hour.) If a dataset took longer than 1 hour to transfer, the process was killed by the ALB. (It is also worth mentioning that file transfers can take much longer in a deployment like this because the deployment is distributed: files are not always moved purely on local machine storage.)
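To illustrate the alternative to a blocking call, here is a minimal sketch of the non-blocking pattern: the start request hands the copy to a background worker and returns an identifier straight away, and the client polls for status instead of holding a connection open. This is not Archivematica code. All names (`start_package`, `get_status`, `wait_for`, `copy_files`) and the status strings are hypothetical, and the "copy" is simulated with a sleep.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory job registry; a real system would persist this.
_executor = ThreadPoolExecutor(max_workers=2)
_jobs = {}


def copy_files(path, delay=0.1):
    """Stand-in for a long-running copy; sleeps instead of moving data."""
    time.sleep(delay)
    return path


def start_package(path, delay=0.1):
    """Non-blocking start: queue the work and return an ID immediately."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = _executor.submit(copy_files, path, delay)
    return job_id


def get_status(job_id):
    """Report the job state so a client can poll for it."""
    future = _jobs[job_id]
    if not future.done():
        return "PROCESSING"
    return "FAILED" if future.exception() else "COMPLETE"


def wait_for(job_id, timeout=5.0, interval=0.01):
    """Poll until the job leaves PROCESSING or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status != "PROCESSING":
            return status
        time.sleep(interval)
    return "PROCESSING"
```

With this shape, each individual request (start or status check) is short, so a load balancer timeout applies to the polling calls rather than to the full duration of the copy.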

So the core problem is really a scalability issue. The secondary problem is that users do not receive useful feedback in the UI (or anywhere) during long-running processes or in the case of failures.

@sromkey also asked about removing the approval step. The short answer is that with the solution being put in place, that step is (I believe) now redundant. But you are right, that does warrant discussion. This is more about the design of the solution (since there was nothing "wrong" or problematic with the Approve Transfer step per se), so I'll put more comments into the PR #1191 to expand on the rationale for that particular optimisation.

joel-simpson commented 6 years ago

@sevein also documented some of his analysis of current state behaviour in this gist.

joel-simpson commented 6 years ago

I am testing this in our prometheus QA environment (using qa/1.x as at 24 July 2018). The PRs have been through code review so I'm not reviewing actual calls to the API endpoints.

I am conducting regression testing to confirm there are no new defects (starting lots of transfers). So far behaviour looks as expected. I will also be conducting performance testing with very large transfers to confirm they work as expected (in progress).

The one functional change to test here is the new "auto_approve" checkbox.

[Screenshot from 2018-07-25 20-10-21]

This is working as expected. When a new transfer is started with the checkbox enabled, transfers will skip the "Approve Transfer" job and move straight to the next job (which varies depending on the transfer type). When the checkbox is not enabled, transfers are started and then wait for user input at the "Approve Transfer" decision point. This gives users an opportunity to review the transfer directly, wherever it is stored.
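The behaviour described above can be sketched as a small state transition. This is only an illustration under assumed names: the `Transfer` class, the `start` and `approve` functions, and the status strings are hypothetical and do not reflect how Archivematica models jobs internally.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    name: str
    auto_approve: bool
    status: str = "NEW"  # hypothetical status values


def start(transfer):
    """Start a transfer, skipping the approval decision if auto_approve is set."""
    if transfer.auto_approve:
        # Checkbox enabled: move straight on to the next job.
        transfer.status = "RUNNING"
    else:
        # Checkbox not enabled: wait at the "Approve Transfer" decision point.
        transfer.status = "AWAITING_APPROVAL"
    return transfer


def approve(transfer):
    """User approves a transfer that is waiting at the decision point."""
    if transfer.status != "AWAITING_APPROVAL":
        raise ValueError("transfer is not awaiting approval")
    transfer.status = "RUNNING"
    return transfer
```

For example, `start(Transfer("demo", auto_approve=False))` leaves the transfer waiting until `approve` is called, which is the window in which a user can review it.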

Once the performance tests are completed I'll move this to 'verified'.

jraddaoui commented 6 years ago

As @joel-simpson said, the checkbox works as expected, and I can't see any reason for it to have a negative performance impact if it's just skipping a job.