archivematica / Issues

Issues repository for the Archivematica project

Problem: time spent running validation tasks is too high #864

Open sevein opened 5 years ago

sevein commented 5 years ago

We've observed in one of our performance testing environments that the accumulated time spent in validation tasks (in both the Transfer and Ingest stages) ranks the highest. For example, using the Prometheus query language, `topk(10, sort_desc(sum(mcpserver_task_duration_seconds_sum{}) by (task_name)))` returns:

| Task name | Time spent (seconds) |
| --- | --- |
| Validate preservation derivatives | 682631.1435899988 |
| Validate formats | 539526.269936003 |
| Normalize for preservation | 496881.06493000267 |
| Identify file format | 301271.4550159982 |
| Characterize and extract metadata | 187699.32913900024 |
| Scan for viruses | 121811.49169099994 |
| Identify file format of metadata files | 76674.5008929998 |
| Assign checksums and file sizes to objects | 54747.83528499984 |
| Transcribe | 52017.63158299977 |
| Assign file UUIDs to objects | 46284.298254999914 |

Presumably, the problem is magnified when we're processing a large number of files and the validation tool used is JHOVE, because of the overhead in the JVM startup time.
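As a rough illustration (not from the original report), the JVM startup cost alone can be gauged on the test machine like this, assuming `java` and `jhove` are on the PATH and using an illustrative module/handler:

```shell
# JVM startup on its own: prints the version and exits, so almost all of the
# elapsed time is startup overhead.
time java -version

# A single JHOVE run on one small file for comparison; the module and handler
# names here are only examples and should match whatever the FPR rule invokes.
time jhove -m PDF-hul -h XML sample.pdf > /dev/null
```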

We had the same problem in characterization via FITS which Misty fixed in Archivematica v1.2 by using Nailgun: https://projects.artefactual.com/issues/5509.

Your environment (version of Archivematica, OS version, etc.): qa/1.x



sallain commented 5 years ago

@sevein something I've been thinking about lately is how to capture performance information here in these issues so that it can be tested against later. My thinking is: if this results in a PR, and in 6 months that work is in a release, how do we replicate your original test so that we can confidently say how much performance has improved?

I don't really have any concrete ideas on this - just dropping it in your ear. If you have thoughts on capturing environment metrics, I'd love to hear them!

sevein commented 5 years ago

That seems like a good idea. I'm going to CC @cole because he's been doing the instrumentation work.

ross-spencer commented 5 years ago

Hi @sallain, I noticed the conversation about this in Slack, but this seems like an okay place to capture some low-tech approaches. Setting aside fine-tuning across client sites and so on, there are a few things that shouldn't hold us back from testing performance improvements in releases with our current processes.

I think the first important thing to register is that our performance changes are usually just code changes: the difference between two commits, or a series of commits.

I think this came up in #315, but we could probably have done more to support how it was tested and/or taken on that responsibility for testing. If the problem was noticed with stable/1.9 at commit 1234567 and fixed in qa/1.x at commit abcdefa, then if we can provision a VM for the qa machine at that commit, we can also switch back to commit 1234567 pretty easily on the same machine to get the comparison result.
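As a sketch, on a compose-based test machine that could look something like this (the repository path and the redeploy step are placeholders for whatever the environment actually uses):

```shell
# Baseline: the commit where the slowdown was observed (1234567 above).
cd src/archivematica
git checkout 1234567
docker-compose up -d --build   # redeploy however this environment normally does
# ...run the control transfer and note the wall-clock time...

# Candidate: the commit with the fix (abcdefa above), on the same VM.
git checkout abcdefa
docker-compose up -d --build
# ...run the same control transfer again and compare...
```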

As I don't believe we're generally trying to optimize Archivematica to the second or at finer granularity, we can understand the impact of a change through wall-clock measurements. We should be able to see significant enough results in the timing for the whole transfer, as well as being able to drill down into specific jobs via the UI to get that data.

Testing like this is, for the most part, what we use our docker environments for (though Cole has been working on adding to our abilities to do more). But we can switch between commits quite easily on our test sites too; we may just need to look at refining that for those testing during a release.

What was quite good in #315 is that there was some sample data, so I think if we can identify a control set in most cases, and if it's public, that permits us to do a lot more in a common way too. Apache Tika uses a regression corpus built from Common Crawl and, interestingly, also govdocs, for which Open Preserve hosts a version too. It's also used by Siegfried.

If our current sample data doesn't cover this use case, we might be able to add to our corpus? And then likely hook it into our existing tests on Jenkins.

I'd need to check with Santi about recording the historical output of the tests on Jenkins. The history of different builds is kept, but I don't know how to get at the detail below, where you'll see there's potential to harness something like the AMAUAT tests, which are a good example of somewhere we already get timing:

```
Tag=@black-box
5 features passed, 0 failed, 20 skipped
10 scenarios passed, 0 failed, 52 skipped
61 steps passed, 0 failed, 558 skipped, 0 undefined
Took 9m28.664s
```

For issues like the above, if we can identify a control dataset, we can create a performance test for Archivematica. We could tag it @performance, or @slow-validation, etc., and then run it in Jenkins on newer and older commits. And with an identically provisioned VM, we'll be able to see and record the differences.
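A sketch of what running such a tagged subset might look like, assuming the suite stays behave-driven and that @performance is the (hypothetical) tag we settle on; the userdata option shown is illustrative and would follow whatever AMAUAT already expects:

```shell
# Run only the scenarios carrying the proposed @performance tag; behave
# prints a "Took ..." summary like the @black-box one quoted above.
behave --tags=@performance --no-skipped \
    -D am_url=http://127.0.0.1:62080
```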

Using the AMAUAT tests for this will get easier with the release process we'll be working towards from #778.

We should definitely be tracking, over time, whether Archivematica is getting faster or slower with new commits. This will be especially important when we introduce new microservices into the workflow. We will of course be watching for integration tests between commits as well; at least, we hope to get there with the new AMAUAT release process.

Additionally, though maybe less useful, we can get a little bit of timing information out of our new API endpoints (the task shown here is a clean-up-names task):

```shell
http --pretty=format \
    GET "http://127.0.0.1:62080/api/v2beta/task/9dd75fb0-c6b0-4339-82a8-60beaae58dc9/" \
    Authorization:"ApiKey test:test"
```

Results in:

```
{
    "duration": 24, <--- this is in seconds currently
    "exit_code": 0,
    "file_name": "4000-0e7ee51c-a9fc-4679-91d3-9b1814a046ea",
    "file_uuid": "",
    "time_created": "2019-08-18T16:12:20",
    "time_ended": "2019-08-18T16:12:44",
    "time_started": "2019-08-18T16:12:20",
    "uuid": "9dd75fb0-c6b0-4339-82a8-60beaae58dc9"
}
```

But because of the way microservices are organized, some are perhaps too granular to be much use yet. Validation on 46,000 files will yield 46,000 data structures like this to pull the information from, which is why it is still easier to do from the database.
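For example, something along these lines pulls the per-job totals straight from the MCP database in one go (a sketch only; the table and column names are from memory and may need adjusting against the actual schema):

```shell
# Sum task durations per job type rather than paging through
# /api/v2beta/task/ once per file.
mysql MCP -e "
  SELECT Jobs.jobType,
         SUM(TIMESTAMPDIFF(SECOND, Tasks.startTime, Tasks.endTime)) AS seconds
  FROM Tasks
  JOIN Jobs ON Jobs.jobUUID = Tasks.jobuuid
  GROUP BY Jobs.jobType
  ORDER BY seconds DESC
  LIMIT 10;"
```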

The data is still there for users to build tooling on, and while these endpoints are in the beta phase we still have lots of opportunity to improve them.

There's perhaps a lack of scientific precision in the results we might see from the above, but my understanding is that we're looking for optimizations in the region of minutes and hours rather than seconds, so I feel these tests and wall-clock timings are still useful, especially in regression tests with known data. A lot of this is possible because we're only making changes in very high-level code, or occasionally deploying different versions of tools. There might be a six-month difference between commits, but we can always switch back to the previous commit on a control environment today (same VM, same OS, same interpreter) to recreate the original test.

It seems, too, that maybe there's a small gap in tooling to help us switch easily on the same VM today, or perhaps a gap in accessible control data. If we have those, we can use some (hopefully) pretty accessible processes to measure the change in performance over time by looking backwards from today (at least, as I anticipate, for 'statistically' significant changes in performance).

sevein commented 5 years ago

An alternative to Nailgun could be to run JHOVE only once for the whole batch, but that would require significant changes to the FPR. See this comment for more.
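A sketch of the difference, assuming jhove is on the PATH and accepts multiple file arguments in one invocation (module, handler, and paths are illustrative):

```shell
# One JVM start per file: roughly what a per-file FPR validation command does today.
time for f in objects/*.pdf; do jhove -m PDF-hul -h XML "$f" > /dev/null; done

# One JVM start for the whole batch: the alternative suggested above.
time jhove -m PDF-hul -h XML objects/*.pdf > /dev/null
```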