dandi / dandi-archive

DANDI API server and Web app
https://dandiarchive.org

Test: "brownian motion" through the website #771

Open yarikoptic opened 3 years ago

yarikoptic commented 3 years ago

Somewhat inspired by how nicely selenium worked out for me for https://github.com/dandi/dandi-api-webshots (its ugly dirty script is here), I thought it would be nice if there was a test which just randomly wanders around the website (without being authenticated, etc.) and checks that every page it visits loads within some reasonable time X and without errors.

I bet there should already be some helper library or tool for that -- does anyone know of one?
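For illustration, a rough sketch of what such a walk could look like with plain selenium (as in the webshots script) -- the step count, link filtering, and timeout here are all just placeholders:

```python
# Rough sketch of a "brownian motion" walk with selenium; STEPS, the link
# filtering, and TIMEOUT are illustrative placeholders.
import random
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

START = "https://dandiarchive.org"
STEPS = 20
TIMEOUT = 30  # seconds; driver.get() raises TimeoutException past this

opts = Options()
opts.add_argument("--headless=new")
driver = webdriver.Chrome(options=opts)
driver.set_page_load_timeout(TIMEOUT)

driver.get(START)
for step in range(STEPS):
    # collect same-site links from the current page, then hop to a random one
    hrefs = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
    hrefs = [h for h in hrefs if h and h.startswith(START)]
    if not hrefs:
        break
    target = random.choice(hrefs)
    t0 = time.monotonic()
    driver.get(target)
    print(f"step {step}: {target} loaded in {time.monotonic() - t0:.1f}s")

driver.quit()
```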

satra commented 3 years ago

we have a subscription to browserstack automate, which may have some settings to do that.

waxlamp commented 2 years ago

If we implement this, I don't think the wandering should be random (we want reproducibility in all of our CI-based tests). I'm also somewhat uneasy with the idea of a test failing because a page took a while to load: if we set X near the time we expect a page to load, we're going to see a lot of spurious failures; if we set X to some high sentinel value, it won't have the specificity to tell us that something is truly wrong.

I would feel better about a test that collects load time statistics and plots them in a dashboard so that we could see any performance regressions, but even there I am not too convinced of how much value that would bring to us for the amount of work involved in writing and maintaining the tests.
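For the record, a minimal sketch of what that statistics collection could look like (the page list and CSV path are made up; a real version would feed whatever dashboard we pick):

```python
# Sketch of collecting load-time statistics for later plotting; the page
# list and CSV path are hypothetical.
import csv
import time
from datetime import datetime, timezone

from selenium import webdriver

PAGES = [  # hypothetical fixed set of pages to time on every run
    "https://dandiarchive.org/",
    "https://dandiarchive.org/dandiset",
]

driver = webdriver.Chrome()
with open("load-times.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for url in PAGES:
        t0 = time.monotonic()
        driver.get(url)
        elapsed = time.monotonic() - t0
        writer.writerow([datetime.now(timezone.utc).isoformat(), url, f"{elapsed:.2f}"])
driver.quit()
```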

Can't we get some of this sort of testing from simply timing the webshots?

yarikoptic commented 2 years ago

> If we implement this, I don't think the wandering should be random (we want reproducibility in all of our CI-based tests).

an ideal system would create/provide as an artifact a log of random steps it took, so it could be "replayed" if needed for troubleshooting.
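e.g., a sketch of how seeding plus a step log could make the walk replayable (all names here are illustrative):

```python
# Sketch: seed the RNG explicitly and dump each step to a JSON log that CI
# keeps as an artifact; rerunning with the logged seed replays the walk.
# All names are illustrative.
import json
import random
import sys

seed = int(sys.argv[1]) if len(sys.argv) > 1 else random.randrange(2**32)
rng = random.Random(seed)

steps: list[str] = []
# ... the walk loop would go here, using rng.choice(hrefs) and appending
# each chosen URL to `steps` ...

with open("walk-log.json", "w") as f:
    json.dump({"seed": seed, "steps": steps}, f, indent=2)
```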

> if we set X to some high sentinel value, it won't have the specificity to tell us that something is truly wrong.

I think if we set it to e.g. 30 seconds and we time out -- we know that something is truly wrong! ;) FWIW, at the moment the longest load time (for the "edit metadata" page) is 16 sec.
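In selenium terms that hard cap could be the page-load timeout, something like (a sketch, not tied to any existing script):

```python
# Sketch of the hard-cap idea: any page load over 30 s fails the test outright.
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.remote.webdriver import WebDriver


def visit_or_fail(driver: WebDriver, url: str, limit: float = 30.0) -> None:
    """Load `url`, treating anything slower than `limit` seconds as a failure."""
    driver.set_page_load_timeout(limit)  # well above the ~16 s worst case seen so far
    try:
        driver.get(url)
    except TimeoutException:
        raise AssertionError(f"{url} did not load within {limit} s -- truly wrong")
```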

> ... so that we could see any performance regressions

that would be nice but even more tricky, since it would require some "benchmarking" dandiset to operate on, and indeed would require "maintenance" of such benchmarks, etc. FWIW, in principle, someone interested in regressions could create plots out of the git history of timings recorded in https://github.com/dandi/dandi-api-webshots for dandisets. But those timings would also include the effects of changes to metadata, so they would not be reliable, I would say.

> Can't we get some of this sort of testing from simply timing the webshots?

if you mean https://github.com/dandi/dandi-api-webshots -- then the answer is "yes" and we already do that. But that does not cover navigation through the files tree at all (which has been found to need fixes and refactoring a number of times already).
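A sketch of what a files-tree smoke test could look like -- note the dandiset URL and CSS selector are hypothetical, not the actual markup:

```python
# Sketch of smoke-testing files-tree navigation: open a dandiset's file
# browser and descend a few directory levels.  The URL and the
# ".directory-entry" selector are hypothetical stand-ins for the real markup.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://dandiarchive.org/dandiset/000003/files")  # illustrative URL

wait = WebDriverWait(driver, 30)
for _ in range(3):  # descend up to three directory levels
    entry = wait.until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, ".directory-entry"))
    )
    entry.click()
driver.quit()
```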

waxlamp commented 2 years ago

> > If we implement this, I don't think the wandering should be random (we want reproducibility in all of our CI-based tests).

> an ideal system would create/provide as an artifact a log of random steps it took, so it could be "replayed" if needed for troubleshooting.

Right, but my concern is more that on a given run of this test for an unrelated feature, the randomness might find a problem and block the merge of that unrelated feature.

It would be better if this test were scheduled to run nightly, so that it provides a stream of reports independent of PR-based CI.

> > if we set X to some high sentinel value, it won't have the specificity to tell us that something is truly wrong.

> I think if we set it to e.g. 30 seconds and we time out -- we know that something is truly wrong! ;) FWIW, at the moment the longest load time (for the "edit metadata" page) is 16 sec.

This is definitely preferable to setting it to 5 seconds and getting failures for cases where we actually are ok with something taking 6 seconds.

> > ... so that we could see any performance regressions

> that would be nice but even more tricky, since it would require some "benchmarking" dandiset to operate on, and indeed would require "maintenance" of such benchmarks, etc. FWIW, in principle, someone interested in regressions could create plots out of the git history of timings recorded in https://github.com/dandi/dandi-api-webshots for dandisets. But those timings would also include the effects of changes to metadata, so they would not be reliable, I would say.

Yeah, good point.

> > Can't we get some of this sort of testing from simply timing the webshots?

> if you mean https://github.com/dandi/dandi-api-webshots -- then the answer is "yes" and we already do that. But that does not cover navigation through the files tree at all (which has been found to need fixes and refactoring a number of times already).

Ok. Well, in the end I would be ok with implementing these tests, as long as we don't run them on every PR but rather at a set time each evening, or something like that.

yarikoptic commented 2 years ago

oh -- I never envisioned such a random walk being done for each PR, but I guess it could just as well be, if we find that it is generally stable -- there could only be benefits, IMHO, to detecting bugs/deficiencies before they are introduced.

mvandenburgh commented 2 years ago

I think it would be more valuable to write more extensive browser tests, at least as a starting point, before we implement this sort of "random wandering"-style test. There are entire parts of the UI, like the meditor and the file browser, that aren't tested at all right now (and indeed we have had bugs with the file browser, as @yarikoptic mentioned, that would likely have been caught earlier if it were properly tested).
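For example, a dedicated (non-random) file-browser test in pytest style could look roughly like this -- the URL and selector are placeholders for whatever a real fixture dandiset would provide:

```python
# Sketch of a dedicated (non-random) browser test for the file browser, in
# pytest style; the URL and ".file-row" selector are placeholders.
import pytest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


@pytest.fixture
def driver():
    d = webdriver.Chrome()
    yield d
    d.quit()


def test_file_browser_lists_files(driver):
    driver.get("https://dandiarchive.org/dandiset/000003/files")  # illustrative
    rows = WebDriverWait(driver, 30).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".file-row"))
    )
    assert rows, "file browser rendered no entries"
```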

yarikoptic commented 2 years ago

Sure, formalized browser tests would be great, but as discussed above they would be a pretty much separate testing setup, requiring work and subsequent maintenance. And I am all for having at least some tests like that; I also always say that it is good to start even with a little. But as I see it, the purpose of such a test suite would be quite different from a "brownian motion" walk, which has the potential to exercise various code paths reachable by a user that are unlikely to all be even smoke-tested by dedicated tests.

So, altogether, the two are complementary. I think it should be easier to start with formalized tests (if someone knows how), but it might be faster to get smoke-testing coverage (well, with some timeouts catching performance issues) of a larger portion of the codebase. Most likely both would be useful in the long run, and most likely useful "differently".