commercetest / vitals-scraper

Collect data to assist in analysing Google Play Console Android Vitals
MIT License

Thoughts on design and implementation choices #46

Open julianharty opened 5 years ago

julianharty commented 5 years ago

I would like us to make the scripts reliable, robust, and easy to maintain. I would also like to improve the operability and observability of the script so that those who run it can 'see', or check, what's happening and how well the script is interacting with and interpreting the contents of the various pages.

Context

In our current implementation we have ended up using various approaches to get the scraper code to correctly wait for, detect, read, and interact with the GUI. As the GUI does not seem to be designed to support software interactions, we sometimes end up writing code that interacts with seemingly odd elements. Also, some of the elements we use may appear on other pages, or be present in the page structure (DOM) before the overall page has the relevant content, e.g. when the site is slow.

Sometimes we will end up picking seemingly odd or irrelevant elements because they were the best we could discern at the time. This doesn't negate the value or intent of the choice we made; however these choices could lead to maintenance challenges, especially weeks or months in the future if we need to revisit the code because problems are occurring with the scraping process. Particularly pernicious are cases where the element the script relies on has changed and the new structure isn't easy to comprehend.

Please help

Here are some initial thoughts on ways we can improve the choices we make and also make the design and implementation easier to maintain as the site's codebase and/or behaviours change. Please add your thoughts, experiences and insights here too.

Note: As some screenshots may include sensitive information please obscure or exclude those details as best practical.

Some suggestions

As a general practice, where we can save a copy of both the appearance and the underlying DOM structure, let's do so. As a suggestion, perhaps a screenshot of the web browser developer tools that includes the relevant section of the web page and the HTML code would be suitable.
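As a rough sketch of what that capture step could look like with puppeteer (the `evidenceDir` path and the file naming are illustrative assumptions, not anything already in the code):

```typescript
import { promises as fs } from 'fs';
import * as path from 'path';
import { Page } from 'puppeteer';

// Save a screenshot plus the current DOM so we can later compare what the
// scraper "saw" against the locators it relied on.
// `evidenceDir` and the `label` naming convention are assumptions for illustration.
async function captureEvidence(page: Page, evidenceDir: string, label: string): Promise<void> {
    await fs.mkdir(evidenceDir, { recursive: true });
    const stamp = new Date().toISOString().replace(/[:.]/g, '-');
    await page.screenshot({ path: path.join(evidenceDir, `${label}-${stamp}.png`), fullPage: true });
    const html = await page.content();
    await fs.writeFile(path.join(evidenceDir, `${label}-${stamp}.html`), html, 'utf8');
}
```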

Implementation options:

  1. I'll start with the ideal - if the pages include a clearly identifiable and consistent element locator, let's use it :)
  2. Pick element locators that are stable and identify pertinent information. Add a note in the relevant source explaining why the element seems suitable, what makes it pertinent, and when it does and does not appear. Where practical, link to a copy of the screenshot mentioned above (screenshots could be checked into this project or added to an issue for the particular page or locator).
  3. Where the elements we pick seem transient, or may appear elsewhere (e.g. on the same page but for another app's statistics), could we add additional checks to reduce the chances of the code mistakenly interpreting the wrong page or content?
  4. If and when we observe anomalies, e.g. when pages partially render and then there is a noticeable delay before more of the page renders, add some notes, especially where we're using waits, or timers, or relying on timeouts.

Perhaps we could calculate an approximate 'lagginess' of the site and use that value as an input to waits, timers and timeouts, etc.?
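One possible sketch of that idea: time how long a known reference element takes to appear on a page we load anyway, and use the result to scale later waits. The selector, the expected time, and the scaling limits below are all assumptions for illustration, not anything the Play Console guarantees:

```typescript
import { Page } from 'puppeteer';

// Rough 'lagginess' estimate: how long did a reference element take to appear,
// relative to what we'd expect on a responsive day?
// `referenceSelector` and `expectedMs` are illustrative assumptions.
async function measureLagFactor(page: Page, referenceSelector: string, expectedMs = 2000): Promise<number> {
    const start = Date.now();
    await page.waitForSelector(referenceSelector, { timeout: 60000 });
    const elapsed = Date.now() - start;
    // Never scale below 1x; cap at 5x so one very slow load doesn't explode later timeouts.
    return Math.min(Math.max(elapsed / expectedMs, 1), 5);
}

// Later waits and timeouts could then be scaled by the measured factor.
function scaledTimeout(baseMs: number, lagFactor: number): number {
    return Math.round(baseMs * lagFactor);
}
```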

Note: I will add some examples to this ticket and may create some additional tickets with further examples.

julianharty commented 5 years ago

Here are 3 screenshots where the crash cluster details page did not load in a timely manner.

Content did not appear for many seconds (screenshot 2019-11-03 at 16:46:12)

Only the selection area appeared so far, two spinners (screenshot 2019-11-03 at 16:45:35)

Waiting for crash cluster details (screenshot 2019-11-03 at 16:46:49)

ISNIT0 commented 5 years ago

Totally agree, this has evolved and does things it wasn't originally intended to do :)

The primary place I think we can improve is waiting for pages to load. I noticed that the pages tend to have a loading indicator at the top. Perhaps a nice generic way to detect if a page is fully loaded is to check that? I had a very quick play with it and didn't see an obvious way to do that, but I can look into it more.

julianharty commented 5 years ago

Here are some thoughts on improving robustness and correctness: Where practical I believe it's healthy for the code to check it's where it's expected to be - on the right page, with the intended content we wish to obtain.

Owing to timing issues and the way the script and/or the puppeteer library interacts with the web app (written in GWT), I noticed that the script processed the crash cluster summary page for the previous app because the page contents weren't cleared while the new contents were loading for the current request. This seems to be an intermittent issue that occurred for several days but not otherwise (yet). One way we can improve the robustness and correctness is to include checks for expected elements / page structures (e.g. there should be a table of results) together with checks for contents, e.g. text labels ('crash clusters') and the unambiguous name of the desired app, e.g. WikiMed.
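A sketch of combining a structural check with content checks, so the script only proceeds when it is on the crash clusters page for the expected app. The selectors are assumptions that would need verifying against the real DOM, and 'WikiMed' stands in for whatever app the run is targeting:

```typescript
import { Page } from 'puppeteer';

// Confirm we're where we expect to be: a results table exists (structure),
// the 'crash clusters' label is present (page identity), and the expected
// app name appears (we're not still looking at the previous app's data).
async function isExpectedCrashClusterPage(page: Page, expectedAppName: string): Promise<boolean> {
    const hasTable = (await page.$('table')) !== null;
    const bodyText = await page.evaluate(() => document.body.innerText);
    const hasLabel = /crash clusters/i.test(bodyText);
    const hasAppName = bodyText.includes(expectedAppName);
    return hasTable && hasLabel && hasAppName;
}
```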

julianharty commented 5 years ago

For each screen for a given app I expect there will be elements in common (with other screens in the GUI and for the same screen for another app). There will also be differences in the content for that app.

For me, the script would ideally be able to check, recognise and combine various structural and content elements to determine that it's interacting with the correct page that has also loaded sufficiently for the relevant contents to be trustworthy. One possible approach is to sum up the various checks and continue once the script has sufficient confidence. I wrote a short article on a related topic back in 2007 https://www.stickyminds.com/article/improving-accuracy-tests-weighing-results which might help elaborate this concept.
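A sketch of that weighted-checks idea, where each check contributes a score and the script continues once the total passes a threshold. The individual checks, weights, and the 80% threshold are illustrative assumptions:

```typescript
import { Page } from 'puppeteer';

// Each check contributes a weight towards our confidence that the page is the
// one we want and has loaded sufficiently.
interface WeightedCheck {
    description: string;
    weight: number;
    passes: (page: Page) => Promise<boolean>;
}

async function confidenceScore(page: Page, checks: WeightedCheck[]): Promise<number> {
    let score = 0;
    for (const check of checks) {
        if (await check.passes(page)) {
            score += check.weight;
        }
    }
    return score;
}

// Example: proceed only when we reach, say, 80% of the maximum possible score.
async function pageLooksTrustworthy(page: Page, checks: WeightedCheck[]): Promise<boolean> {
    const maxScore = checks.reduce((sum, check) => sum + check.weight, 0);
    return (await confidenceScore(page, checks)) >= 0.8 * maxScore;
}
```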

julianharty commented 5 years ago

Gathering evidence is also useful at various stages in the lifecycle of this code:

During research and design we can use the evidence (e.g. screenshots, DOM elements, etc.) to help identify and select trustworthy elements to use in the code. Collecting contents at runtime may also help, particularly if an issue is transient.

Detecting error messages may also help, if only to indicate that the rest of the contents may be flawed/incomplete/wrong and therefore less trustworthy to collect and use later. Collecting errors, including error codes, may also help with assessing the reliability and availability of the service we are using.
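As a sketch, the script could scan the page for known error indications and log them alongside the evidence it captures. The phrases below are guesses to be refined as we encounter real failures:

```typescript
import { Page } from 'puppeteer';

// Phrases that suggest the page content may be incomplete or wrong.
// These are assumptions; real error banners would be added as we find them.
const ERROR_PHRASES = ['an error occurred', 'try again later', 'something went wrong'];

// Return any error-like text found on the page so the caller can log it and
// treat the scraped data as less trustworthy.
async function detectErrorMessages(page: Page): Promise<string[]> {
    const bodyText = (await page.evaluate(() => document.body.innerText)).toLowerCase();
    return ERROR_PHRASES.filter((phrase) => bodyText.includes(phrase));
}
```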

julianharty commented 5 years ago

So that the code can be used to obtain data for additional apps where we don't know any of their details beforehand, it may be necessary for the code to gather some information at runtime and use it later in the script to match details such as the name of the application.
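For example, the script might read the app's display name from an early page and carry it through the run so later checks can match on it. The selector here is a placeholder, not the Console's real markup:

```typescript
import { Page } from 'puppeteer';

// Placeholder selector: wherever the Console displays the current app's name.
const APP_NAME_SELECTOR = 'header .app-name';

// Capture the app name once, early in the run, so later page checks can
// confirm they are still looking at the same app's data.
async function readAppName(page: Page): Promise<string> {
    await page.waitForSelector(APP_NAME_SELECTOR);
    const name = await page.$eval(APP_NAME_SELECTOR, (el) => el.textContent ?? '');
    return name.trim();
}
```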

julianharty commented 5 years ago

A useful link: https://www.scrapehero.com/xpaths-and-their-relevance-in-web-scraping/

julianharty commented 5 years ago

And https://www.scrapehero.com/scalable-do-it-yourself-scraping-how-to-build-and-run-scrapers-on-a-large-scale/ (the section on maintenance has some good ideas on ways to configure the scraper to report problems as they are found).