GCuser99 / SeleniumVBA

A comprehensive Selenium wrapper for browser automation developed for MS Office VBA running in Windows
MIT License

Issues with GetText and a child FindElementByXPath acting as non-child #64

Closed. 6DiegoDiego9 closed this issue 1 year ago

6DiegoDiego9 commented 1 year ago

I can't understand these two issues:
1) If I replace every "GetInnerHTML" with "GetText", I get empty strings instead of the expected text.
2) The (commented-out) ".FindElementByXPath" call ignores that it is invoked on the "extension" element, so it returns a global match (always the first on the page) instead of the match inside the "extension" node.

Sub ChatGPTextensionsMonitor()
    'Collects all the Chrome extensions for keyword "ChatGPT"
    Dim browser As Master.WebDriver: Set browser = Master.New_WebDriver
    Dim elem As Master.WebElement, elems As Master.WebElements
    Dim oRegex As RegExp

    'On Error GoTo errHnd
    With browser
        .startedge: .OpenBrowser
        '.SetHightlightFoundElems True
        .NavigateTo "https://chrome.google.com/webstore/search/chatgpt?_category=extensions"
        .SetImplicitlyWait 3000
        Do 'keep scrolling down the infinite-scroll page until the end is reached
            Dim oldPageSize&, newPageSize&
            oldPageSize = Len(.GetPageSource)
            .ScrollToElement .FindElementByClassName("a-Hd-mb-Og-Ia") ' (empty) last rectangle at the end of the page
            .Wait 1000
            newPageSize = Len(.GetPageSource)
            DoEvents
        Loop Until Not (newPageSize > oldPageSize + 5) 'tolerance of 5 because I saw it increases by 4 bytes with the scroll after the last one

        Dim collExtensions As Master.WebElements, extension As Master.WebElement
        Set collExtensions = .FindElementsByClassName("a-na-d-K-A-w")
        For Each extension In collExtensions
            With extension
                Dim extrLogo$, extrTitle$, extrDomain$, extrDescr$, extrRatingStars$, extrRatingNumVoters$, extrCategory$
                extrLogo = .FindElementByTagName("img").getattribute("src")
                'extrTitle = .FindElementByXPath("//div[@role='heading']").GetInnerHTML 'bug: it takes the first out of all the page (always the same)
                extrTitle = .FindElementByCssSelector("div[role='heading']").GetInnerHTML 'GetText returns an empty string
                extrDomain = .FindElementByClassName("e-f-y ").GetInnerHTML
                extrDescr = .FindElementByClassName("a-na-d-Oa").GetInnerHTML

                If browser.IsPresent(className, "Y89Uic") Then
                    If oRegex Is Nothing Then
                        Set oRegex = New RegExp
                        Dim TEMPratingFullstring$, matches As MatchCollection
                        'decimal part made optional so that whole-number ratings (e.g. "5 out of 5") also match
                        oRegex.Pattern = "Average rating (\d(?:\.\d)?) out of \d\. +(\d+) users"
                    End If

                    TEMPratingFullstring = .FindElementByClassName("Y89Uic").getattribute("title")
                    Set matches = oRegex.Execute(TEMPratingFullstring)
                    extrRatingStars = matches(0).SubMatches(0)
                    extrRatingNumVoters = matches(0).SubMatches(1)
                End If

                extrCategory = .FindElementByClassName("a-na-d-ea").GetInnerHTML

                Debug.Print extrTitle, extrDomain, extrDescr, extrRatingStars, extrRatingNumVoters 'TEMP
                Stop
            End With
        Next
        .Shutdown
    End With
    Exit Sub
errHnd:
    err.Raise err.Number, err.Source, err.Description, err.HelpFile, err.HelpContext
End Sub

Can you tell where the problem originates? I'm not sure it's a bug in SeleniumVBA, because at first glance I can't find one in its code.

GCuser99 commented 1 year ago

On the GetText versus GetInnerHTML issue, I wonder if this is a clue?

Debug.Print .FindElementByCssSelector("div[role='heading']").IsDisplayed 'prints False

See [getText](https://www.selenium.dev/selenium/docs/api/java/org/openqa/selenium/WebElement.html#getText()) - "Get the visible (i.e. not hidden by CSS) text of this element, including sub-elements."
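A minimal sketch of how this clue could be used per element, assuming `extension` is one of the elements from the loop in the original code. It relies only on methods that already appear in this thread (IsDisplayed, GetText, GetInnerHTML); the helper name `ReadHeading` is hypothetical, and the original code qualifies the type as `Master.WebElement`.

```vba
'Hypothetical helper: reads the heading text of one extension element,
'falling back to the inner HTML when WebDriver reports the heading as
'not displayed (the case in which GetText returned an empty string above)
Function ReadHeading(extension As WebElement) As String
    Dim heading As WebElement
    Set heading = extension.FindElementByCssSelector("div[role='heading']")
    If heading.IsDisplayed Then
        ReadHeading = heading.GetText       'visible text, as rendered
    Else
        ReadHeading = heading.GetInnerHTML  'raw markup, regardless of visibility
    End If
End Function
```

Inside the loop this would be called as `extrTitle = ReadHeading(extension)`.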

GCuser99 commented 1 year ago

I wonder if this is the way to go...

'WebDriver class
Public Function GetText(element As WebElement, Optional ByVal asRendered As Boolean = True) As String
    If asRendered Then 'default behavior of W3C GetText
        Dim data As New Dictionary
        data.Add "id", element.elementId
        GetText = Execute(tCMD.CMD_GET_ELEMENT_TEXT, data)("value")
    Else 'get the underlying raw text
        GetText = ExecuteScript("return arguments[0].textContent;", element)
    End If
End Function

'in WebElement class:
Public Function GetText(Optional ByVal asRendered As Boolean = True) As String
    GetText = driver_.GetText(Me, asRendered)
End Function

Or maybe just add a GetTextContent method?

GCuser99 commented 1 year ago

... and on the XPath problem, try this (note the period before "//"):

extrTitle = .FindElementByXPath(".//div[@role='heading']").GetInnerHTML

"//" starts search from the document root. ".//" starts search from each extension element ("." meaning "self").
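The difference can be seen side by side with a small sketch. It assumes `collExtensions` is the collection returned by FindElementsByClassName in the original code; the Sub name `CompareXPathScopes` is hypothetical, and the original code qualifies the types with `Master.`.

```vba
'Hypothetical demo: runs both XPath forms against each extension element
Sub CompareXPathScopes(collExtensions As WebElements)
    Dim ext As WebElement
    For Each ext In collExtensions
        'absolute path: evaluated from the document root, so the context
        'element is ignored (reported above as always the first heading on the page)
        Debug.Print "//  : " & ext.FindElementByXPath("//div[@role='heading']").GetInnerHTML
        'relative path: evaluated beneath the current ext element only
        Debug.Print ".// : " & ext.FindElementByXPath(".//div[@role='heading']").GetInnerHTML
    Next
End Sub
```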

GCuser99 commented 1 year ago

I can see that you have a new hobby! :-) 👍

6DiegoDiego9 commented 1 year ago

[...] Or maybe just add a GetTextContent method?

Ah I didn't know about that official "visible" requirement. Thanks for the info! :)

Although their implementation (Selenium WebDriver? the Chromium engine?) seems buggy, since my texts are all displayed, I like your solution of keeping the "visible" requirement as the default and adding the optional argument. One thing: I'd prefer "includeInvisible As Boolean = False" or "VisibleOnly As Boolean = True" over "asRendered As Boolean = True", both for clarity and to match the wording of the official description.


XPath: ah, thanks for this info too! So, please correct me if I'm wrong, it works like this:


Eheh yes, I'm trying to get the most out of ChatGPT, and since I found that the extensions are growing like mushrooms after a rainstorm, and some are interesting, I need to make reviewing them more efficient :)

BTW, I stopped working on a ChatGPTpromptToClipboard procedure because I found that the prompt has significant size limits. I had planned for the procedure to prepend instructions on how to use SeleniumVBA to the user's prompt.

GCuser99 commented 1 year ago

@6DiegoDiego9, I took some time to look at the GetText issue. I don't think it's a bug.

The code line below gives yet another clue:

Debug.Print .FindElementByCssSelector("div[role='heading']").GetCSSProperty("visibility") 'returns "hidden"

If you insert the following line into your original code above, then you will see that the extension element is "visible" and the original GetText works fine:

            With extension
                .ScrollIntoView '<- this one

So the behavior seen in your original code is caused by dynamic CSS. As the extension elements scroll in and out of view, the value of their visibility property changes; that value is inherited from an ancestor div element. I also confirmed this with the dev tools, manually scrolling through the web page while watching the visibility style attribute change on the child elements of the div with class "h-a-x", under which your information of interest resides.
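Put together, the workaround could look like the sketch below. It assumes `collExtensions` is the collection from the original code and uses only ScrollIntoView, FindElementByCssSelector and GetText as they appear in this thread; the Sub name `PrintVisibleTitles` is hypothetical, and the original code qualifies the types with `Master.`.

```vba
'Hypothetical demo: scrolls each extension into view before reading its title
Sub PrintVisibleTitles(collExtensions As WebElements)
    Dim ext As WebElement
    For Each ext In collExtensions
        'bring the element into the viewport so the page switches its
        'visibility from "hidden" to visible before the text is read
        ext.ScrollIntoView
        Debug.Print ext.FindElementByCssSelector("div[role='heading']").GetText
    Next
End Sub
```

If the page updates its styles with a delay, a short pause after ScrollIntoView may be needed; that is an assumption, not something observed in this thread.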