commercetest / vitals-scraper

Collect data to assist in analysing Google Play Console Android Vitals
MIT License

Allow Scraping all Apps on Account #14

Open ISNIT0 opened 5 years ago

julianharty commented 5 years ago

As a suggestion on the command-line option, how about using:

Alternatively, perhaps a separate command-line argument would make the option clearer and less ambiguous?

For --packageName=* I can see the potential to support a filter for package names, e.g. --packageName=*wikimed*; however, that might be a leap too far in terms of this enhancement.
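As a rough illustration of the wildcard filter idea, here is a minimal sketch that turns a --packageName pattern into a regular expression and applies it to a list of package names. The helper names (packageNamePatternToRegExp, filterPackageNames) are hypothetical and not part of vitals-scraper, and it assumes that only * is treated as a wildcard.

```typescript
// Hypothetical helper (not part of vitals-scraper): convert a pattern such as
// "*wikimed*" into an anchored RegExp. Only "*" is treated as a wildcard;
// every other character, including ".", is matched literally.
function packageNamePatternToRegExp(pattern: string): RegExp {
    const escaped = pattern
        .split('*')
        .map(part => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
        .join('.*');
    return new RegExp(`^${escaped}$`);
}

// Hypothetical helper: keep only the package names that match the pattern.
function filterPackageNames(all: string[], pattern: string): string[] {
    const re = packageNamePatternToRegExp(pattern);
    return all.filter(name => re.test(name));
}
```

With this sketch, a bare * keeps every package on the account, while a plain package name with no wildcard only matches itself.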

julianharty commented 5 years ago

Here are some notes and tips to help improve how the code selects the details of the apps from the contents of the page: https://play.google.com/apps/publish/?account=<removed>#AppListPlace

In the Firefox Console, the following command selects links whose aria-label contains the text package. (I don't know whether this text is translated when Google Play Console is rendered in other languages; if so, this selector may need to be revised accordingly.)

$$('a[aria-label*="package"]', 'tbody')

0: <a aria-label="Application Pocket Code:…me org.catrobat.catroid" href="#AppDashboardPlace:p=org…pid=<removed>" data-column="TITLE">
1: <a aria-label="Application Pocket Paint… org.catrobat.paintroid" href="#AppDashboardPlace:p=org…pid=<removed>" data-column="TITLE">

BTW, $$('a[aria-label]', 'tbody') returns 13 matches:

0: <a class="gwt-Anchor GNVPVGB-C-a GNVPVGB-A-a" href="javascript:" aria-label="Dismiss error message">
1: <a class="gwt-Anchor" href="javascript:" role="option" aria-label="Hide unpublished apps" aria-selected="false">
2: <a class="gwt-Anchor" href="javascript:" role="option" aria-label="Hide draft apps" aria-selected="false">
3: <a aria-label="Application Pocket Code:…me org.catrobat.catroid" href="#AppDashboardPlace:p=org…pid=<removed>" data-column="TITLE">
4: <a href="#AppDashboardPlace:p=org…pid=<removed>" data-column="INSTALLS" aria-label="Installs on active devices is 37,958">
5: <a href="#RatingsPlace:p=org.catr…pid=<removed>" data-column="RATINGS" aria-label="Google Play rating is 3.89">
6: <a href="#AppDashboardPlace:p=org…pid=<removed>" data-column="UPDATE" aria-label="Last update was on Aug 21, 2019">
7: <a href="#AppDashboardPlace:p=org…pid=<removed>" data-column="STATUS" data-status="PUBLISHED" aria-label="App status is Published">
8: <a aria-label="Application Pocket Paint… org.catrobat.paintroid" href="#AppDashboardPlace:p=org…pid=<removed>" data-column="TITLE">
9: <a href="#AppDashboardPlace:p=org…pid=<removed>" data-column="INSTALLS" aria-label="Installs on active devices is 55,415">
10: <a href="#RatingsPlace:p=org.catr…pid=<removed>" data-column="RATINGS" aria-label="Google Play rating is 3.99">
11: <a href="#AppDashboardPlace:p=org…pid=<removed>" data-column="UPDATE" aria-label="Last update was on Aug 16, 2019">
12: <a href="#AppDashboardPlace:p=org…pid=<removed>" data-column="STATUS" data-status="PUBLISHED" aria-label="App status is Published">

$$('a[href*="org"]') returned 12 matches: 11 are also in the list above, and the 12th is the mailto link for the developers.

I think I'd like to combine CSS selectors to filter more precisely for links that have an aria-label and whose href contains the package name in the URL-encoded parameter p. That would also work for any package name (not just ones for 'orgs', i.e. not just those that contain org). I don't yet know how to create and apply a suitable compound selector.
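One possible way to build such a compound selector: in CSS, attribute selectors written back to back with no space between them must all match the same element. The sketch below combines the aria-label, the "AppDashboardPlace:p=" part of the href and the data-column value seen in the console output above. The function name appLinkSelector is hypothetical, and the sketch assumes the page markup still looks like that output.

```typescript
// Hypothetical helper: build a compound CSS selector for the app links in the
// app list table. Chained attribute selectors (no spaces between them) must
// all match the same <a> element:
//   [aria-label]                    the link has an aria-label
//   [href*="AppDashboardPlace:p="]  the href carries the package parameter p
//   [data-column="TITLE"]           one link per app instead of one per column
function appLinkSelector(column: string = 'TITLE'): string {
    return `tbody a[aria-label][href*="AppDashboardPlace:p="][data-column="${column}"]`;
}

// Intended use with the existing page object (not executed here):
//   const hrefs = await page.$$eval(appLinkSelector(), (as: any[]) => as.map(a => a.href));
```

Because this version does not rely on the word package in the aria-label, it may also be less sensitive to the console language, though that is untested.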

Tips:

julianharty commented 5 years ago

Some more notes related to the implementation: both the text content and the href contain the package name.

I reckon we can make our code more reliable, and a little clearer, by using the information from the href. Also, I've created a new issue #24 where the lack of appid seems to be causing problems in some cases.

julianharty commented 5 years ago

To help with improving the selector, here's my similar code snippet:

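    public async getListOfApps(): Promise<string[]> {
        const page = await this.claimPage();
        try {
            await page.goto(`https://play.google.com/apps/publish/?account=${this.accountId}#AppListPlace`);
            // Same selector as used in the Firefox console: $$('a[aria-label*="package"]', 'tbody')
            const selector = 'a[aria-label*="package"]';
            // The app list appears to be rendered by script after page load, so wait for
            // at least one app link (this times out if the account has no apps).
            await page.waitForSelector(selector);
            // a.href is the resolved absolute URL, not the raw "#AppDashboardPlace:…" attribute value.
            const appHrefs = await page.$$eval(selector, (as: any[]) => as.map(a => a.href));
            return appHrefs;
        } finally {
            this.releasePage(page);
        }
    }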

We'd need to parse the hrefs to extract the package name (and perhaps also the appid).
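A minimal sketch of that parsing step follows. The hrefs in the output above are truncated, so the full fragment format "#AppDashboardPlace:p=<package>&appid=<id>" is an assumption here, and the names AppRef and parseAppHref are hypothetical.

```typescript
// Hypothetical result type: the package name plus the appid when present.
interface AppRef {
    packageName: string;
    appId?: string;
}

// Hypothetical helper. Assumes the href fragment looks like
// "#AppDashboardPlace:p=<package>&appid=<id>" (the appid part may be missing).
function parseAppHref(href: string): AppRef | undefined {
    const hashIndex = href.indexOf('#');
    if (hashIndex < 0) return undefined;              // e.g. the mailto link
    const fragment = href.slice(hashIndex + 1);       // "AppDashboardPlace:p=..."
    const colonIndex = fragment.indexOf(':');
    if (colonIndex < 0) return undefined;
    // Split "p=...&appid=..." into key/value pairs.
    const params: Record<string, string> = {};
    for (const pair of fragment.slice(colonIndex + 1).split('&')) {
        const eq = pair.indexOf('=');
        if (eq > 0) {
            params[pair.slice(0, eq)] = decodeURIComponent(pair.slice(eq + 1));
        }
    }
    const packageName = params['p'];
    if (!packageName) return undefined;
    return { packageName, appId: params['appid'] };
}
```

Returning undefined for links without a p parameter lets the caller filter out unrelated hrefs, and leaving appId optional keeps the function usable in cases like issue #24 where the appid is missing.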

devmalik7 commented 3 years ago

I can help you...