BaseMax / GooglePlayWebServiceAPI

Tiny PHP script to crawl information about a specific application in the Google Play store.
MIT License

category return wrong id #22

Closed andaroid closed 2 years ago

andaroid commented 2 years ago

Sometimes the category comes back with the wrong ID: "com.whatsapp" returns the Kids category!

array(6) {
  ["packageName"]=>
  string(12) "com.whatsapp"
  ["name"]=>
  string(18) "WhatsApp Messenger"
  ["developer"]=>
  string(12) "WhatsApp LLC"
  ["category"]=>
  string(4) "Kids"
  ["type"]=>
  string(6) "family"
  ["summary"]=>
  string(26) "Simple. Reliable. Private."
}

BaseMax commented 2 years ago

P.S.: Sorry, I am traveling at the moment, currently going from Napoli to Rome, so I will check this with a delay.

Probably the main official source for this app: https://play.google.com/store/apps/details?id=com.whatsapp&hl=en&gl=US

@IzzySoft Would you please test it, if you find the time?

IzzySoft commented 2 years ago

I was AFK as well and am now drowning in what stacked up… I've just looked at the page source – it seems that all "visible" (HTML) references for WA (there are 2) point to FAMILY (aka "Kids"). Only the protobuf data has COMMUNICATION:

[[["Communication",[null,null,null,null,[null,null,"/store/apps/category/COMMUNICATION"]],"COMMUNICATION"]]]

So this will only be noticed when checking manually :cry: Looks like we have to switch the source for the category to protobuf then (and only fall back to the other source if the lookup fails).
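
To illustrate that lookup order, here is a minimal sketch. The function `pickCategory()` and its parameters are hypothetical and not part of google-play.php; it assumes the fragment quoted above has already been decoded into a PHP array and that the HTML-derived category is passed in separately.

```php
<?php
// Hypothetical helper, not part of google-play.php.
// $fragment: the decoded category fragment from the embedded page data (or null if missing).
// $htmlCategory: whatever the HTML regex produced (may be wrong, or null).
function pickCategory(?array $fragment, ?string $htmlCategory): ?string {
    // Prefer the embedded data when the expected slot exists.
    if (isset($fragment[0][0][0]) && is_string($fragment[0][0][0])) {
        return $fragment[0][0][0];
    }
    // Otherwise fall back to the HTML-derived value.
    return $htmlCategory;
}

$fragment = json_decode(
    '[[["Communication",[null,null,null,null,[null,null,"/store/apps/category/COMMUNICATION"]],"COMMUNICATION"]]]',
    true
);
echo pickCategory($fragment, 'Kids'), "\n"; // Communication
echo pickCategory(null, 'Kids'), "\n";      // Kids
```

The fallback only triggers when the embedded data is absent, so a wrong HTML value such as "Kids" is used only as a last resort.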

Not sure when I'll find time to do that, might take a little. Thanks for reporting, @andaroid – it might have taken even longer for us to spot and thus to fix! I'll do my best to fix it as speedily as possible, but cannot promise anything.

andaroid commented 2 years ago

@IzzySoft Hi, I tried to fix it by using ld+json, which contains the right category. Please check https://github.com/BaseMax/GooglePlayWebServiceAPI/pull/23

IzzySoft commented 2 years ago

That looks more reliable than my protobuf hacks. I had just set up this:

@ google-play.php:216 @ class GooglePlay {
         if ( empty($values["featureGraphic"]) ) $values["featureGraphic"] = $proto[1][2][96][0][3][2];
         if ( empty($values["video"]) && !empty($proto[1][2][100]) ) $values["video"] = $proto[1][2][100][0][0][3][2];
         if ( empty($values["summary"]) && !empty($proto[1][2][73]) ) $values["summary"] = $proto[1][2][73][0][1]; // 1, 2, 73, 0, 1
+        if ( !empty($proto[1][2][79]) ) {
+          $values["category"] = $proto[1][2][79][0][0][0];
+          switch($proto[1][2][79][0][0][2]) { // category from HTML sometimes is wrong, e.g. "Kids" with WhatsApp (com.whatsapp)
+            case "GAME": $values["type"] = "game"; break;
+            case "FAMILY": $values["type"] = "family"; break;
+            default: $values["type"] = "app"; break;
+          }
+        }
         // screenshots: 1,2,78,0,0-n; 1=format,2=[wid,hei],3.2=url
         // more details see: https://github.com/JoMingyu/google-play-scraper/blob/2caddd098b63736318a7725ff105907f397b9a48/google_play_scraper/constants/element.py
         break;

But protobuf sometimes needs more than 5 reloads to show up. Yours seems to hit it on the first try.

@BaseMax you're OK to go with the solution @andaroid is offering with the mentioned PR? Maybe the formatting should match the way all the other code is formatted (which also would compact it a bit), but then I'd say it's the better approach.

@andaroid maybe you have something similar for featureGraphic, summary and video as well, so we can save ourselves the reloads?

andaroid commented 2 years ago

@IzzySoft ld+json offers only this data, without featureGraphic and video.

ld+json data:

{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "Enhance Photo Quality",
  "url": "https://play.google.com/store/apps/details/Enhance_Photo_Quality?id=com.smartworld.enhancephotoquality&hl=en&gl=US",
  "description": "App for enlarge image without losing quality, enhance color and photo resolution",
  "operatingSystem": "ANDROID",
  "applicationCategory": "PHOTOGRAPHY",
  "image": "https://play-lh.googleusercontent.com/chvvSlAFzWN16LrHPxO2WAg7LjekVsvgP_BQM9I7nqabiIEQe4hrf8Z8oPPsVSj7uw",
  "contentRating": "Everyone",
  "author": {
    "@type": "Person",
    "name": "Csmartworld"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.053050518035889",
    "ratingCount": "36957"
  },
  "offers": [
    {
      "@type": "Offer",
      "price": "0",
      "priceCurrency": "USD",
      "availability": "https://schema.org/InStock"
    }
  ]
}

For the category ID you can use a preg_match regex without parsing the ld+json data, e.g.: $values["category"] = $this->getRegVal('/applicationCategory\"\:\"(?<content>[^"]+)\"/iu'); I think that's the better solution for fixing the category ID.
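
As a standalone illustration of the regex idea, here is a minimal sketch. `categoryIdFromHtml()` is a hypothetical function written for this example; it does not reproduce the project's getRegVal(), and the `\s*` around the colon is an addition so that pretty-printed JSON would match too.

```php
<?php
// Hypothetical standalone version of the regex idea; getRegVal() is not reproduced here.
function categoryIdFromHtml(string $html): ?string {
    // \s* tolerates optional whitespace around the colon (an addition to the pattern above).
    if (preg_match('/"applicationCategory"\s*:\s*"(?<content>[^"]+)"/iu', $html, $m)) {
        return $m['content'];
    }
    return null; // no ld+json category found in this page
}

$html = '<script type="application/ld+json">'
      . '{"@type":"SoftwareApplication","applicationCategory":"PHOTOGRAPHY"}'
      . '</script>';
echo categoryIdFromHtml($html), "\n"; // PHOTOGRAPHY
```

The trade-off is that a regex yields only this one field, while parsing the JSON makes the remaining fields available as well.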

IzzySoft commented 2 years ago

> ld+json offer this data only, without featureGraphic and video

Yes, that was the only thing I found, too. I was just hoping I had missed something…

andaroid commented 2 years ago

@IzzySoft So what are we going to do now?

IzzySoft commented 2 years ago

@andaroid if you can adjust the formatting to match the project's code, it seems fine to me. If I understood you correctly that the single line you just posted does the same as the JSON parsing (your lines 145-150), and if we don't want to use other values from the JSON, feel free to rewrite it to that. OTOH we could consider taking the other values (author, ratings, price) from the JSON as well. In particular, it would be great to include

if ( empty($values["summary"]) ) $values["summary"] = $data["description"];

(which is currently part of the protobuf fallback).
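
A minimal sketch of what taking those other values from the JSON could look like. The function `valuesFromLdJson()` and the result keys `rating` and `votes` are hypothetical; the field names on the JSON side are the ones shown in the ld+json sample earlier in this thread.

```php
<?php
// Hypothetical mapper: none of these names come from google-play.php.
function valuesFromLdJson(string $json): array {
    $data = json_decode($json, true);
    if (!is_array($data)) return []; // missing or malformed JSON: let other sources fill in

    $rating = $data['aggregateRating'] ?? [];
    return [
        'category'  => $data['applicationCategory'] ?? null,
        'developer' => $data['author']['name'] ?? null,
        'rating'    => isset($rating['ratingValue']) ? (float) $rating['ratingValue'] : null,
        'votes'     => isset($rating['ratingCount']) ? (int) $rating['ratingCount'] : null,
        'price'     => $data['offers'][0]['price'] ?? null,
        'summary'   => $data['description'] ?? null,
    ];
}

$json = '{"applicationCategory":"PHOTOGRAPHY",'
      . '"description":"App for enlarge image without losing quality, enhance color and photo resolution",'
      . '"author":{"@type":"Person","name":"Csmartworld"},'
      . '"aggregateRating":{"ratingValue":"4.053050518035889","ratingCount":"36957"},'
      . '"offers":[{"price":"0","priceCurrency":"USD"}]}';
$fromJson = valuesFromLdJson($json);

// Only fill gaps, mirroring the empty() check above.
$values = ['summary' => ''];
if (empty($values['summary']) && !empty($fromJson['summary'])) {
    $values['summary'] = $fromJson['summary'];
}
echo $fromJson['category'], ' / ', $fromJson['developer'], "\n"; // PHOTOGRAPHY / Csmartworld
```

Every lookup uses `??` or `isset()`, so a page whose JSON lacks one of the fields simply yields null for it instead of raising a warning.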

So maybe the best idea is:

All of that is pending approval by @BaseMax :wink:

IzzySoft commented 2 years ago

Basically, this is what I think:

@ google-play.php:143 @ class GooglePlay {
     }

     $values["developer"] = strip_tags($this->getRegVal('/href="\/store\/apps\/dev(eloper)*\?id=(?<id>[^\"]+)"([^\>]*|)>(\<span[^\>]*>)*(?<content>[^\<]+)(<\/span>|)<\/a>/i'));

-    preg_match('/<a class="WpHeLc VfPpkd-mRLv6 VfPpkd-RLmnJb" href="\/store\/apps\/category\/(?<id>[^\"]+)" aria-label="(?<content>[^\"]+)"/i', $this->input, $category);
-    if ( empty($category) ) preg_match('/href="\/store\/apps\/category\/(?<id>[^\"]+)" data-disable-idom="true" data-skip-focus-on-activate="false" jsshadow><span class="VfPpkd-N5Lhkf" jsname="bN97Pc"><span class="VfPpkd-jY41G-V67aGc" jsname="V67aGc">(?<content>[^\<]+)<\/span>/i', $this->input, $category);
-    if (isset($category["id"], $category["content"])) {
-      $values["category"] = trim(strip_tags($category["content"]));
-      $catId = trim(strip_tags($category["id"]));
-      if ($catId=='GAME' || substr($catId,0,5)=='GAME_') $values["type"] = "game";
-      elseif ($catId=='FAMILY' || substr($catId,0,7)=='FAMILY?') $values["type"] = "family";
-      else $values["type"] = "app";
-    } else {
-      $values["category"] = null;
-      $values["type"] = null;
-    }

     $values["summary"] = strip_tags($this->getRegVal('/property="og:description" content="(?<content>[^\"]+)/i'));
     $values["description"] = $this->getRegVal('/itemprop="description"[^\>]*><div class="bARER"[^\>]*>(?<content>.*?)<\/div><div class=/i');
     if ( strtolower(substr($lang,0,2)) != 'en' ) { // Google sometimes keeps the EN description additionally, so we need to filter it out **TODO:** check if this still applies (2022-05-27)
@ google-play.php:192 @ class GooglePlay {
     $values["votes"] = $this->getRegVal('/<div class="g1rdde">(?<content>[^>]+) reviews<\/div>/i');
     $values["price"] = $this->getRegVal('/<meta itemprop="price" content="(?<content>[^"]+)">/i');

+    $d = new DomDocument();
+    @$d->loadHTML($this->input);
+    $xp = new domxpath($d);
+    $jsonScripts = $xp->query( '//script[@type="application/ld+json"]' );
+    $json = trim( @$jsonScripts->item(0)->nodeValue ); //
+    $data = json_decode($json,true);

+    if(isset($data['applicationCategory'])) {
+      $values["category"] = $data['applicationCategory'];
+      if(substr($values["category"],0,5)=='GAME_') $values["type"] = "game";
+      elseif(substr($values["category"],0,7)=='FAMILY?') $values["type"] = "family";
+      else $values["type"] = "app";
+    } else {
+      $values["category"] = null;
+      $values["type"] = null;
+    }
+    if ( empty($values["summary"]) && !empty($data["description"]) ) $values["summary"] = $data["description"];

     $limit = 5; $proto = '';
     while ( empty($proto) && $limit > 0 ) { // sometimes protobuf is missing, but present again on subsequent call
       $proto = json_decode($this->getRegVal("/key: 'ds:4'. hash: '7'. data:(?<content>\[\[\[.+?). sideChannel: .*?\);<\/script/ims")); // ds:8 hash:22 would have reviews
@ google-play.php:221 @ class GooglePlay {
         if ( empty($values["video"]) && !empty($proto[1][2][100]) ) $values["video"] = $proto[1][2][100][0][0][3][2];
         if ( empty($values["summary"]) && !empty($proto[1][2][73]) ) $values["summary"] = $proto[1][2][73][0][1]; // 1, 2, 73, 0, 1
         // screenshots: 1,2,78,0,0-n; 1=format,2=[wid,hei],3.2=url
+        // category: $proto[1][2][79][0][0][0]; catId: $proto[1][2][79][0][0][2]
         // more details see: https://github.com/JoMingyu/google-play-scraper/blob/2caddd098b63736318a7725ff105907f397b9a48/google_play_scraper/constants/element.py
         break;
       }

The only drawback of the ld+json approach is that the category is then all-caps, since the JSON carries the category ID rather than the display name. We could work around that by calling parseCategories() and mapping it accordingly – or simply leave that to the "user". Ouch, after fixing that method, that is…
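
Since the JSON value is a category ID, the type can be derived from it directly. The sketch below is hypothetical (`typeFromCategoryId()` is not a project function). It follows the rules of the removed HTML-based code, which also accepted the bare IDs GAME and FAMILY, something the prefix-only checks in the diff above would not match.

```php
<?php
// Hypothetical helper, not part of google-play.php.
function typeFromCategoryId(?string $catId): ?string {
    if ($catId === null || $catId === '') return null;

    // "GAME" itself or any "GAME_*" sub-category
    if ($catId === 'GAME' || strncmp($catId, 'GAME_', 5) === 0) return 'game';

    // "FAMILY" itself or the "FAMILY?..." form seen in category URLs
    if ($catId === 'FAMILY' || strncmp($catId, 'FAMILY?', 7) === 0) return 'family';

    return 'app';
}

echo typeFromCategoryId('COMMUNICATION'), "\n"; // app
echo typeFromCategoryId('GAME_ACTION'), "\n";   // game
echo typeFromCategoryId('FAMILY'), "\n";        // family
```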

IzzySoft commented 2 years ago

OK, I've fixed parseCategories(). It now uses a local list of all categories (categories.jsonl, in JSONL format for easy maintenance), as I couldn't find them listed in the original place anymore. Whoever wants the category names instead of the IDs can now obtain the list and map it as needed. The type is defined there as well:

Array
(
    [success] => 1
    [message] => 
    [data] => Array
        (
            [ANDROID_WEAR] => stdClass Object
                (
                    [id] => ANDROID_WEAR
                    [name] => Wear OS by Google
                    [type] => app
                )

            [ART_AND_DESIGN] => stdClass Object
                (
                    [id] => ART_AND_DESIGN
                    [name] => Art & Design
                    [type] => app
                )
 ...
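
For readers who want to do that mapping themselves, here is a minimal sketch of reading such a list. It assumes each line of the JSONL data is one JSON object with `id`, `name` and `type` fields, as the output above suggests; the function name `loadCategoryList()` is hypothetical and the two sample lines are reconstructed from that output rather than copied from categories.jsonl.

```php
<?php
// Hypothetical loader; the real parseCategories() may work differently.
function loadCategoryList(string $jsonl): array {
    $categories = [];
    foreach (preg_split('/\R/', $jsonl) as $line) {
        $line = trim($line);
        if ($line === '') continue;           // skip blank lines
        $cat = json_decode($line);            // one JSON object per line
        if (is_object($cat) && isset($cat->id)) {
            $categories[$cat->id] = $cat;     // key the list by category ID
        }
    }
    return $categories;
}

$jsonl = <<<'JSONL'
{"id":"ANDROID_WEAR","name":"Wear OS by Google","type":"app"}
{"id":"ART_AND_DESIGN","name":"Art & Design","type":"app"}
JSONL;

$categories = loadCategoryList($jsonl);
$catId = 'ART_AND_DESIGN';                    // e.g. the all-caps ID from ld+json
echo $categories[$catId]->name ?? $catId, "\n"; // Art & Design
```

Falling back to the raw ID when an entry is missing keeps the lookup usable even if the local list lags behind newly added categories.
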
IzzySoft commented 2 years ago

@andaroid so will you perform the above mentioned adjustments?

IzzySoft commented 2 years ago

Thanks once again for pointing out the path, @andaroid! As there was no response from @BaseMax and you didn't do the reorg, I've just pushed it myself. Hope you don't mind; attribution given with the commit :wink:

@BaseMax I've also increased the version number in the header. As the structure returned by parseCategories() is different from what it returned before (now it's an array of objects with category details instead of just a simple array of category IDs), I've increased "minor" (1.0.1 => 1.1.0). The method also no longer needs network traffic (using local definitions as the other list was no longer available) and thus is much faster :smile: For what it returns, see 2 comments above.

Issue should be solved now, so I'm closing it.