danieliu / play-scraper

A web scraper to retrieve application data from the Google Play Store.
MIT License
232 stars 103 forks

Search retrieves same apps although page is different #17

Open desconectad0 opened 5 years ago

desconectad0 commented 5 years ago

Hi.

When I try the search function the results are always the same:

pprint(p.search('tinder', page=1))
pprint(p.search('tinder', page=12))

Both calls give me the same results.
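The duplication is easy to check programmatically. A minimal sketch, assuming the list-of-dicts shape (with an 'app_id' key) that play-scraper's search() returns; the comparison is factored into a plain helper so it can be exercised without network access:

```python
def pages_are_duplicates(page_a, page_b):
    """True if two search-result pages contain the same apps in the
    same order. Each page is a list of dicts with an 'app_id' key,
    matching the shape play-scraper's search() returns."""
    return [a["app_id"] for a in page_a] == [b["app_id"] for b in page_b]

# Live usage (requires `pip install play-scraper` and network access):
#   import play_scraper
#   p1 = play_scraper.search('tinder', page=1)
#   p12 = play_scraper.search('tinder', page=12)
#   pages_are_duplicates(p1, p12)   # the bug makes this come out True
```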

desconectad0 commented 5 years ago

Hi again.

For the purposes of my app, a co-worker has found a workaround, though it doesn't fix the underlying issue.

In the search function, on the line where the response is assigned, instead of:

response = send_request('POST', self._search_url, data, self.params)

I put this:

response = send_request('GET', self._search_url, params=self.params, allow_redirects=False)

This workaround is enough for now, because it retrieves 49 results instead of 20. I don't know if it is possible to get further pages with the GET method, so as far as I can tell, getting more results would require POST requests to fetch those pages.
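The same GET-based idea can be sketched outside the library against the store's public search URL (the q= and c= parameter names mirror what the web store's search page uses; the URL-building part is pure so it can be checked offline):

```python
from urllib.parse import urlencode

BASE = "https://play.google.com/store/search"

def search_url(query, category="apps"):
    """Build the GET search URL the web store accepts (q= and c= params)."""
    return f"{BASE}?{urlencode({'q': query, 'c': category})}"

# Fetching (network required); allow_redirects=False mirrors the workaround:
#   import requests
#   html = requests.get(search_url('tinder'), allow_redirects=False).text
#   # the HTML then needs parsing for /store/apps/details?id=... links
```

Note this reproduces the single ~49-result GET page described above, not real pagination.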

cschwem2er commented 5 years ago

Is this bug still occurring with the latest version of play-scraper?

sivaratna commented 5 years ago

Hi,

Firstly, thanks so much to the contributors for this nice Play scraper.

Secondly, I am having the same issue (same results on different pages) with the latest version (just installed today). I am also wondering if it is possible for me to count the number of total search results or to get all results instead of just 20.

Any help or insights would be greatly appreciated. Thanks!

danieliu commented 5 years ago

@sivaratna unfortunately, the endpoint used to search has changed over time from when the scraper was first written. I haven't had the time to look into this and whether a new endpoint exists or is possible to replicate for searching.

The GET method seems to be the best bet currently, although it seems limited in capabilities.

Open to any pull requests.

milcs commented 5 years ago

I've observed the same, and had to write a custom Selenium-based app search that returns all 250 apps when searching for a keyword. After that, I used play-scraper with those links to fetch individual app details.

I noticed Play has quite a few apps where browsing to the app link produces the infamous error 500 (internal server error). I was using a concurrency of 12 processes and never had a glitch there. However, when scraping a large number of keywords and processing about 1M apps, the library would raise unhandled exceptions (no try/except) in several places while extracting app details from the fetched page, so I had to fix those directly in the library code. Such is the business of scraping: a scraper becomes obsolete in no time.
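A rough sketch of that Selenium approach, under assumptions: the scroll loop and selector below are illustrative, not the commenter's actual code, though /store/apps/details?id=... is the store's real link scheme. The link-parsing helper is pure so it can be checked without a browser:

```python
import re

APP_LINK = re.compile(r"/store/apps/details\?id=([\w.]+)")

def extract_app_ids(hrefs):
    """Pull package names out of store links collected from the page."""
    ids = []
    for href in hrefs:
        m = APP_LINK.search(href)
        if m:
            ids.append(m.group(1))
    return ids

# Browser side (Selenium + network required); scroll until height stops growing:
#   import time
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://play.google.com/store/search?q=tinder&c=apps")
#   last = 0
#   while True:
#       driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
#       time.sleep(2)  # let the infinite-scroll content load
#       height = driver.execute_script("return document.body.scrollHeight")
#       if height == last:
#           break
#       last = height
#   hrefs = [a.get_attribute("href") for a in
#            driver.find_elements_by_css_selector("a[href*='details?id=']")]
#   app_ids = extract_app_ids(hrefs)
```

Each extracted id can then be passed to play-scraper's details() (wrapped in try/except, per the exceptions noted above).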

a-l-e-x-d-s-9 commented 5 years ago

I've just tested the 0.5.5 version, and every page number returns only the first page. I tried to change from POST to GET as @desconectad0 suggested, but it returns nothing at all. Any suggestions or other workarounds?

vibeordie commented 4 years ago

I'm having the same issue with the latest version. Could somebody please try to fix this or help us work around it? I get 50 results instead of 20, but page=1 and page=2 return the exact same list. Apparently there is a pagination token hardcoded for pages 0 to 12, but for some reason this is not working properly.

andodet commented 4 years ago

Problem still happening in 0.6.

As of now, store pagination is based on infinite scrolling: when the bottom of the page is reached, additional content is loaded. I'm afraid that piece of JavaScript has to be triggered by an actual browser (e.g. Splash).

I can't quite get my mind around the hard-coded tokens in settings.py and what role they play in the behaviour described above.

milcs commented 4 years ago

I ended up writing my own Selenium-based code to get around these problems. It included scrolling down to the bottom of the page to trigger the loading of additional items. This scraper was too good to be true, and then it simply stopped working.


andodet commented 4 years ago

Cheers @milcs, that makes perfect sense. I think that with the pagination logic isolated, most of this library would still be highly relevant; I just need to find a bit of time to get my mind around it.