MontFerret / ferret

Declarative web scraping
https://www.montferret.dev/
Apache License 2.0

Google example does not work with version 0.10 #460

Closed. gsempe closed this issue 2 years ago

gsempe commented 4 years ago

Describe the bug: The Google example no longer works with version 0.10.

To Reproduce: Steps to reproduce the behavior:

  1. Run the script https://github.com/MontFerret/ferret/blob/master/examples/google-search.fql with version 0.9.
  2. The script works as expected.
  3. Run the same script, https://github.com/MontFerret/ferret/blob/master/examples/google-search.fql, with version 0.10.
  4. The script prints the error:
Failed to execute the query
cdp.DOM: GetContentQuads: rpc error: Could not compute content quads. (code = -32000): CLICK(google,'input[name="btnK"]') at 8:0

Expected behavior: The script should behave the same with version 0.10 as it does with version 0.9.

Additional context: Chrome is launched with the following command line:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222

gsempe commented 4 years ago

As additional information: the bug seems to occur only rarely on the first crawl after Chrome is launched, but every time on the second one.

ziflex commented 4 years ago

Does the behavior persist if you use a dockerized Chrome? For example, the microbox/chromium-headless:77.0.3844.0 image?

gsempe commented 4 years ago

I just tried with alpeware/chrome-headless-stable:latest and the problem does not happen with this dockerized Chrome version.

From my point of view, that is a good workaround. How do you want to handle this issue on the project side?

ziflex commented 4 years ago

Great! Well, the problem is that Chrome is not very stable, and only specific versions work well. Here you can read how the Puppeteer team solved this problem.

I guess I need to add a note to the README that not all Chrome versions work equally well with Ferret.

gonssal commented 2 years ago

I'm getting this same error with a site. The FQL script is the same as other sites I'm also crawling, the only things that change are the selectors, and I only get the error for this one site.

I'm using the montferret/chromium image in a docker compose setup. I tried updating it and also using it directly without compose, manually executing the script. I always get the error. Any idea on what to do/check to fix it?

ziflex commented 2 years ago

Hmm, is it happening on Google Search page only?

gonssal commented 2 years ago

It's not on Google, it's a real estate site.

ziflex commented 2 years ago

Are you using the latest version of Ferret?

Could you give me an example of your query?

gonssal commented 2 years ago

I get nothing with ferret -version, but I ran go get -u, so in theory I'm using v1.5.0 of the CLI. Ferret itself is v0.16.3; I updated it to try again just in case.

Here is a working query: https://marcgonzález.com/p/ferret/immo-properties-index.fql
Here is basically the same query, failing on another site: https://marcgonzález.com/p/ferret/7claus-properties-index.fql

I added SCROLL_BOTTOM to the second query because I thought the problem might be that some elements appear on scroll, but the error persists.

ziflex commented 2 years ago

I think the problem is that the 'Next' element you are trying to click is not visible, even though you are using the .visible-xs selector. Your selector works only when the screen is the size of a mobile phone.

Bottom line: change .visible-xs to .hidden-xs.
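
To illustrate why swapping the class helps, here is a minimal sketch that models when an element carrying one of these classes is displayed. It assumes the site uses Bootstrap 3's responsive utility classes and breakpoints (768, 992 and 1200 px); the thread does not confirm this. The names `size_class` and `is_displayed` are hypothetical helpers, not part of Ferret or Bootstrap.

```python
# Assumption: Bootstrap 3 breakpoints, given as minimum viewport width in CSS pixels.
BREAKPOINTS = [("xs", 0), ("sm", 768), ("md", 992), ("lg", 1200)]


def size_class(viewport_width: int) -> str:
    """Return the Bootstrap 3 size bucket (xs, sm, md, lg) for a viewport width."""
    name = "xs"
    for candidate, min_width in BREAKPOINTS:
        if viewport_width >= min_width:
            name = candidate
    return name


def is_displayed(utility_class: str, viewport_width: int) -> bool:
    """Model whether an element with .visible-<size> or .hidden-<size> is displayed.

    Only the plain forms (for example .visible-xs, .hidden-xs) are handled.
    """
    kind, _, size = utility_class.lstrip(".").partition("-")
    if kind not in ("visible", "hidden") or size not in dict(BREAKPOINTS):
        raise ValueError(f"unsupported utility class: {utility_class!r}")
    matches = size_class(viewport_width) == size
    return matches if kind == "visible" else not matches


if __name__ == "__main__":
    # Illustrative widths only: a desktop-sized and a phone-sized viewport.
    for width in (1280, 375):
        print(width, is_displayed(".visible-xs", width), is_displayed(".hidden-xs", width))
```

Under these assumptions, at any viewport width of 768 px or more the .visible-xs copy of the 'Next' link is hidden and the .hidden-xs copy is the one that can be clicked.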

gonssal commented 2 years ago

I tried modifying it and the error is gone, thank you. Honestly, I didn't realize that you can't click elements with display: none.

Now I get an operation timed out error, but I guess that's a separate issue.
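
On the display: none point: a click driven over the Chrome DevTools Protocol needs on-screen coordinates, which are derived from the element's content quads (each quad is eight numbers, the x and y of four corners). An element that is not rendered has no quads, which is consistent with the "Could not compute content quads" error reported at the top of this issue. The sketch below is a simplified model of that step, not Ferret's actual implementation; `click_point` is a hypothetical helper.

```python
from typing import List, Tuple

# One quad: x1, y1, x2, y2, x3, y3, x4, y4 (four corner points).
Quad = List[float]


def click_point(quads: List[Quad]) -> Tuple[float, float]:
    """Return the center of the first quad, i.e. where a click would land.

    Raises RuntimeError when there are no quads, mimicking what happens
    for an element that is not rendered (for example display: none).
    """
    if not quads:
        raise RuntimeError("Could not compute content quads.")
    quad = quads[0]
    if len(quad) != 8:
        raise ValueError("a quad must contain exactly 8 numbers")
    xs = quad[0::2]  # x coordinates of the four corners
    ys = quad[1::2]  # y coordinates of the four corners
    return (sum(xs) / 4, sum(ys) / 4)


if __name__ == "__main__":
    # A rendered 100x40 box whose top-left corner is at (10, 20).
    print(click_point([[10, 20, 110, 20, 110, 60, 10, 60]]))
    try:
        click_point([])  # a hidden element: no quads at all
    except RuntimeError as err:
        print("click failed:", err)
```

In this model the fix from the previous comments works because the .hidden-xs element is actually laid out, so it has a quad to click on.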