MontFerret / ferret

Declarative web scraping
https://www.montferret.dev/
Apache License 2.0

[Feature] Hide popups #644

Open suntong opened 3 years ago

suntong commented 3 years ago

Is your feature request related to a problem? Please describe.

Some sites have a first layer of "popups" that we need to deal with before we can access their content. For example, if you visit https://finviz.com/news.ashx using Chrome/Chromium via the Chrome DevTools Protocol and take a screenshot, you will see such "popups".

Describe the solution you'd like

There should be an option to "hide" such "popups". It's doable, and the technology is well known.

Here is a screenshot showing that browshot.com is able to "hide popups"; the option is under Advanced Options: https://browshot.com/share/fBn1IghXgzWao2h7T8jk

Describe alternatives you've considered

There are no alternatives for getting around such "popups" other than "hiding" them.

Additional context

I can further clarify what I meant by "It's doable, and the technology is well known" if needed.
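As a rough illustration of what "hiding popups" could mean in practice, here is a minimal sketch of one common heuristic: treat visible elements that are pinned to the viewport with a high z-index as overlays, and force them to display: none. This is an assumption about the technique, not a description of what browshot.com or Ferret actually does. The names looksLikeOverlay, hideOverlays, StyleLike, ElementLike and the threshold of 1000 are all hypothetical.

```typescript
// Minimal subset of a computed style that the heuristic needs.
interface StyleLike {
  position: string;
  zIndex: string;
  display: string;
}

// Minimal subset of an element: something whose inline style can be set.
interface ElementLike {
  style: { setProperty(name: string, value: string, priority?: string): void };
}

// Heuristic (an assumption, not a standard): a visible element that is
// position: fixed and has a numeric z-index at or above the threshold
// is treated as an overlay. The threshold of 1000 is arbitrary.
function looksLikeOverlay(s: StyleLike, minZIndex: number = 1000): boolean {
  const z = parseInt(s.zIndex, 10); // "auto" parses to NaN
  return s.position === "fixed" && s.display !== "none" && !isNaN(z) && z >= minZIndex;
}

// Hides every element the heuristic flags and returns how many were hidden.
// styleOf supplies the computed style for an element.
function hideOverlays<E extends ElementLike>(
  elements: ArrayLike<E>,
  styleOf: (el: E) => StyleLike
): number {
  let hidden = 0;
  for (let i = 0; i < elements.length; i++) {
    const el = elements[i];
    if (looksLikeOverlay(styleOf(el))) {
      // "important" so that the page's own stylesheet rules do not win.
      el.style.setProperty("display", "none", "important");
      hidden++;
    }
  }
  return hidden;
}
```

In a page context this could be called as hideOverlays(document.querySelectorAll<HTMLElement>("body *"), el => getComputedStyle(el)) before taking the screenshot. The heuristic will produce false positives (for example, fixed navigation bars with a high z-index) and will miss overlays that are positioned differently, so it is a starting point only.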

suntong commented 3 years ago

Here is the test code:

LET doc = DOCUMENT('https://finviz.com/news.ashx', {
    driver: 'cdp'
})

WAIT_ELEMENT(doc, '#news > div > table > tbody > tr:nth-child(2) > td:nth-child(1) > table > tbody', 5000)

LET tracks = ELEMENTS(doc, '#news > div > table > tbody > tr:nth-child(2) > td:nth-child(1) > table > tbody')

And the error will be:

"error": "run program: cdp.Runtime: Evaluate: rpc error: Cannot find context with specified id (code = -32000): WAIT_ELEMENT(doc,'#news > div > table > tbody > tr:nth-child(2) > td:nth-child(1) > table > tbody',5000) at 7:0"

ziflex commented 3 years ago

Yeah, this is one of the limitations of the CDP driver: there is no API that would allow you to attach to an opened pop-up window. I will address this feature in a later release.

Meanwhile, you may try to extend the CDP driver on your own and explore how it could be implemented. I guess the problem might be in finding the right window if there are other queries in progress being handled by the same browser instance.