TeamHG-Memex / undercrawler

A generic crawler

download out-of-domain iframes #40

Open kmike opened 8 years ago

kmike commented 8 years ago

When an iframe is in a page, it makes sense to get its content even if it is not in an allowed domain. Maybe we shouldn't follow links in this case, though.

lopuhin commented 8 years ago

Yeah, unless the iframe links point to the parent domain. I think ideally the iframe should be included in the parent page's contents in this case.
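A rough sketch of the policy discussed so far, just to make it concrete. It is not undercrawler code: `in_allowed_domain` and `should_fetch` are hypothetical helper names, and the subdomain matching rule is an assumption.

```python
from urllib.parse import urlparse


def in_allowed_domain(url, allowed_domains):
    """True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper; the matching rule is an assumption.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def should_fetch(url, is_iframe_src, allowed_domains):
    """Decide whether a URL should be downloaded.

    - URLs in an allowed domain are always fetched.
    - An iframe src is fetched even when it is out of domain.
    - Any other out-of-domain URL is skipped.
    """
    if in_allowed_domain(url, allowed_domains):
        return True
    return is_iframe_src
```

Links extracted from an out-of-domain iframe would go through the same check with `is_iframe_src=False`, so they are followed only when they point back to an allowed domain.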

kmike commented 8 years ago

Yeah, I agree. There is a QWebSettings::FrameFlatteningEnabled option (http://qutebrowser.org/tmp/qtdoc-linktitle/qwebsettings.html); maybe it could work for Splash. Alternatively, there is an API to go into iframes in QtWebKit (but not in the upcoming QtWebEngine); it is already used by the render.json endpoint, and we can create a Lua API for that if FrameFlatteningEnabled doesn't work.
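For illustration, a sketch of how iframe content could be collected from a render.json-style response. It assumes the response nests frames under a `childFrames` key with `url` and `html` fields, as I recall from the Splash docs; `iter_frame_html` is a hypothetical helper and the sample data is made up.

```python
def iter_frame_html(frame):
    """Yield (url, html) for a frame and all nested child frames, depth-first.

    Assumes each frame is a dict with optional "url", "html" and
    "childFrames" keys (an assumption about the response layout).
    """
    yield frame.get("url"), frame.get("html", "")
    for child in frame.get("childFrames", []):
        yield from iter_frame_html(child)


# Made-up sample shaped like the assumed response, not real Splash output.
sample = {
    "url": "http://example.com/",
    "html": "<html>parent</html>",
    "childFrames": [
        {
            "url": "http://other.org/frame",
            "html": "<html>iframe</html>",
            "childFrames": [],
        },
    ],
}

frames = list(iter_frame_html(sample))
```

Here `frames` holds the parent page first, followed by each iframe, so the iframe HTML is available even though it is not merged into the parent document.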

kmike commented 8 years ago

The frame flattening option doesn't seem to work. I've tried it here: https://github.com/scrapinghub/splash/commit/8eb45d89d1d798695d2f17e92f69647c15dca27a. With it enabled, splash:html() still doesn't include the HTML content of iframes.