daijro / hrequests

🚀 Web scraping for humans
https://daijro.gitbook.io/hrequests/
Apache License 2.0

Content is not fully loaded #40

Open mir3u opened 4 months ago

mir3u commented 4 months ago

I am testing this library's browser automation on some websites, and I have observed that on many of them lazy-loaded content does not fully load (images, and JS scripts that may be needed to load the page). I was wondering what might cause this issue.

daijro commented 4 months ago

This issue happens when rendering a Response into a BrowserSession:

resp = session.get('https://www.somewebsite.com/')
page = resp.render()

When setting the content of a page, Playwright doesn't seem to automatically load images and scripts in the website until the page is interacted with. A temporary solution for now is to reload the page immediately after rendering.
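The reload workaround described above could be sketched roughly as follows. This is only an illustration: `render_then_reload` is a hypothetical helper name, and it assumes the object returned by `resp.render()` exposes a `reload()` method, which should be checked against the installed hrequests version.

```python
def render_then_reload(resp):
    """Render a response into a browser page, then reload it once.

    `resp` is expected to be a response object with a render() method,
    as in the snippet above.
    """
    page = resp.render()
    # ASSUMPTION: the rendered page object has a reload() method.
    # Reloading makes the browser fetch the page itself, so images and
    # scripts get a chance to load instead of staying unloaded.
    page.reload()
    return page
```

Usage would be along the lines of `page = render_then_reload(session.get('https://www.somewebsite.com/'))`.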

I am currently working on a fix for this issue in v0.9.0, which will launch an intermediate locally hosted server for Playwright that will serve the contents of the page, and hopefully allow it to fully render.
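To make the idea of an intermediate local server concrete, here is a minimal sketch using only the Python standard library. It is not the v0.9.0 implementation, and `serve_html` is a hypothetical name: it simply serves an already-fetched HTML string on a free localhost port, so that a browser can navigate to a real URL instead of having its content set directly.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def serve_html(html: str):
    """Serve `html` from a throwaway local server; return (server, url)."""
    body = html.encode("utf-8")

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Every GET path receives the same stored page body.
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # silence the default per-request logging

    # Port 0 asks the OS for any free port.
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}/"
```

A browser page could then be pointed at the returned URL; call `server.shutdown()` and `server.server_close()` once rendering is finished. Note that relative URLs in the served HTML would resolve against localhost in this sketch, so a real implementation would need to handle that.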

Foxtrod89 commented 1 month ago

I'm getting the same problem: JavaScript is not fully loaded. I tried using the context manager and get something like <noscript>You need to enable JavaScript to run this app.</noscript>

import hrequests

with hrequests.BrowserSession(browser='chrome', headless=False) as session:
    response = session.get('https://egov.uscis.gov/')
    print(response.content)