HtmlUnit / htmlunit

HtmlUnit is a "GUI-Less browser for Java programs".
https://www.htmlunit.org
Apache License 2.0

How to fetch instagram profile page with HtmlUnit #625

Open dulocart opened 1 year ago

dulocart commented 1 year ago

I am trying to fetch an Instagram profile page without authentication to check whether the user has a profile image. It appears that when you load the page it shows a splash screen, and you can see the page source. Any help on how to deal with this and get the page source, so I can check whether the user has a profile picture, would be appreciated.

My settings are:

webClient.getOptions().setJavaScriptEnabled(false); // enabling it throws an exception
webClient.getOptions().setCssEnabled(false);

webClient.addRequestHeader("User-Agent".....)

HtmlPage page = webClient.getPage("https://www.instagram.com/profile_username/?hl=us");
Thread.sleep(10000);
page.refresh();

rbri commented 1 year ago

@intuitonlabs Sorry, but it looks like the current level of JS support (based on https://github.com/mozilla/rhino) is not capable of loading this page, which is full of fancy JS stuff (no static content at all; everything is generated by JS).

We are working hard to improve this, but I can't promise to have it working in the near future. (Any help is welcome.)

dulocart commented 1 year ago

Okay, that makes sense; this page relies heavily on JavaScript. If there is any way to load it partially, I will be happy. I only need the profile picture DOM element to check whether the user has a profile image.
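
One possible way to approach such a partial load, sketched under assumptions that this thread does not confirm: if the server-rendered HTML happens to contain an `og:image` meta tag, the image URL could be read from the raw response without running any JavaScript. With HtmlUnit the raw text is available through `page.getWebResponse().getContentAsString()`. The class name `OgImageProbe`, the helper `extractOgImage`, the attribute order in the pattern, and the sample markup are all hypothetical; whether Instagram actually serves this tag to unauthenticated clients is untested here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: scans raw HTML text for an og:image meta tag.
class OgImageProbe {

    // Assumption: the tag is written as <meta property="og:image" content="...">,
    // with the attributes in exactly this order.
    private static final Pattern OG_IMAGE = Pattern.compile(
            "<meta\\s+property=[\"']og:image[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the content attribute of the first matching tag, if any.
    static Optional<String> extractOgImage(String html) {
        Matcher m = OG_IMAGE.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample markup, not a real Instagram response.
        String sample = "<html><head>"
                + "<meta property=\"og:image\" content=\"https://example.invalid/pic.jpg\">"
                + "</head><body></body></html>";
        System.out.println(extractOgImage(sample).orElse("(no og:image tag found)"));
    }
}
```

Even if the tag is present, telling a default avatar from a custom picture would need a further check (for example comparing the URL against a known placeholder), which is also an assumption rather than something established in this thread.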

dulocart commented 1 year ago

@rbri Is Rhino still in development? I see the last release was 2 years ago. I am not familiar enough with JavaScript to help here. Thank you for the answer; if this gets done in the future, that will be great. For now I will look for another solution. I don't want to use Selenium for this, as it spends a lot of resources (bandwidth and CPU) to load the web page.

rbri commented 1 year ago

@intuitonlabs The story with Rhino is a bit more complex. HtmlUnit uses a customized (and relabeled) version of the current Rhino sources.

the flow is usually

In general, Rhino has a really slow release cycle, but we always use the head version, and sometimes we already merge PRs that have not been merged into Rhino head yet (e.g. https://github.com/mozilla/rhino/pull/1332).

Hope that clarifies the situation a bit.