HtmlUnit / htmlunit

HtmlUnit is a "GUI-Less browser for Java programs".
https://www.htmlunit.org
Apache License 2.0

Getting different HTML page than in Chrome Browser #508

Open piyush302 opened 1 year ago

piyush302 commented 1 year ago

I am loading this page: https://www.amazon.com/s?bbn=13707&rh=n%3A283155%2Cn%3A173507%2Cn%3A173515%2Cn%3A227544%2Cn%3A13707%2Cn%3A13723&dc&qid=1665651390&rnid=13707&ref=lp_13707_nr_n_1 but the response I get differs from what I get in the Chrome browser.

My code:

    WebClient client = new WebClient(BrowserVersion.CHROME);
    String baseUrl = "https://www.amazon.com/s?bbn=13707&rh=n%3A283155%2Cn%3A173507%2Cn%3A173515%2Cn%3A227544%2Cn%3A13707%2Cn%3A13723&dc&qid=1665651390&rnid=13707&ref=lp_13707_nr_n_1";

    // Do not abort the page load when a script throws
    client.getOptions().setThrowExceptionOnScriptError(false);
    // Note: this adds a request header whose name and value are both the URL
    client.addRequestHeader(baseUrl, baseUrl);
    client.setAjaxController(new NicelyResynchronizingAjaxController());

    try {
        FileWriter test = new FileWriter("test.html");
        HtmlPage page = client.getPage(baseUrl);
        // Wait up to 5 seconds for background JavaScript to finish
        client.waitForBackgroundJavaScript(5000);
        // Dump the resulting DOM as XML
        test.write(page.asXml());
        test.flush();
        System.out.println("done");
    } catch (Exception e) {
        e.printStackTrace();
    }

Response returned by the code: test.txt
Expected response: chrome payload.txt

The response from the code is clearly missing everything except the header and footer content.

rbri commented 1 year ago

Did a short test here and it looks like most of the content is there (also in your test.txt). You can use page.asNormalizedText() to get the (visible) content.

piyush302 commented 1 year ago

@rbri No, test.txt has only the header and footer content; the main content is missing. Also, test.txt has only 3k lines, while the response I got from the Chrome browser has 25k lines. You can also change the extension of the text files to .html to visualize the content. Attaching screenshots: test.txt view: Screenshot (16); chrome payload.txt view: Screenshot (17).

It's clear from the screenshots that the body content is missing.
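
One way to narrow down the difference between the two dumps (3k lines versus 25k lines) is to list which element ids appear in the Chrome dump but not in the HtmlUnit dump. Below is a minimal sketch using only the JDK (Java 11+). The class name `PageDumpDiff` and its helper methods are hypothetical and not part of HtmlUnit, and the regex-based id extraction is a rough heuristic rather than a real HTML parser.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageDumpDiff {
    // Matches id="..." or id='...' but skips attributes such as data-id (heuristic only).
    private static final Pattern ID_ATTR =
            Pattern.compile("(?<![\\w-])id\\s*=\\s*[\"']([^\"']+)[\"']");

    // Collects all id attribute values found in the given markup, sorted.
    static Set<String> extractIds(String html) {
        Set<String> ids = new TreeSet<>();
        Matcher m = ID_ATTR.matcher(html);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    // Returns the ids present in the expected markup but absent from the actual markup.
    static Set<String> missingIds(String expectedHtml, String actualHtml) {
        Set<String> missing = extractIds(expectedHtml);
        missing.removeAll(extractIds(actualHtml));
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PageDumpDiff <actual-dump> <expected-dump>");
            return;
        }
        // Files.readString assumes UTF-8 encoded dumps.
        String actual = Files.readString(Path.of(args[0]));
        String expected = Files.readString(Path.of(args[1]));
        System.out.println("actual lines:   " + actual.lines().count());
        System.out.println("expected lines: " + expected.lines().count());
        System.out.println("ids only in the expected dump:");
        missingIds(expected, actual).forEach(id -> System.out.println("  " + id));
    }
}
```

For example, `java PageDumpDiff.java test.txt "chrome payload.txt"` prints both line counts followed by the ids that exist only in the Chrome dump, which may hint at which container the scripts failed to fill.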

rbri commented 1 year ago

@piyush302 Ah sorry, I had a look at the HTML headers :-)

I have made a small fix to avoid the error visible in the logs, so opening the page now works without errors. This means HtmlUnit is failing to trigger some of the JS. I fear it will be a really long way to figure out what goes wrong here. It would be great if you could do that; I can help with some tips on how to nail down the root of the problem (and of course try to fix it if possible).

What do you think?

piyush302 commented 1 year ago

@rbri Sure, I have a few observations. When I fetched the same page using the Selenium Chrome driver, the first page load gave me the same response (only header and footer); when I reloaded the page, I got everything. So I tried the same with HtmlUnit, but it still didn't work even after page.refresh(). In Selenium, I started getting the correct page after calling driver.navigate().refresh(). This is the Selenium code I used:

    public static void main(String[] args) throws IOException, InterruptedException {
        WebDriver driver = null;
        WebDriverManager.chromedriver().browserVersion("77.0.3865.40").setup();
        ChromeOptions options = new ChromeOptions();
        options.addArguments("start-maximized");
        options.addArguments("enable-automation");
        options.addArguments("--no-sandbox");
        options.addArguments("--disable-infobars");
        options.addArguments("--disable-dev-shm-usage");
        options.addArguments("--disable-browser-side-navigation");
        options.addArguments("--disable-gpu");
        options.addArguments("--headless");
        driver = new ChromeDriver(options);
        FileWriter leafNodes = new FileWriter("test.html");

        driver.get(
                "https://www.amazon.com/s?bbn=13707&rh=n%3A283155%2Cn%3A173507%2Cn%3A173515%2Cn%3A227544%2Cn%3A13707%2Cn%3A13723&dc&qid=1665651390&rnid=13707&ref=lp_13707_nr_n_1");
        driver.navigate().refresh();
        leafNodes.write(driver.getPageSource());
        leafNodes.flush();
        System.out.println("done");

    }

Dependencies:

    <dependency>
        <groupId>io.github.bonigarcia</groupId>
        <artifactId>webdrivermanager</artifactId>
        <version>5.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-chrome-driver -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-chrome-driver</artifactId>
        <version>4.5.0</version>
    </dependency>

I was also debugging the JS in Chrome developer tools. I suspect the function on line 259 of test.txt is not being executed properly.

Also please let me know how I can help identify the problem here.