Open piyush302 opened 1 year ago
Did a short test here and it looks like most of the content is there (also in your test.txt). You can use page.asNormalizedText() to get the (visible) content.
@rbri No, test.txt only has the header and footer content; the main content is missing. Also, test.txt has only about 3k lines, while the response I got from the Chrome browser has about 25k lines. You can rename the text files to a .html extension to visualize the content. Attaching screenshots: the test.txt view and the chrome payload.txt view.
It's clear from the screenshots that the body content is missing.
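To put a number on the size difference mentioned above, here is a minimal sketch that counts the lines of the two saved responses and reports how much of the Chrome response is covered. The class name `ResponseSizeCheck` and its methods are hypothetical helpers, and the default file names are assumptions taken from the attachment names in this thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of HtmlUnit or Selenium.
class ResponseSizeCheck {

    /** Counts the lines in a saved response file. */
    public static long countLines(Path file) throws IOException {
        // ISO-8859-1 decodes any byte sequence, which is enough for counting lines.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
            return lines.count();
        }
    }

    /** Ratio of candidate lines to reference lines (0.0 if the reference is empty). */
    public static double coverage(long candidateLines, long referenceLines) {
        return referenceLines == 0 ? 0.0 : (double) candidateLines / referenceLines;
    }

    public static void main(String[] args) throws IOException {
        // Default file names are assumptions based on the attachments in this thread.
        Path htmlUnit = Paths.get(args.length > 0 ? args[0] : "test.txt");
        Path chrome = Paths.get(args.length > 1 ? args[1] : "chrome payload.txt");
        long a = countLines(htmlUnit);
        long b = countLines(chrome);
        System.out.printf("HtmlUnit: %d lines, Chrome: %d lines, coverage: %.0f%%%n",
                a, b, 100 * coverage(a, b));
    }
}
```

A line count is only a rough signal, since the same markup can be wrapped differently, but a gap as large as 3k versus 25k lines is hard to explain by formatting alone.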
@piyush302 Ah sorry, I had a look at the HTML headers :-)
I have made a small fix to avoid the error visible in the logs. Opening the page now works without errors, yet the main content is still missing, which means some JS triggering is missing in HtmlUnit. I fear it will be a really long way to figure out what goes wrong here. It would be great if you could do that; I can help with some tips on how to nail down the root of the problem (and of course try to fix it if possible).
What do you think?
@rbri Sure. I have a few observations. When I fetched the same page using the Selenium Chrome driver, the first page load gave me the same response (only header and footer); after calling driver.navigate().refresh() I started getting the correct page with everything in it. I tried the same with HtmlUnit, but even after page.refresh() it didn't work. This is the Selenium code I used:
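    import io.github.bonigarcia.wdm.WebDriverManager;
    import java.io.FileWriter;
    import java.io.IOException;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;

    // The class name is arbitrary; the original snippet only showed main().
    public class SeleniumReloadTest {
        public static void main(String[] args) throws IOException {
            WebDriverManager.chromedriver().browserVersion("77.0.3865.40").setup();
            ChromeOptions options = new ChromeOptions();
            options.addArguments("start-maximized");
            options.addArguments("enable-automation");
            options.addArguments("--no-sandbox");
            options.addArguments("--disable-infobars");
            options.addArguments("--disable-dev-shm-usage");
            options.addArguments("--disable-browser-side-navigation");
            options.addArguments("--disable-gpu");
            options.addArguments("--headless");
            WebDriver driver = new ChromeDriver(options);
            // try-with-resources closes the writer; finally shuts down the browser.
            try (FileWriter leafNodes = new FileWriter("test.html")) {
                driver.get(
                        "https://www.amazon.com/s?bbn=13707&rh=n%3A283155%2Cn%3A173507%2Cn%3A173515%2Cn%3A227544%2Cn%3A13707%2Cn%3A13723&dc&qid=1665651390&rnid=13707&ref=lp_13707_nr_n_1");
                // The first load only returns header and footer; the reload returns the full page.
                driver.navigate().refresh();
                leafNodes.write(driver.getPageSource());
            } finally {
                driver.quit();
            }
            System.out.println("done");
        }
    }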
Dependencies:

    <dependency>
        <groupId>io.github.bonigarcia</groupId>
        <artifactId>webdrivermanager</artifactId>
        <version>5.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.seleniumhq.selenium/selenium-chrome-driver -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-chrome-driver</artifactId>
        <version>4.5.0</version>
    </dependency>
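The reload workaround described above can be written so that it does not depend on a specific browser driver. The following is only a sketch under assumptions: `ReloadUntilComplete` and `fetch` are hypothetical names, and the completeness check is whatever marker you know appears only in the full page (which marker to use would have to be verified against the Chrome payload).

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical helper, independent of Selenium and HtmlUnit.
class ReloadUntilComplete {

    /**
     * Calls load once, then reload up to maxReloads times until looksComplete
     * accepts the page source. Returns the last page source seen, complete or not.
     */
    public static String fetch(Supplier<String> load, Supplier<String> reload,
                               Predicate<String> looksComplete, int maxReloads) {
        String source = load.get();
        for (int i = 0; i < maxReloads && !looksComplete.test(source); i++) {
            source = reload.get();
        }
        return source;
    }

    public static void main(String[] args) {
        // Stand-in for a browser: the first load is incomplete, the reload is complete.
        String partial = "<header/><footer/>";
        String full = "<header/><main>results</main><footer/>";
        String result = fetch(() -> partial, () -> full, s -> s.contains("<main>"), 2);
        System.out.println(result.contains("<main>") ? "complete" : "still incomplete");
    }
}
```

With the Selenium code from this comment, `load` would call `driver.get(url)` and return `driver.getPageSource()`, and `reload` would call `driver.navigate().refresh()` and return `driver.getPageSource()`. The same wrapper could be pointed at HtmlUnit to confirm that a reload does not help there.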
I was also debugging the JS in Chrome developer tools. My guess is that the function on line 259 of test.txt is not being executed properly.
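To share the suspicious script without attaching the whole file, the lines around line 259 can be extracted with a few lines of Java. This is a rough sketch: `LineContext` and `around` are hypothetical names, and the file is assumed to be UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for quoting a few lines of a saved response.
class LineContext {

    /** Returns lines lineNo - radius through lineNo + radius (1-based, clamped to the file). */
    public static List<String> around(Path file, int lineNo, int radius) throws IOException {
        long first = Math.max(1, lineNo - radius);
        long count = Math.max(0, lineNo + radius - first + 1);
        // Assumes the file is valid UTF-8.
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.skip(first - 1).limit(count).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // File name and line number are taken from this thread; the radius is arbitrary.
        for (String line : around(Paths.get("test.txt"), 259, 3)) {
            System.out.println(line);
        }
    }
}
```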
Also please let me know how I can help identify the problem here.
I am loading this page with HtmlUnit: https://www.amazon.com/s?bbn=13707&rh=n%3A283155%2Cn%3A173507%2Cn%3A173515%2Cn%3A227544%2Cn%3A13707%2Cn%3A13723&dc&qid=1665651390&rnid=13707&ref=lp_13707_nr_n_1 but the response I get is different from what I get in the Chrome browser.
My Code `
Response returned by the code: test.txt. Expected response: chrome payload.txt.
The response from the code is clearly missing everything except the header and footer content.