SeleniumHQ / htmlunit-driver

WebDriver compatible driver for HtmlUnit headless browser.
Apache License 2.0

How to set proxy authorization with username/password in an Ubuntu Server 18.04 environment #132

Open luorixiangyang opened 1 year ago

luorixiangyang commented 1 year ago

How can I set proxy authorization with a username and password in an Ubuntu Server 18.04 environment? I found lots of examples, but none of them solve my requirement, which is to scrape a page like this one: https://developer.apple.com/documentation/accelerate/bnns/shape/3656199-init

thanks!

rbri commented 1 year ago

Looks like there is something missing ;-) will have a deeper look

rbri commented 1 year ago

As a workaround, can you please try something like this:

String PROXY_HOST = ....;
int PROXY_PORT = .....;

WebDriver webDriver = new HtmlUnitDriver(BrowserVersion.FIREFOX, true) {
    @Override
    protected WebClient modifyWebClient(WebClient client) {
        final WebClient webClient = super.modifyWebClient(client);

        // Route all requests through the proxy
        webClient.getOptions().setProxyConfig(new ProxyConfig(PROXY_HOST, PROXY_PORT, null));
        // Register the proxy credentials for that host and port
        final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient.getCredentialsProvider();
        credentialsProvider.addCredentials("username", "password", PROXY_HOST, PROXY_PORT);

        return webClient;
    }
};
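
For context, here is a minimal sketch of what such credentials typically become on the wire, assuming the proxy uses the HTTP Basic scheme (the proxy in question may use a different scheme such as Digest or NTLM). The class and method names are hypothetical and are not part of HtmlUnit; only the JDK's own `Base64` is used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of HtmlUnit or Selenium.
public class ProxyAuthHeaderSketch {

    // Builds the value of a Proxy-Authorization header for the Basic scheme:
    // "Basic " followed by base64("username:password").
    public static String basicProxyAuthorization(String username, String password) {
        String pair = username + ":" + password;
        return "Basic " + Base64.getEncoder().encodeToString(pair.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Placeholder credentials, matching the literals used in the workaround above.
        System.out.println("Proxy-Authorization: " + basicProxyAuthorization("username", "password"));
        // prints: Proxy-Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
    }
}
```

Because Basic credentials are only encoded, not encrypted, anyone who can read the traffic between client and proxy can recover them.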
luorixiangyang commented 1 year ago

Here are the details. The pom.xml dependencies are as follows:

...
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.10.0</version>
    </dependency>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>htmlunit-driver</artifactId>
        <version>4.10.0</version>
    </dependency>
...

The source code is as follows:

public static WebDriver createProxyWebDriver() {
    String PROXY_HOST = ProxyHost;
    int PROXY_PORT = ProxyPort;

    // Configure the WebDriver to go through the authenticating proxy
    WebDriver webDriver = new HtmlUnitDriver(BrowserVersion.FIREFOX, true) {
        @Override
        protected WebClient modifyWebClient(WebClient client) {
            final WebClient webClient = super.modifyWebClient(client);

            webClient.getOptions().setProxyConfig(new ProxyConfig(PROXY_HOST, PROXY_PORT, null));
            final DefaultCredentialsProvider credentialsProvider = (DefaultCredentialsProvider) webClient
                    .getCredentialsProvider();
            credentialsProvider.addCredentials(ProxyUser, ProxyPass, PROXY_HOST, PROXY_PORT, null);

            return webClient;
        }
    };
    return webDriver;
}

public static String getPageOnDynamicWeb(String url) {
    WebDriver client = createProxyWebDriver();
    client.get(url);
    String response = client.getPageSource();
    client.close();
    return response;
}

public static void main(String[] args) throws Exception {
    String response = "";
    String url = "https://developer.apple.com/documentation/accelerate/bnns/shape/3656199-init"; // the target url
    response = getPageOnDynamicWeb(url);
    ClearInnerToWriteFile(
            "/home/luori/_fly/workspaces/javaworkspace/selenium-base/logs/apple_api_page_html.html", response);
}

Running the code above throws an exception like the following:

......
Caused by: net.sourceforge.htmlunit.corejs.javascript.EvaluatorException: invalid property id (https://developer.apple.com/tutorials/js/chunk-vendors.fc64ed7e.js#10)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory$HtmlUnitErrorReporter.error(HtmlUnitContextFactory.java:435)
    at net.sourceforge.htmlunit.corejs.javascript.Parser.addError(Parser.java:257)
    at net.sourceforge.htmlunit.corejs.javascript.Parser.reportError(Parser.java:336)
    at net.sourceforge.htmlunit.corejs.javascript.Parser.reportError(Parser.java:327)
    at net.sourceforge.htmlunit.corejs.javascript.Parser.reportError(Parser.java:320)
    at net.sourceforge.htmlunit.corejs.javascript.Parser.objectLiteral(Parser.java:3499)
......

luorixiangyang commented 1 year ago

Environment:
- OS: Ubuntu Server 18.04
- google-chrome: Google Chrome 114.0.5735.133
- ChromeDriver: 114.0.5735.90
- JDK: 1.8.0_271

luorixiangyang commented 1 year ago

Please try the target URL https://developer.apple.com/documentation/accelerate/bnns/shape/3656199-init to test the correct approach.
Thanks!

luorixiangyang commented 1 year ago

I need to point out that https://developer.apple.com/documentation/accelerate/bnns/shape/3656199-init is dynamic web content, so its JavaScript files have to be executed during scraping. I can get the static web content but can't capture the dynamic parts.
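
To illustrate the static half of that distinction, here is a minimal JDK-only sketch (it also runs on Java 8) that fetches raw HTML through an authenticating HTTP proxy. It does not execute any JavaScript, so it returns only what the server sends. The class name, method names, and command-line arguments are hypothetical. It assumes the proxy uses Basic authentication, which the JDK has disabled by default for HTTPS tunnels since 8u111, hence the system property.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: static fetch only, no JavaScript is executed.
public class StaticProxyFetch {

    // Describes an HTTP proxy; no connection or DNS lookup happens here.
    public static Proxy buildProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Registers proxy credentials JVM-wide. Call this before the first HTTP request.
    public static void installProxyCredentials(final String user, final String pass) {
        // Assumption: the proxy uses Basic auth. Re-enable it for HTTPS tunnels.
        System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "");
        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                // Only answer challenges that come from the proxy, not from origin servers.
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, pass.toCharArray());
                }
                return null;
            }
        });
    }

    // Downloads the response body as text through the given proxy.
    public static String fetch(String url, Proxy proxy) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection(proxy);
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(30_000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 5) {
            System.err.println("usage: StaticProxyFetch <proxyHost> <proxyPort> <user> <pass> <url>");
            return;
        }
        installProxyCredentials(args[2], args[3]);
        System.out.println(fetch(args[4], buildProxy(args[0], Integer.parseInt(args[1]))));
    }
}
```

For a page whose content is built by scripts, the HTML returned this way will mostly be an empty shell, which is why a JavaScript-capable driver is needed for the dynamic parts.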

rbri commented 1 year ago

Had a deeper look, and there are several problems with this page. Long story short: HtmlUnit does not support the whole of modern JavaScript syntax (because it is based on Rhino). We are working on improving this, but I fear there will be no real progress until the end of this year.

Two options: help us to improve Rhino, or use Selenium with real browsers.

luorixiangyang commented 1 year ago

Got it! I will also check whether I can contribute to HtmlUnit to improve this.