Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Using Chrome Driver with Norconex inside Docker container #844

Open milos-slalom opened 11 months ago

milos-slalom commented 11 months ago

I have an application that uses Norconex version 3.0.2. For JavaScript sites I use the WebDriverHttpFetcher with the browser set to Chrome. When running locally, everything works fine.

However, when running inside a Docker container, I get errors starting the Chrome driver. The most relevant are:

[1690595716.740][SEVERE]: bind() failed: Cannot assign requested address (99)
Caused by: org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 500. Message: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)

Searching for these errors, I found that most comments point to the need to set the --no-sandbox option on the Chrome driver.

I tested this by writing some Selenium code directly in my app and running it inside the container. Without --no-sandbox the code fails; with the option it works.
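As a rough illustration of that kind of check, here is a minimal sketch of how the Chrome arguments could be assembled so that --no-sandbox is added only when requested. The class name `ChromeArgsSketch`, the method `chromeArguments`, and the environment variable `CHROME_NO_SANDBOX` are hypothetical names chosen for this example; they are not part of Norconex or Selenium. The Selenium calls appear only in comments so the snippet compiles without Selenium on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class ChromeArgsSketch {

    // Builds the Chrome command-line arguments for a headless run.
    // The noSandbox flag is a hypothetical toggle, not a Norconex setting.
    static List<String> chromeArguments(boolean noSandbox) {
        List<String> args = new ArrayList<>();
        args.add("--headless");
        if (noSandbox) {
            // Needed in the Docker container described above.
            args.add("--no-sandbox");
        }
        return args;
    }

    public static void main(String[] unused) {
        // CHROME_NO_SANDBOX is an assumed environment variable name.
        boolean noSandbox =
                "true".equalsIgnoreCase(System.getenv("CHROME_NO_SANDBOX"));
        List<String> args = chromeArguments(noSandbox);
        System.out.println(args);

        // With Selenium on the classpath, the list could be applied like this:
        //   ChromeOptions options = new ChromeOptions();
        //   options.addArguments(args);
        //   WebDriver driver = new ChromeDriver(options);
    }
}
```

Setting the variable only in the container image would keep the sandbox enabled for local runs, where Chrome already starts without the flag.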

Looking at the Norconex documentation, I could not find a way to set this option. I ended up downloading the Norconex source code and adding one line to the following file: /collector-http/src/main/java/com/norconex/collector/http/fetch/impl/webdriver/Browser.java

public enum Browser {

    CHROME() {
        @Override
        WebDriverSupplier driverSupplier(
                WebDriverLocation location,
                Consumer<MutableCapabilities> optionsConsumer) {
            ChromeOptions options = new ChromeOptions();
            options.setHeadless(true);
            options.addArguments("--no-sandbox"); // the one line I added
            ofNullable(location.getBrowserPath()).ifPresent(
                    p -> options.setBinary(p.toFile()));
            optionsConsumer.accept(options);
            return new WebDriverSupplier(new WebDriverBuilder()
                .driverClass(ChromeDriver.class)
                .driverSystemProperty(CHROME_DRIVER_EXE_PROPERTY)
                .location(location)
                .options(options));
        }
    },

I compiled a new version of the collector and included it in my application, and everything now runs fine inside the container.

I wanted to raise this issue for your awareness and to ask whether you plan to add this as an option that can be passed to the WebDriverHttpFetcher. If not, would you consider doing so?

Thanks.

ohtwadi commented 11 months ago

Thank you for your contribution. We will look at including this in a future release.

stale[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

XjSv commented 1 month ago

@ohtwadi This is still not included in the latest version. I think the issue might have been closed by mistake.