Closed OkkeKlein closed 3 years ago
The solution in #732 does not appear to work for me here. I am missing the FAQs on your page. I told it to wait for elements of your FAQs, such as h6 or app-faq, without success. It is as if the FAQs are not loaded by the web driver. I noticed the specified timeouts are not respected either. I will have to investigate further. Let me know if you find something on your end in the meantime.
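For reference, waiting on an element is usually configured along these lines. This is only a sketch: the fetcher class name comes from the stack traces in this thread, while the `waitForElement` attributes, the `browser` value, and the 10-second limit are assumptions to check against the documentation for your snapshot.

```xml
<fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
  <!-- Assumed browser value -->
  <browser>chrome</browser>
  <!-- Assumed syntax: wait up to 10 seconds for an <app-faq> element to show up -->
  <waitForElement type="tagName" selector="app-faq">10 seconds</waitForElement>
</fetcher>
```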
Before I can try anything with the FAQ, I need to have the setup working with Chrome. Or is it only the wait that needs to be added?
Timeout values do not always work as expected, depending on the web driver used. I managed to find a workaround by having the crawler itself wait for a few seconds while the web driver/browser is rendering. I was able to get your FAQs that way. This can hopefully be a viable solution for #732 and other pages with timing issues. I just released a new snapshot where you can add this to your WebDriverHttpFetcher configuration section:
<threadWait>2 seconds</threadWait>
2 seconds was enough for me.
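To show where that setting might sit, here is a minimal sketch of a fetcher section. The class name is taken from the stack traces in this thread; the `httpFetchers` wrapper and the `browser` value are assumptions about the 3.x configuration layout, so adjust them to your own setup.

```xml
<httpFetchers>
  <fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
    <!-- Assumed browser value; firefox is the other browser used in this thread -->
    <browser>firefox</browser>
    <!-- Crawler thread pauses while the browser finishes rendering -->
    <threadWait>2 seconds</threadWait>
  </fetcher>
</httpFetchers>
```

A fixed wait like this trades crawl speed for reliability, so it is probably worth keeping the value as low as the slowest page allows.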
I managed to get the FAQs with Firefox and threadWait. However, I never get it to work with Google Chrome. I am still using the versions mentioned in the first comment.
Not sure why it does not work for you with Chrome. I am able to crawl it successfully with the exact same Chrome driver and browser versions. I tried on Windows. Does it work for you on Windows? I wonder if you only experience this on a specific OS.
Do you get any errors? What do you get?
On Windows I can crawl with Chrome no problem.
On Linux, using the setting below, I get the following log:
`
<threadWait>2 seconds</threadWait>
10:14:02.445 [Norconex Minimum Test Page/1] INFO CRAWLER_RUN_THREAD_BEGIN - Thread[Norconex Minimum Test Page/1,5,main]
10:14:02.447 [Norconex Minimum Test Page/1] INFO Browser - Creating local "ChromeDriver" web driver.
10:14:02.448 [Norconex Minimum Test Page/2] INFO CRAWLER_RUN_THREAD_BEGIN - Thread[Norconex Minimum Test Page/2,5,main]
Starting ChromeDriver 88.0.4324.96 (68dba2d8a0b149a1d3afac56fa74648032bcf46b-refs/branch-heads/4324@{#1784}) on port 16098
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully.
10:14:03.344 [Norconex Minimum Test Page/2] INFO Browser - Creating local "ChromeDriver" web driver.
10:14:03.344 [Norconex Minimum Test Page/1] ERROR Crawler - Problem in thread execution.
com.norconex.collector.core.CollectorException: Could not build web driver
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.build(Browser.java:237) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverSupplier.get(Browser.java:181) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.WebDriverHolder.getDriver(WebDriverHolder.java:74) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher.fetcherThreadBegin(WebDriverHttpFetcher.java:242) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.AbstractHttpFetcher.accept(AbstractHttpFetcher.java:127) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.AbstractHttpFetcher.accept(AbstractHttpFetcher.java:76) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.commons.lang.event.EventManager.doFire(EventManager.java:136) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.commons.lang.event.EventManager.fire(EventManager.java:117) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.commons.lang.event.EventManager.fire(EventManager.java:111) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:992) [norconex-collector-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_275]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_275]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_275]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_275]
at org.apache.commons.lang3.reflect.ConstructorUtils.invokeExactConstructor(ConstructorUtils.java:182) ~[commons-lang3-3.11.jar:3.11]
at org.apache.commons.lang3.reflect.ConstructorUtils.invokeExactConstructor(ConstructorUtils.java:149) ~[commons-lang3-3.11.jar:3.11]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.lambda$build$0(Browser.java:232) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.commons.lang.SystemUtil.callWithProperty(SystemUtil.java:118) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.build(Browser.java:222) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
... 12 more
Caused by: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Build info: version: 3.141.59 , revision: e82be7d358 , time: 2018-11-14T08:17:03
System info: host: bmc-dev , ip: 127.0.1.1 , os.name: Linux , os.arch: amd64 , os.version: 5.4.0-51-generic , java.version: 1.8.0_275
Driver info: driver.version: ChromeDriver
remote stacktrace: #0 0x559216d4c199
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_275]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_275]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_275]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_275]
at org.openqa.selenium.remote.W3CHandshakeResponse.lambda$errorHandler$0(W3CHandshakeResponse.java:62) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.HandshakeResponse.lambda$getResponseFunction$0(HandshakeResponse.java:30) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.ProtocolHandshake.lambda$createSession$0(ProtocolHandshake.java:126) ~[selenium-remote-driver-3.141.59.jar:?]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_275]
at java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958) ~[?:1.8.0_275]
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126) ~[?:1.8.0_275]
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499) ~[?:1.8.0_275]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486) ~[?:1.8.0_275]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_275]
at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) ~[?:1.8.0_275]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_275]
at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531) ~[?:1.8.0_275]
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:128) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:74) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:136) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:83) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:552) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:213) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:131) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:181) ~[selenium-chrome-driver-3.141.59.jar:?]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:168) ~[selenium-chrome-driver-3.141.59.jar:?]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:157) ~[selenium-chrome-driver-3.141.59.jar:?]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_275]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_275]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_275]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_275]
at org.apache.commons.lang3.reflect.ConstructorUtils.invokeExactConstructor(ConstructorUtils.java:182) ~[commons-lang3-3.11.jar:3.11]
at org.apache.commons.lang3.reflect.ConstructorUtils.invokeExactConstructor(ConstructorUtils.java:149) ~[commons-lang3-3.11.jar:3.11]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.lambda$build$0(Browser.java:232) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.commons.lang.SystemUtil.callWithProperty(SystemUtil.java:118) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.build(Browser.java:222) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
... 12 more
10:14:03.370 [Norconex Minimum Test Page/1] INFO CRAWLER_RUN_THREAD_END - Thread[Norconex Minimum Test Page/1,5,main]
10:14:03.371 [Norconex Minimum Test Page/1] INFO WebDriverHttpFetcher - Shutting down CHROME web driver.
Starting ChromeDriver 88.0.4324.96 (68dba2d8a0b149a1d3afac56fa74648032bcf46b-refs/branch-heads/4324@{#1784}) on port 18035
Only local connections are allowed.
Please see https://chromedriver.chromium.org/security-considerations for suggestions on keeping ChromeDriver safe.
ChromeDriver was started successfully.
10:14:03.504 [Norconex Minimum Test Page/2] ERROR Crawler - Problem in thread execution.
com.norconex.collector.core.CollectorException: Could not build web driver
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.build(Browser.java:237) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverSupplier.get(Browser.java:181) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.WebDriverHolder.getDriver(WebDriverHolder.java:74) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher.fetcherThreadBegin(WebDriverHttpFetcher.java:242) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.AbstractHttpFetcher.accept(AbstractHttpFetcher.java:127) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.AbstractHttpFetcher.accept(AbstractHttpFetcher.java:76) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.commons.lang.event.EventManager.doFire(EventManager.java:136) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.commons.lang.event.EventManager.fire(EventManager.java:117) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.commons.lang.event.EventManager.fire(EventManager.java:111) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.collector.core.crawler.Crawler$ProcessReferencesRunnable.run(Crawler.java:992) [norconex-collector-core-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_275]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_275]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_275]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_275]
at org.apache.commons.lang3.reflect.ConstructorUtils.invokeExactConstructor(ConstructorUtils.java:182) ~[commons-lang3-3.11.jar:3.11]
at org.apache.commons.lang3.reflect.ConstructorUtils.invokeExactConstructor(ConstructorUtils.java:149) ~[commons-lang3-3.11.jar:3.11]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.lambda$build$0(Browser.java:232) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.commons.lang.SystemUtil.callWithProperty(SystemUtil.java:118) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.build(Browser.java:222) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
... 12 more
Caused by: org.openqa.selenium.WebDriverException: unknown error: Chrome failed to start: exited abnormally.
(unknown error: DevToolsActivePort file doesn't exist)
(The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Build info: version: 3.141.59 , revision: e82be7d358 , time: 2018-11-14T08:17:03
System info: host: bmc-dev , ip: 127.0.1.1 , os.name: Linux , os.arch: amd64 , os.version: 5.4.0-51-generic , java.version: 1.8.0_275
Driver info: driver.version: ChromeDriver
remote stacktrace: #0 0x5604e949b199
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_275]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_275]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_275]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_275]
at org.openqa.selenium.remote.W3CHandshakeResponse.lambda$errorHandler$0(W3CHandshakeResponse.java:62) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.HandshakeResponse.lambda$getResponseFunction$0(HandshakeResponse.java:30) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.ProtocolHandshake.lambda$createSession$0(ProtocolHandshake.java:126) ~[selenium-remote-driver-3.141.59.jar:?]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_275]
at java.util.Spliterators$ArraySpliterator.tryAdvance(Spliterators.java:958) ~[?:1.8.0_275]
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126) ~[?:1.8.0_275]
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:499) ~[?:1.8.0_275]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:486) ~[?:1.8.0_275]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_275]
at java.util.stream.FindOps$FindOp.evaluateSequential(FindOps.java:152) ~[?:1.8.0_275]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_275]
at java.util.stream.ReferencePipeline.findFirst(ReferencePipeline.java:531) ~[?:1.8.0_275]
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:128) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.ProtocolHandshake.createSession(ProtocolHandshake.java:74) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:136) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:83) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:552) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.RemoteWebDriver.startSession(RemoteWebDriver.java:213) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:131) ~[selenium-remote-driver-3.141.59.jar:?]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:181) ~[selenium-chrome-driver-3.141.59.jar:?]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:168) ~[selenium-chrome-driver-3.141.59.jar:?]
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:157) ~[selenium-chrome-driver-3.141.59.jar:?]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_275]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_275]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_275]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_275]
at org.apache.commons.lang3.reflect.ConstructorUtils.invokeExactConstructor(ConstructorUtils.java:182) ~[commons-lang3-3.11.jar:3.11]
at org.apache.commons.lang3.reflect.ConstructorUtils.invokeExactConstructor(ConstructorUtils.java:149) ~[commons-lang3-3.11.jar:3.11]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.lambda$build$0(Browser.java:232) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at com.norconex.commons.lang.SystemUtil.callWithProperty(SystemUtil.java:118) ~[norconex-commons-lang-2.0.0-SNAPSHOT.jar:2.0.0-SNAPSHOT]
at com.norconex.collector.http.fetch.impl.webdriver.Browser$WebDriverBuilder.build(Browser.java:222) ~[norconex-collector-http-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
... 12 more
`
Thank you for your logs.
Are you running as root? If so, try with a regular user. Also, try with the full paths to Chrome and the Chrome driver.
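A sketch of what explicit paths might look like in the fetcher section. The Chrome location is the one reported in your log; the driver location is purely hypothetical, and the `browserPath` and `driverPath` element names are assumptions to verify against the WebDriverHttpFetcher documentation.

```xml
<fetcher class="com.norconex.collector.http.fetch.impl.webdriver.WebDriverHttpFetcher">
  <browser>chrome</browser>
  <!-- Chrome location as reported in the log above -->
  <browserPath>/usr/bin/google-chrome</browserPath>
  <!-- Hypothetical driver location; replace with where chromedriver is installed -->
  <driverPath>/usr/local/bin/chromedriver</driverPath>
  <threadWait>2 seconds</threadWait>
</fetcher>
```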
This appears to be the culprit:
(unknown error: DevToolsActivePort file doesn't exist)
I did a bit of research and it appears to be a frequent problem with Chrome on Linux. E.g.: https://stackoverflow.com/questions/50790733/unknown-error-devtoolsactiveport-file-doesnt-exist-error-while-executing-selen/50791503
I suggest you try a few of the fixes suggested for that error online. If some involve passing options via the crawler XML configuration, you can do so with:
<capabilities>
  <capability name="(capability name)">(capability value)</capability>
  <!-- multiple "capability" tags allowed -->
</capabilities>
I tried everything I could think of. It is a bit hard to see what is happening as there is not much logging. I am not even sure the capabilities were passed.
With Firefox as a working alternative I'm gonna put a pin in this one.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I managed to get Firefox running. To see whether some issues I had with Firefox would be resolved by using Google Chrome, I tried Chrome as well, but I could not get it working.
Google Chrome 88.0.4324.96 and webdriver
example URL https://www.voordeelvloeren.nl/faq/onderwerp/top-10