USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Caught Server refused connection at: http://localhost:8983/solr/crawldb #238

Closed: francesco1119 closed this issue 2 years ago

francesco1119 commented 2 years ago

Issue Description

Please describe your issue, along with:

It's very simple: the second command I ran from your guide didn't work.

How to reproduce it

I ran bash dockler.sh and the result was:

root@DS1515:/volume3/Docker_Volume/Sparkler# bash dockler.sh
Cant find docker image sparkler-local. Going to Fetch it
Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
latest: Pulling from uscdatascience/sparkler
Digest: sha256:4395aa8e69a220cd3bf52ada94aa6dc2ed3e84919470a007faf9cf80f89308eb
Status: Image is up to date for uscdatascience/sparkler:latest
docker.io/uscdatascience/sparkler:latest
Found image: 7bf3f592ca23
Going to launch the shell inside sparkler's docker container.
You can press CTRL-D to exit.
You can rerun this script to resume.
You can access solr at http://localhost:8983/solr when solr is running
You can spark master UI at http://localhost:4041/ when spark master is running

Some useful queries:

- Get stats on groups, status, depth:
    http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth

Inside docker, you can do the following:

/data/solr/bin/solr - command line tool for administering solr
    start -force -> start solr
    stop -force -> stop solr
    status -force -> get status of solr
    restart -force -> restart solr

/data/sparkler/bin/sparkler.sh - command line interface to sparkler
   inject - inject seed urls
   crawl - launch a crawl job

As a second step I ran /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news' and the result was:

bash-4.2$ /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-11-27 23:18:42 INFO  PluginService$:53 - Loading plugins...
2021-11-27 23:18:42 INFO  PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2021-11-27 23:18:42 WARN  PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-regex
2021-11-27 23:18:42 INFO  PluginService$:73 - Extensions found: []
2021-11-27 23:18:42 DEBUG PluginService$:68 - Loading urlfilter-samehost
2021-11-27 23:18:42 INFO  PluginService$:73 - Extensions found: []
2021-11-27 23:18:42 INFO  PluginService$:82 - Recognised Plugins: Map()
2021-11-27 23:18:42 INFO  Injector$:108 - Injecting 1 seeds
2021-11-27 23:18:43 WARN  SolrProxy:93 - Caught Server refused connection at: http://localhost:8983/solr/crawldb while adding beans, trying to add one by one
2021-11-27 23:18:43 WARN  SolrProxy:100 - (SKIPPED) Server refused connection at: http://localhost:8983/solr/crawldb while adding [!!!edu.usc.irds.sparkler.model.Resource@26a529dc=>java.util.IllegalFormatConversionException:f != java.util.HashMap!!!]
2021-11-27 23:18:43 DEBUG SolrProxy:101 - Server refused connection at: http://localhost:8983/solr/crawldb
org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:177) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:285) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at org.apache.solr.client.solrj.SolrClient.addBean(SolrClient.java:267) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        at edu.usc.irds.sparkler.storage.solr.SolrProxy.addResources(SolrProxy.scala:97) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:111) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.service.Injector.main(Injector.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
        at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
        at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        at java.lang.reflect.Method.invoke(Method.java:567) ~[?:?]
        at edu.usc.irds.sparkler.Main$.main(Main.scala:50) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
        at edu.usc.irds.sparkler.Main.main(Main.scala) [sparkler-app.sparkler-app-0.3.1-SNAPSHOT.jar:0.3.1-SNAPSHOT]
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        ... 19 more
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.pollConnect(Native Method) ~[?:?]
        at sun.nio.ch.Net.pollConnectNow(Net.java:579) ~[?:?]
        at sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542) ~[?:?]
        at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597) ~[?:?]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339) ~[?:?]
        at java.net.Socket.connect(Socket.java:603) ~[?:?]
        at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[org.apache.httpcomponents.httpclient-4.5.12.jar:4.5.12]
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564) ~[org.apache.solr.solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
        ... 19 more
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
        at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://localhost:8983/solr/crawldb
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:672)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
        at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:504)
        at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:479)
        at edu.usc.irds.sparkler.storage.solr.SolrProxy.commitCrawlDb(SolrProxy.scala:112)
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:112)
        at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
        at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:43)
        at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:162)
        at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
        ... 6 more
Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
        ... 18 more
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.Net.pollConnect(Native Method)
        at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:579)
        at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
        at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
        at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:339)
        at java.base/java.net.Socket.connect(Socket.java:603)
        at org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:75)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
        ... 28 more
2021-11-27 23:18:43 WARN  PluginService$:49 - Stopping all plugins... Runtime is about to exit.

Environment and Version Information

Please indicate relevant versions, including (if relevant):

Server:
 Engine:
  Version: 20.10.3
  API version: 1.41 (minimum version 1.12)
  Go version: go1.15.6
  Git commit: e7f7c95
  Built: Fri Jun 18 08:26:10 2021
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: v1.4.3
  GitCommit: b1dc45ec561bd867c4805eee786caab7cc83acae
 runc:
  Version: v1.0.0-rc93
  GitCommit: 89783e1862a2cc04647ab15b6e88a0af3d66fac3
 docker-init:
  Version: 0.19.0
  GitCommit: 12b6a20
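
For reference, this is the server half of a docker version report. A minimal way to capture the same information, assuming the docker CLI is on the PATH:

    docker version                               # full client + server report, as pasted above
    docker info --format '{{.ServerVersion}}'    # just the engine version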

Any external links for reference

Nah, just tell me whether Java and Spark are inside your Docker image or not. If they are not and I have to install them, you can close this ticket.
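
If it helps, here is a quick way to check that directly against the published image (a sketch; it assumes the image lets you override its default command, which the interactive shell that dockler.sh opens suggests it does):

    docker run --rm uscdatascience/sparkler:latest java -version      # is a JRE bundled?
    docker run --rm uscdatascience/sparkler:latest ls /data/sparkler  # Sparkler (with its bundled Spark jars) lives here per the banner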

Contributing

I'm willing to contribute

thammegowda commented 2 years ago

@francesco1119 Thanks for reaching out. Server refused connection at: http://localhost:8983/solr/crawldb says the solr service is not running. Please check/debug why solr is not starting up, or, if it is running, why you are getting this exception:

Caused by: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8983 [localhost/127.0.0.1] failed: Connection refused
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:564)
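
A quick way to verify this from inside the container, using the admin commands the dockler.sh banner already lists (the curl check is my own addition, not from the banner):

    /data/solr/bin/solr status -force    # is Solr up?
    /data/solr/bin/solr start -force     # if not, start it and re-check
    # The crawldb core should now answer instead of refusing the connection:
    curl -s 'http://localhost:8983/solr/crawldb/select?q=*:*&rows=0'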
francesco1119 commented 2 years ago

Hi @thammegowda and thank you for your help. The documentation doesn't say that I have to install solr. In fact, I thought it came within the docker image...

I followed your documentation and it says that installing solr is an option.

As I'm a new user I can help you rewrite your documentation, but honestly I have no idea why solr is not starting.

thammegowda commented 2 years ago

@francesco1119 the dockler.sh script is supposed to start the solr service. I just ran it now and got:

bash dockler.sh
Cant find docker image sparkler-local. Going to Fetch it
Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
[...truncated]
Found image: 7bf3f592ca23
No container is running for 7bf3f592ca23. Starting it...
Starting solr server inside the container
Waiting up to 180 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=61). Happy searching!

In the last part of the output, it starts solr and waits until solr is up before going to the next step. I don't see these messages in your output.
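
Conceptually, that wait is just a poll loop like the following (a sketch of the idea, not dockler.sh's actual code):

    # Poll up to 180 seconds until Solr answers on port 8983, then give up.
    for i in $(seq 1 180); do
      curl -sf http://localhost:8983/solr/admin/info/system >/dev/null && break
      sleep 1
    done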

francesco1119 commented 2 years ago

@thammegowda, I tried again.

I ran bash dockler.sh and received:

Cant find docker image sparkler-local. Going to Fetch it
Fetching uscdatascience/sparkler:latest and tagging as sparkler-local
latest: Pulling from uscdatascience/sparkler
Digest: sha256:4395aa8e69a220cd3bf52ada94aa6dc2ed3e84919470a007faf9cf80f89308eb
Status: Image is up to date for uscdatascience/sparkler:latest
docker.io/uscdatascience/sparkler:latest
Found image: 7bf3f592ca23
No container is running for 7bf3f592ca23. Starting it...
Starting solr server inside the container
Waiting up to 180 seconds to see Solr running on port 8983 [-]
Started Solr server on port 8983 (pid=62). Happy searching!

Going to launch the shell inside sparkler's docker container.
You can press CTRL-D to exit.
You can rerun this script to resume.
You can access solr at http://localhost:8983/solr when solr is running
You can spark master UI at http://localhost:4041/ when spark master is running

Some useful queries:

- Get stats on groups, status, depth:
    http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth

Inside docker, you can do the following:

/data/solr/bin/solr - command line tool for administering solr
    start -force -> start solr
    stop -force -> stop solr
    status -force -> get status of solr
    restart -force -> restart solr

/data/sparkler/bin/sparkler.sh - command line interface to sparkler
   inject - inject seed urls
   crawl - launch a crawl job

And yes, everything is fine, solr is up and running:

[screenshot of the Solr admin UI, up and running: https://user-images.githubusercontent.com/3397477/144305407-292f9e52-9fc4-4471-8f17-a4e110c3d79d.png]

I then ran /data/sparkler/bin/sparkler.sh inject -id 1 -su 'http://www.bbc.com/news' and the command executed correctly. Or at least that's what I believe:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-12-01 19:57:12 INFO  PluginService$:53 - Loading plugins...
2021-12-01 19:57:12 INFO  PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2021-12-01 19:57:13 WARN  PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
2021-12-01 19:57:13 DEBUG PluginService$:68 - Loading urlfilter-regex
2021-12-01 19:57:13 INFO  PluginService$:73 - Extensions found: []
2021-12-01 19:57:13 DEBUG PluginService$:68 - Loading urlfilter-samehost
2021-12-01 19:57:13 INFO  PluginService$:73 - Extensions found: []
2021-12-01 19:57:13 INFO  PluginService$:82 - Recognised Plugins: Map()
2021-12-01 19:57:13 INFO  Injector$:108 - Injecting 1 seeds
>>jobId = 1
2021-12-01 19:57:13 WARN  PluginService$:49 - Stopping all plugins... Runtime is about to exit.
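
One way to confirm the inject actually reached Solr is to hit the crawldb core directly (a sketch based on the "useful queries" URL in the banner above):

    curl 'http://localhost:8983/solr/crawldb/query?q=*:*&rows=5'    # should list the injected seed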

And when I move on to the very last step, /data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2 # id=1, top 100 URLs, do -i=2 iterations, this is what I get:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.spark.spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2021-12-01 19:58:24 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-12-01 19:58:26 INFO  Crawler$:160 - Setting local job: {User-Agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Sparkler/${project.version}, Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8, Accept-Language=en-US,en}
2021-12-01 19:58:26 INFO  Crawler$:174 - Committing crawldb..
2021-12-01 19:58:26 INFO  Crawler$:219 - Starting the job:1, task:906e00e1-7369-4a64-9593-17fe85d0566a
2021-12-01 19:58:26 INFO  MemexCrawlDbRDD$:54 - selecting 1 out of 1
2021-12-01 19:58:27 DEBUG SolrResultIterator$:63 - Query status:UNFETCHED, Start = 0
2021-12-01 19:58:27 DEBUG SolrResultIterator$:77 - Reached the end of result set
2021-12-01 19:58:27 DEBUG SolrResultIterator$:79 - closing solr client.
2021-12-01 19:58:27 WARN  BlockManager:69 - Block rdd_3_0 could not be removed as it was not found on disk or in memory
2021-12-01 19:58:27 ERROR Executor:94 - Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[org.scala-lang.scala-library-2.12.12.jar:?]
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[org.scala-lang.scala-library-2.12.12.jar:?]
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[org.scala-lang.scala-library-2.12.12.jar:?]
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.scheduler.Task.run(Task.scala:127) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) [org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
2021-12-01 19:58:27 WARN  TaskSetManager:69 - Lost task 0.0 in stage 1.0 (TID 1, 969ed83b7c3d, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)

2021-12-01 19:58:27 ERROR TaskSetManager:73 - Task 0 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
        at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 969ed83b7c3d, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2152)
        at edu.usc.irds.sparkler.pipeline.Crawler.score(Crawler.scala:254)
        at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$1(Crawler.scala:231)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:179)
        at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
        at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
        at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:50)
        at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:338)
        at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
        ... 6 more
Caused by: java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)

I'm literally following your documentation.

lewismc commented 2 years ago

NoSuchMethodError means this is classpath related. We need to debug why this is happening. It looks like it was working previously and may have broken over time. @francesco1119, we recognize that you are literally following the documentation and we will try to help you as much as possible. Thanks for your patience. I would, however, encourage you to try to understand why Solr was not starting previously. Maybe an issue with the amount of RAM allowed within your Docker daemon?

Does dockler need to be released against the master branch?
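
One place to start that debugging would be to look for a conflicting lz4 jar on the classpath (a sketch; the lib path comes from the SLF4J lines in the logs above, the rest is an assumption):

    # Spark 3.0.1 expects lz4-java's net.jpountz.lz4.LZ4BlockInputStream; an
    # older lz4 artifact earlier on the classpath would explain the NoSuchMethodError.
    ls /data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib | grep -i lz4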

On Wed, Dec 1, 2021 at 12:05 Francesco Mantovani @.***> wrote:

@thammegowda https://github.com/thammegowda , I tried again.

I run bash dockler.sh and I receive:

Cant find docker image sparkler-local. Going to Fetch it Fetching uscdatascience/sparkler:latest and tagging as sparkler-local latest: Pulling from uscdatascience/sparkler Digest: sha256:4395aa8e69a220cd3bf52ada94aa6dc2ed3e84919470a007faf9cf80f89308eb Status: Image is up to date for uscdatascience/sparkler:latestdocker.io/uscdatascience/sparkler:latest Found image: 7bf3f592ca23 No container is running for 7bf3f592ca23. Starting it... Starting solr server inside the container Waiting up to 180 seconds to see Solr running on port 8983 [-] Started Solr server on port 8983 (pid=62). Happy searching!

Going to launch the shell inside sparkler's docker container. You can press CTRL-D to exit. You can rerun this script to resume. You can access solr at http://localhost:8983/solr when solr is running You can spark master UI at http://localhost:4041/ when spark master is running

Some useful queries:

Inside docker, you can do the following:

/data/solr/bin/solr - command line tool for administering solr start -force -> start solr stop -force -> stop solr status -force -> get status of solr restart -force -> restart solr

/data/sparkler/bin/sparkler.sh - command line interface to sparkler inject - inject seed urls crawl - launch a crawl job

And yes, everything is fine, solr is up and running:

[image: image] https://user-images.githubusercontent.com/3397477/144305407-292f9e52-9fc4-4471-8f17-a4e110c3d79d.png

I then run /data/sparkler/bin/sparkler.sh inject -id 1 -su ' http://www.bbc.com/news' and the command executes correctly. Or at least this is what I believe:

SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2021-12-01 19:57:12 INFO PluginService$:53 - Loading plugins... 2021-12-01 19:57:12 INFO PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost] 2021-12-01 19:57:13 WARN PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit) 2021-12-01 19:57:13 DEBUG PluginService$:68 - Loading urlfilter-regex 2021-12-01 19:57:13 INFO PluginService$:73 - Extensions found: [] 2021-12-01 19:57:13 DEBUG PluginService$:68 - Loading urlfilter-samehost 2021-12-01 19:57:13 INFO PluginService$:73 - Extensions found: [] 2021-12-01 19:57:13 INFO PluginService$:82 - Recognised Plugins: Map() 2021-12-01 19:57:13 INFO Injector$:108 - Injecting 1 seeds

jobId = 1 2021-12-01 19:57:13 WARN PluginService$:49 - Stopping all plugins... Runtime is about to exit.

And when I pass to the very last step with /data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2 # id=1, top 100 URLs, do -i=2 iterations :

SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.spark.spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int) WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release 2021-12-01 19:58:24 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2021-12-01 19:58:26 INFO Crawler$:160 - Setting local job: {User-Agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Sparkler/${project.version}, Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8, Accept-Language=en-US,en} 2021-12-01 19:58:26 INFO Crawler$:174 - Committing crawldb.. 2021-12-01 19:58:26 INFO Crawler$:219 - Starting the job:1, task:906e00e1-7369-4a64-9593-17fe85d0566a 2021-12-01 19:58:26 INFO MemexCrawlDbRDD$:54 - selecting 1 out of 1 2021-12-01 19:58:27 DEBUG SolrResultIterator$:63 - Query status:UNFETCHED, Start = 0 2021-12-01 19:58:27 DEBUG SolrResultIterator$:77 - Reached the end of result set 2021-12-01 19:58:27 DEBUG SolrResultIterator$:79 - closing solr client. 2021-12-01 19:58:27 WARN BlockManager:69 - Block rdd_3_0 could not be removed as it was not found on disk or in memory 2021-12-01 19:58:27 ERROR Executor:94 - Exception in task 0.0 in stage 1.0 (TID 1) java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)' at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1] at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1] at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1] at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1] at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1] at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1] at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1] at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[org.scala-lang.scala-library-2.12.12.jar:?] 
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[org.scala-lang.scala-library-2.12.12.jar:?]
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[org.scala-lang.scala-library-2.12.12.jar:?]
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.scheduler.Task.run(Task.scala:127) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) [org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
2021-12-01 19:58:27 WARN TaskSetManager:69 - Lost task 0.0 in stage 1.0 (TID 1, 969ed83b7c3d, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)

2021-12-01 19:58:27 ERROR TaskSetManager:73 - Task 0 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:567)
        at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
        at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 969ed83b7c3d, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2152)
        at edu.usc.irds.sparkler.pipeline.Crawler.score(Crawler.scala:254)
        at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$1(Crawler.scala:231)
        at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
        at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:179)
        at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
        at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
        at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:50)
        at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:338)
        at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
        ... 6 more
Caused by: java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
        at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
        at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
        at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
        at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
        at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
        at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
        at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
        at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
        at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
        at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:830)

I'm literally following your documentation.


francesco1119 commented 2 years ago

Thank you @lewismc,

I installed the latest version of Docker today; this is the only thing that has changed since yesterday. So maybe this environment-level change triggered something that allowed me to get to the next step.

...we will never know what that was. Sorry, I didn't note which Docker version I tested with yesterday; it might have been up to six months old, but not more.

I'm watching your repository and will definitely try your next release as soon as it's out. Can you please confirm that you have tried this on your end, and whether you see the same behavior with a fresh installation?

Otherwise, if you can't reproduce it, I'll keep investigating.

francesco1119 commented 2 years ago

@lewismc , I see where the problem is:

Caused by: java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'

It is mentioned on the very first page of your GitHub project:


https://github.com/USCDataScience/sparkler/commit/56ad89110aa6e21d1530698ee81c3248a1327e63

            <exclusions>
                <exclusion>
                    <groupId>net.jpountz.lz4</groupId>
                    <artifactId>lz4</artifactId>
                </exclusion>
            </exclusions>

The exclusion of that library was hardcoded in the POM.
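
For anyone who wants to confirm the conflict locally before a fixed image is released, a minimal sketch: list the lz4 jars on Sparkler's classpath inside the container. The lib path follows the 0.3.1-SNAPSHOT layout shown earlier in this thread; everything else here is an assumption, not the project's official procedure.

# Hypothetical check from a shell inside the sparkler container:
ls /data/sparkler/sparkler-app-*/lib | grep -i lz4
# Seeing both the old net.jpountz lz4 jar (pulled in by Kafka) and Spark's
# newer lz4-java jar would explain the NoSuchMethodError: two incompatible
# copies of LZ4BlockInputStream on the same classpath.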

thammegowda commented 2 years ago

I believe this issue is due to Spark and Kafka depending on incompatible versions of the lz4 library; see https://stackoverflow.com/a/51052507/1506477. Excluding lz4 from Kafka is the right thing to do (hence the exclusion is good!).

However, on Docker Hub (https://hub.docker.com/repository/docker/uscdatascience/sparkler) I see the Docker image was last updated 6 months ago, while this exclusion commit is newer. I think rebuilding the Docker image and releasing it should fix this: https://github.com/USCDataScience/sparkler/wiki/Build-and-Deploy#docker-build
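
A minimal sketch of such a rebuild, assuming a standard Maven build; the Dockerfile location and image tag below are assumptions, the wiki page above is the authoritative procedure.

# Hypothetical local rebuild so the image picks up the lz4-exclusion commit:
git clone https://github.com/USCDataScience/sparkler.git
cd sparkler
mvn clean package -DskipTests        # rebuild the app with the fixed POM
docker build -t sparkler-local -f sparkler-deployment/docker/Dockerfile .  # path assumed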

lewismc commented 2 years ago

Correct Thamme. Who can do this? Is there documentation for this?


francesco1119 commented 2 years ago

Yes @thammegowda , the link you provided has an update dating back to September that says:

Update: This appears to be an issue with Kafka 0.11.x.x and earlier version. As of 1.x.x Kafka seems to have moved away from using the problematic net.jpountz.lz4 library. Therefore, using latest Kafka (1.x) with latest Spark (2.3.x) should not have this issue.

Hence the latest Spark with the latest Kafka should not have this problem.
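
For anyone who wants to verify this against a source checkout, one way (a sketch, assuming a Maven build of the project) is to ask Maven which dependency still drags in the old lz4 artifact:

# Show every path in the dependency graph that pulls in net.jpountz.lz4;
# an empty result means the exclusion (or a newer Kafka) already removed it.
mvn dependency:tree -Dincludes=net.jpountz.lz4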

I look forward to testing your new image.

thammegowda commented 2 years ago

@lewismc The docs are here: https://github.com/USCDataScience/sparkler/blob/master/Release-Checklist.md. I believe @buggtb has been releasing Docker images since I left IRDS/JPL.

lewismc commented 2 years ago

@buggtb any chance of you performing a release of the new convenience binaries? Thanks

ravindrabajpai commented 2 years ago

I am also facing this issue. Since the fix is already merged, I tried to build the Docker image from the current code, and it fails here:

Step 8/13 : COPY ./sparkler-ui/sparkler-dashboard/sparkler-ui-*.war /data/solr/server/solr-webapp/sparkler
COPY failed: no source files were specified

Can you please share the steps to build sparkler-ui? I don't see a sparkler-dashboard directory under sparkler-ui.

francesco1119 commented 2 years ago

Hi @thammegowda & @lewismc , let me know when you have a stable Docker release and I will test it on my end.

Thank you

thammegowda commented 2 years ago

Hi all

My dissertation defense is this month, so I am totally focused on that. I will have more availability for this project in April (after my dissertation).

@buggtb @karanjeets @chrismattmann any help or suggestions here, sir/bro?

francesco1119 commented 2 years ago

Focus on your dissertation; I'm busy too. Let's keep in touch. Thank you

francesco1119 commented 2 years ago

Hi @thammegowda, how is it going? Have you had time to take a look at Sparkler? I haven't experienced the same issue since, but I have run into new problems.

francesco1119 commented 2 years ago

Hello @lewismc , the error seems to have changed since last year.

If I execute:

sudo docker run -v elastic:/elasticsearch-7.17.0/data ghcr.io/uscdatascience/sparkler/sparkler:main inject -id myid -su 'http://www.bbc.com/news'

the error now is:

15:52:08.623 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config'
15:52:08.624 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'null'
15:52:08.625 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'fetcher-chrome'
15:52:08.626 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'urlfilter-regex'
15:52:08.627 [main] DEBUG org.pf4j.AbstractExtensionFinder - Loading class 'edu.usc.irds.sparkler.plugin.RegexURLFilter' using class loader 'org.pf4j.PluginClassLoader@158a8276'
15:52:08.637 [main] DEBUG org.pf4j.AbstractExtensionFinder - Checking extension type 'edu.usc.irds.sparkler.plugin.RegexURLFilter'
15:52:08.639 [main] DEBUG org.pf4j.AbstractExtensionFinder - No extensions found for extension point 'edu.usc.irds.sparkler.Config'
15:52:08.639 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'databricks-api'
15:52:08.640 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'fetcher-htmlunit'
15:52:08.641 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'url-injector'
15:52:08.642 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'urlfilter-samehost'
15:52:08.643 [main] DEBUG org.pf4j.AbstractExtensionFinder - Loading class 'edu.usc.irds.sparkler.plugin.UrlFilterSameHost' using class loader 'org.pf4j.PluginClassLoader@5fbe4146'
15:52:08.644 [main] DEBUG org.pf4j.AbstractExtensionFinder - Checking extension type 'edu.usc.irds.sparkler.plugin.UrlFilterSameHost'
15:52:08.645 [main] DEBUG org.pf4j.AbstractExtensionFinder - No extensions found for extension point 'edu.usc.irds.sparkler.Config'
15:52:08.646 [main] DEBUG org.pf4j.AbstractExtensionFinder - Finding extensions of extension point 'edu.usc.irds.sparkler.Config' for plugin 'scorer-dd-svn'
15:52:08.647 [main] DEBUG org.pf4j.AbstractExtensionFinder - No extensions found for extension point 'edu.usc.irds.sparkler.Config'
15:52:08.822 [main] INFO edu.usc.irds.sparkler.service.Injector$ - Injecting 1 seeds
15:52:12.990 [main] DEBUG org.apache.http.impl.nio.client.MainClientExec - [exchange: 1] start execution
15:52:13.007 [main] DEBUG org.apache.http.client.protocol.RequestAddCookies - CookieSpec selected: default
15:52:13.038 [main] DEBUG org.apache.http.client.protocol.RequestAuthCache - Re-using cached 'basic' auth scheme for http://localhost:9200
15:52:13.040 [main] DEBUG org.apache.http.client.protocol.RequestAuthCache - No credentials for preemptive authentication
15:52:13.041 [main] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] Request connection for {}->http://localhost:9200
15:52:13.045 [main] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection request: [route: {}->http://localhost:9200][total kept alive: 0; route allocated: 0 of 10; total allocated: 0 of 30]
15:52:13.088 [pool-2-thread-1] DEBUG org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager - Connection request failed
java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
        at java.base/java.lang.Thread.run(Thread.java:829)
15:52:13.089 [pool-2-thread-1] DEBUG org.apache.http.impl.nio.client.InternalHttpAsyncClient - [exchange: 1] connection request failed
15:52:13.092 [pool-2-thread-1] DEBUG org.elasticsearch.client.RestClient - request [GET http://localhost:9200/] failed
java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
        at java.base/java.lang.Thread.run(Thread.java:829)
15:52:13.095 [pool-2-thread-1] DEBUG org.elasticsearch.client.RestClient - added [[host=http://localhost:9200]] to blacklist
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at edu.usc.irds.sparkler.Main$.main(Main.scala:71)
        at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: ElasticsearchException[java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused]; nested: ExecutionException[java.net.ConnectException: Connection refused]; nested: ConnectException[Connection refused];
        at org.elasticsearch.client.RestHighLevelClient.performClientRequest(RestHighLevelClient.java:2695)
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:2171)
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:2137)
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:2105)
        at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:1241)
        at edu.usc.irds.sparkler.storage.elasticsearch.ElasticsearchProxy.$anonfun$commitCrawlDb$1(ElasticsearchProxy.scala:175)
        at edu.usc.irds.sparkler.storage.elasticsearch.ElasticsearchProxy.$anonfun$commitCrawlDb$1$adapted(ElasticsearchProxy.scala:172)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at edu.usc.irds.sparkler.storage.elasticsearch.ElasticsearchProxy.commitCrawlDb(ElasticsearchProxy.scala:172)
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:137)
        at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
        at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
        at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:39)
        at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:188)
        at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
        ... 6 more
Caused by: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257)
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:244)
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:75)
        at org.elasticsearch.client.RestHighLevelClient.performClientRequest(RestHighLevelClient.java:2692)
        ... 22 more
Caused by: java.net.ConnectException: Connection refused
        at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
        at java.base/java.lang.Thread.run(Thread.java:829)
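
The log shows Sparkler now targets Elasticsearch at http://localhost:9200 and the connection is refused. One likely cause worth noting: inside a container, localhost refers to the container itself, so an Elasticsearch node running on the host (or in a sibling container) is not reachable at that address. A minimal sketch, assuming Elasticsearch 7.17.0 (matching the data path in the run command above); the network and container names are illustrative, and how to point Sparkler at a non-localhost storage URL is config-specific and not covered here:

# Run a single-node Elasticsearch and check it answers before injecting.
docker network create sparkler-net
docker run -d --name es --network sparkler-net -p 9200:9200 \
    -e "discovery.type=single-node" \
    docker.elastic.co/elasticsearch/elasticsearch:7.17.0
curl -s http://localhost:9200/   # from the host: should print cluster info JSON
# A sparkler container attached to sparkler-net would reach it at http://es:9200,
# not http://localhost:9200.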

I also have a question: on Docker Hub I find two different repositories:

Which is which?

francesco1119 commented 2 years ago

It's fixed now