Closed by nhandyal 3 years ago
When I log into the container I see some issues
% docker exec -it sparkler-dev /bin/bash
...
/usr/local/bin/greeting.dev.sh: line 3: bad substitution: no closing "`" in ` | '__| |/ / |/ _ \ '__|
____) | |_) | (_| | | | <| | __/ |
|_____/| .__/ \__,_|_| |_|\_\_|\___|_|
| |
|_|
You can access solr at http://localhost:8983/solr when solr is running
You can spark master UI at http://localhost:4041 when spark master is running
Some useful queries:
- Get stats on groups, status, depth:
http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth
Inside docker, you can do the following:
solr - command line tool for administering solr
start -force -> start solr
stop -force -> stop solr
status -force -> get status of solr
restart -force -> restart solr
sparkler - command line interface to sparkler
inject - inject seed urls
crawl - launch a crawl job
build sparkler
cd sparkler-core && mvn install
Logging into the container, building sparkler, and restarting Solr all go well. When I attempt to inject, I get a fatal error:
bash-4.2# sparkler inject -id 1 -su 'https://www.jpl.nasa.gov/'
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2021-02-22 18:19:48 INFO PluginService$:53 - Loading plugins...
2021-02-22 18:19:49 INFO PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2021-02-22 18:19:49 WARN PluginService$:65 - 6 extra plugin(s) available but not activated: Set(fetcher-chrome, template-plugin, url-injector, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
2021-02-22 18:19:49 DEBUG PluginService$:68 - Loading urlfilter-regex
2021-02-22 18:19:49 INFO PluginService$:73 - Extensions found: [edu.usc.irds.sparkler.plugin.RegexURLFilter@74eb909f]
2021-02-22 18:19:49 INFO PluginService$:76 - Extensions lookup: PluginWrapper [descriptor=PluginDescriptor [pluginId=urlfilter-regex, pluginClass=edu.usc.irds.sparkler.plugin.RegexURLFilterActivator, version=0.2.2-SNAPSHOT, provider=edu.usc.irds.sparkler.plugin, dependencies=[], description=, requires=*, license=null], pluginPath=/data/sparkler-core/bin/../build/plugins/urlfilter-regex-0.2.2-SNAPSHOT.jar].getPluginId
2021-02-22 18:19:49 INFO PluginService$:77 - Extensions id lookup: edu.usc.irds.sparkler.plugin.RegexURLFilter@74eb909f.getClass.getName
2021-02-22 18:19:49 DEBUG PluginService$:68 - Loading urlfilter-samehost
2021-02-22 18:19:49 INFO PluginService$:73 - Extensions found: [edu.usc.irds.sparkler.plugin.UrlFilterSameHost@76012793]
2021-02-22 18:19:49 INFO PluginService$:76 - Extensions lookup: PluginWrapper [descriptor=PluginDescriptor [pluginId=urlfilter-samehost, pluginClass=edu.usc.irds.sparkler.plugin.UrlFilterSameHostActivator, version=0.2.2-SNAPSHOT, provider=edu.usc.irds.sparkler.plugin, dependencies=[], description=, requires=*, license=null], pluginPath=/data/sparkler-core/bin/../build/plugins/urlfilter-samehost-0.2.2-SNAPSHOT.jar].getPluginId
2021-02-22 18:19:49 INFO PluginService$:77 - Extensions id lookup: edu.usc.irds.sparkler.plugin.UrlFilterSameHost@76012793.getClass.getName
2021-02-22 18:19:49 INFO PluginService$:82 - Recognised Plugins: Map(urlfilter-regex -> edu.usc.irds.sparkler.plugin.RegexURLFilter, urlfilter-samehost -> edu.usc.irds.sparkler.plugin.UrlFilterSameHost)
2021-02-22 18:19:49 INFO Injector$:110 - Injecting 1 seeds
2021-02-22 18:19:49 WARN SolrProxy:43 - Caught Error from server at http://localhost:8983/solr/crawldb: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/crawldb/update</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>
</body>
</html>
while adding beans, trying to add one by one
Exception in thread "main" java.lang.reflect.InvocationTargetException
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:567)
at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/crawldb/update</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>
</body>
</html>
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:629)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:504)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:479)
at edu.usc.irds.sparkler.service.SolrProxy.commitCrawlDb(SolrProxy.scala:62)
at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:114)
at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:45)
at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:164)
at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
... 6 more
2021-02-22 18:19:49 WARN PluginService$:49 - Stopping all plugins... Runtime is about to exit.
The problem is that the crawldb core does not exist.
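The 404 for /solr/crawldb/update in the log above is consistent with Solr having no core named crawldb loaded. One way to confirm this before injecting is to look at the HTTP status of a zero-row query against the core. The sketch below assumes Solr listens on http://localhost:8983/solr, as the greeting text says; the function names are made up for illustration and are not part of sparkler, and a 404 is only treated as "most likely missing", not as proof.

```shell
#!/bin/sh
# Sketch: guess whether a Solr core is present from the HTTP status of a
# zero-row query. Function names are illustrative, not part of sparkler.

SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"   # assumed default

# Print the HTTP status code of a zero-row query against core $1.
# curl prints 000 when the connection itself fails.
core_http_status() {
  curl -s -o /dev/null -w '%{http_code}' "${SOLR_URL}/$1/query?q=*:*&rows=0"
}

# Map an HTTP status code to a short diagnosis.
describe_core_status() {
  case "$1" in
    200) echo "core is present" ;;
    404) echo "core is missing" ;;
    000) echo "solr is not reachable" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Usage: describe_core_status "$(core_http_status crawldb)"
```

Keeping the network call and the interpretation in separate functions makes it possible to check the interpretation without a running Solr.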
Also in the above commentary you state
cd sparkler-core
docker-compose --file docker-compose.dev.yml up --detach
However when I do this I get bash: docker-compose: command not found
We need to update the documentation to install docker-compose
via the correct package manager inside of the Docker machine.
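Until the documentation covers installation, a small pre-flight check can make the "command not found" failure easier to diagnose. This is a minimal sketch, not something shipped with sparkler: the function names are hypothetical, and the fallback to `docker compose` assumes a newer Docker installation where Compose is available as a plugin.

```shell
#!/bin/sh
# Sketch: report which Compose command, if any, is available on this machine.

# Succeeds when the command named $1 is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Print the Compose invocation to use, or fail with a hint if none is found.
compose_cmd() {
  if have_cmd docker-compose; then
    echo "docker-compose"
  elif have_cmd docker && docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  else
    echo "docker-compose not found; install it before running the dev setup" >&2
    return 1
  fi
}

# Usage:
#   cd sparkler-core
#   $(compose_cmd) --file docker-compose.dev.yml up --detach
```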
Updated summary to reflect new changes. Run instructions have changed.
However when I do this I get bash: docker-compose: command not found
Updated the wiki to include a section on installing docker-compose. Also added a message in https://github.com/USCDataScience/sparkler/pull/214/files#diff-7678c21ccfb7190f82e6d92325762b9cbe38d4a894ed92a74a56d70a2a7792cb with a link to the wiki.
The problem is that the crawldb core does not exist.
These changes are meant to get sparkler to work with elasticsearch + kibana, therefore solr is not required. As a result we removed solr from the sparkler image for elasticsearch. The inject / crawl commands do not work at the moment because we have not configured sparkler to work with ES.
Added a new page to the project's wiki describing how to start the sparkler + elasticsearch + kibana docker-compose network: https://github.com/USCDataScience/sparkler/wiki/Elasticsearch-Backend
@buggtb @thammegowda any comments?
Quick scan LGTM +1
What changes were proposed in this pull request?
Is this related to an already existing issue on sparkler?
Related to issue 212
How was this patch tested?
development container
mvn clean install
docker service with elasticsearch
observe that 3 containers are running: one each for sparkler, elasticsearch, and kibana
docker container ls
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED          STATUS          PORTS                                            NAMES
ed44c41f2a97   elasticsearch_sparkler                                 "/data/start.sh"         23 minutes ago   Up 23 minutes   0.0.0.0:4041->4041/tcp                           sparkler-elastic
37dcb87a9198   docker.elastic.co/kibana/kibana:7.11.1                 "/bin/tini -- /usr/l…"   23 minutes ago   Up 23 minutes   0.0.0.0:5601->5601/tcp                           kibana
2d5bd9ee19f4   docker.elastic.co/elasticsearch/elasticsearch:7.11.1   "/bin/tini -- /usr/l…"   23 minutes ago   Up 23 minutes   0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp   elasticsearch
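The visual check of the container listing can also be scripted. The sketch below reads container names on standard input and prints any expected name that is absent; the function name is hypothetical, and the three names are taken from the NAMES column of the listing above.

```shell
#!/bin/sh
# Sketch: list expected containers that are absent from `docker container ls`.

# Reads running container names on stdin (one per line) and prints each
# expected name (given as arguments) that is not among them.
missing_containers() {
  _mc_running="$(cat)"
  for _mc_name in "$@"; do
    # -x matches whole lines only, so "elasticsearch" does not match
    # "sparkler-elastic"; -F treats the name as a fixed string.
    if ! printf '%s\n' "$_mc_running" | grep -qxF -- "$_mc_name"; then
      echo "$_mc_name"
    fi
  done
}

# Usage:
#   docker container ls --format '{{.Names}}' |
#     missing_containers sparkler-elastic elasticsearch kibana
```

Empty output means all three containers are up.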
verify the elasticsearch service is reachable from sparkler
log into the sparkler container
docker exec -it sparkler-elastic /bin/bash
curl http://elasticsearch:9200/status
verify the elasticsearch service is reachable from localhost
curl http://localhost:9200/status
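Elasticsearch can take a while to start accepting connections after the containers come up, so a single curl may fail even when the setup is fine. A small retry wrapper is one way to handle that. This is a sketch under assumptions: the `retry` function is hypothetical, and the usage line probes `/_cluster/health`, the standard Elasticsearch health endpoint, rather than the `/status` path used in the steps above; either URL can be passed in.

```shell
#!/bin/sh
# Sketch: retry a command until it succeeds or the attempts run out.

# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  _retry_attempts="$1"
  _retry_delay="$2"
  shift 2
  _retry_i=1
  while [ "$_retry_i" -le "$_retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$_retry_i" -lt "$_retry_attempts" ]; then
      sleep "$_retry_delay"
    fi
    _retry_i=$((_retry_i + 1))
  done
  return 1
}

# Usage (on the host; -f makes curl fail on HTTP error responses):
#   retry 10 3 curl -sf http://localhost:9200/_cluster/health
```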