USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
410 stars, 143 forks

Elasticsearch for Sparkler - Containerization Logic #214

Closed nhandyal closed 3 years ago

nhandyal commented 3 years ago

What changes were proposed in this pull request?

  1. Creates a development container with all toolchains required for contributing to Sparkler. This alleviates the need for developers to configure their local machines appropriately. The dev container is configured to mount your local file system into the container, so the image does not need to be rebuilt each time source files change.
  2. Creates a docker service for sparkler + elasticsearch + kibana. The sparkler service uses the development container introduced above.
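The bind-mount approach described in point 1 can be sketched as follows. This is an illustration only: the image name `sparkler-dev` and the `/data/sparkler-core` mount point are assumptions, not necessarily what this PR builds.

```shell
# Mount the local checkout into the container so edits made on the host
# are visible inside it immediately, without rebuilding the image.
# (Image name and mount path are hypothetical.)
docker run -it --rm \
    -v "$(pwd)/sparkler-core:/data/sparkler-core" \
    sparkler-dev /bin/bash
```

Because the source tree lives on the host, only toolchain changes (JDK, Maven, etc.) require an image rebuild.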

Is this related to an already existing issue on sparkler?
Related to issue 212

How was this patch tested?

development container

sparkler-core/sparkler-deployment/docker/elasticsearch/dockler.py --up

# log into container and run a build + crawl
docker exec -it sparkler-elastic /bin/bash
cd sparkler-core && mvn clean install

docker service with elasticsearch

observe that 3 containers are running: one each for sparkler, elasticsearch, and kibana

docker container ls
CONTAINER ID   IMAGE                                                  COMMAND                  CREATED          STATUS          PORTS                                            NAMES
ed44c41f2a97   elasticsearch_sparkler                                 "/data/start.sh"         23 minutes ago   Up 23 minutes   0.0.0.0:4041->4041/tcp                           sparkler-elastic
37dcb87a9198   docker.elastic.co/kibana/kibana:7.11.1                 "/bin/tini -- /usr/l…"   23 minutes ago   Up 23 minutes   0.0.0.0:5601->5601/tcp                           kibana
2d5bd9ee19f4   docker.elastic.co/elasticsearch/elasticsearch:7.11.1   "/bin/tini -- /usr/l…"   23 minutes ago   Up 23 minutes   0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp   elasticsearch
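The same check can be scripted rather than eyeballed. A small sketch, using the container names from the `docker container ls` output above:

```shell
# Verify the three expected containers are up; docker container inspect
# exits non-zero if a container does not exist.
for name in sparkler-elastic kibana elasticsearch; do
    docker container inspect -f '{{.State.Status}}' "$name"
done
```

Each line of output should read `running`.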

verify the elasticsearch service is reachable from sparkler

log into the sparkler image

docker exec -it sparkler-elastic /bin/bash
curl http://elasticsearch:9200/status

verify the elasticsearch service is reachable from localhost

curl http://localhost:9200/status
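As an aside: Elasticsearch 7.x does not expose a `/status` endpoint, so the curl above may return a 404 even when the service is healthy. The standard liveness checks are the root endpoint and the cluster health API:

```shell
# Basic node info; returns JSON with version details when ES is up.
curl -fsS http://localhost:9200/

# Cluster status (green/yellow/red) via the cluster health API.
curl -fsS "http://localhost:9200/_cluster/health?pretty"
```

Both assume the default 9200 port mapping shown in the `docker container ls` output earlier.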

lewismc commented 3 years ago

When I log into the container I see some issues:

% docker exec -it sparkler-dev /bin/bash
...
/usr/local/bin/greeting.dev.sh: line 3: bad substitution: no closing "`" in ` | '__| |/ / |/ _ \ '__|
      ____) | |_) | (_| | |  |   <| |  __/ |
     |_____/| .__/ \__,_|_|  |_|\_\_|\___|_|
            | |
            |_|

You can access solr at http://localhost:8983/solr when solr is running
You can spark master UI at http://localhost:4041 when spark master is running

Some useful queries:

- Get stats on groups, status, depth:
    http://localhost:8983/solr/crawldb/query?q=*:*&rows=0&facet=true&&facet.field=crawl_id&facet.field=status&facet.field=group&facet.field=discover_depth

Inside docker, you can do the following:

solr - command line tool for administering solr
    start -force -> start solr
    stop -force -> stop solr
    status -force -> get status of solr
    restart -force -> restart solr

sparkler - command line interface to sparkler
   inject - inject seed urls
   crawl - launch a crawl job

build sparkler
    cd sparkler-core && mvn install
lewismc commented 3 years ago

Logging into the container, building Sparkler and restarting Solr all go well. When I attempt to inject, I get a fatal error:

bash-4.2# sparkler inject -id 1 -su 'https://www.jpl.nasa.gov/'
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2021-02-22 18:19:48 INFO  PluginService$:53 - Loading plugins...
2021-02-22 18:19:49 INFO  PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2021-02-22 18:19:49 WARN  PluginService$:65 - 6 extra plugin(s) available but not activated: Set(fetcher-chrome, template-plugin, url-injector, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
2021-02-22 18:19:49 DEBUG PluginService$:68 - Loading urlfilter-regex
2021-02-22 18:19:49 INFO  PluginService$:73 - Extensions found: [edu.usc.irds.sparkler.plugin.RegexURLFilter@74eb909f]
2021-02-22 18:19:49 INFO  PluginService$:76 - Extensions lookup: PluginWrapper [descriptor=PluginDescriptor [pluginId=urlfilter-regex, pluginClass=edu.usc.irds.sparkler.plugin.RegexURLFilterActivator, version=0.2.2-SNAPSHOT, provider=edu.usc.irds.sparkler.plugin, dependencies=[], description=, requires=*, license=null], pluginPath=/data/sparkler-core/bin/../build/plugins/urlfilter-regex-0.2.2-SNAPSHOT.jar].getPluginId
2021-02-22 18:19:49 INFO  PluginService$:77 - Extensions id lookup: edu.usc.irds.sparkler.plugin.RegexURLFilter@74eb909f.getClass.getName
2021-02-22 18:19:49 DEBUG PluginService$:68 - Loading urlfilter-samehost
2021-02-22 18:19:49 INFO  PluginService$:73 - Extensions found: [edu.usc.irds.sparkler.plugin.UrlFilterSameHost@76012793]
2021-02-22 18:19:49 INFO  PluginService$:76 - Extensions lookup: PluginWrapper [descriptor=PluginDescriptor [pluginId=urlfilter-samehost, pluginClass=edu.usc.irds.sparkler.plugin.UrlFilterSameHostActivator, version=0.2.2-SNAPSHOT, provider=edu.usc.irds.sparkler.plugin, dependencies=[], description=, requires=*, license=null], pluginPath=/data/sparkler-core/bin/../build/plugins/urlfilter-samehost-0.2.2-SNAPSHOT.jar].getPluginId
2021-02-22 18:19:49 INFO  PluginService$:77 - Extensions id lookup: edu.usc.irds.sparkler.plugin.UrlFilterSameHost@76012793.getClass.getName
2021-02-22 18:19:49 INFO  PluginService$:82 - Recognised Plugins: Map(urlfilter-regex -> edu.usc.irds.sparkler.plugin.RegexURLFilter, urlfilter-samehost -> edu.usc.irds.sparkler.plugin.UrlFilterSameHost)
2021-02-22 18:19:49 INFO  Injector$:110 - Injecting 1 seeds
2021-02-22 18:19:49 WARN  SolrProxy:43 - Caught Error from server at http://localhost:8983/solr/crawldb: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/crawldb/update</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>

</body>
</html>
 while adding beans, trying to add one by one
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:567)
    at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
    at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/crawldb: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404 Not Found</h2>
<table>
<tr><th>URI:</th><td>/solr/crawldb/update</td></tr>
<tr><th>STATUS:</th><td>404</td></tr>
<tr><th>MESSAGE:</th><td>Not Found</td></tr>
<tr><th>SERVLET:</th><td>default</td></tr>
</table>

</body>
</html>

    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:629)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:265)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:504)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:479)
    at edu.usc.irds.sparkler.service.SolrProxy.commitCrawlDb(SolrProxy.scala:62)
    at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:114)
    at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
    at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
    at edu.usc.irds.sparkler.service.Injector.run(Injector.scala:45)
    at edu.usc.irds.sparkler.service.Injector$.main(Injector.scala:164)
    at edu.usc.irds.sparkler.service.Injector.main(Injector.scala)
    ... 6 more
2021-02-22 18:19:49 WARN  PluginService$:49 - Stopping all plugins... Runtime is about to exit.

The problem is that the crawldb core does not exist.
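For the record, in an image that still ships Solr, the missing core could be created with the stock Solr CLI. A sketch, assuming the `solr` command mentioned in the container greeting is on the PATH and Solr is running:

```shell
# Create the missing crawldb core (uses -force since the container
# runs as root, matching the greeting's start/stop examples).
solr create -c crawldb -force
```

This presumes a suitable configset for the crawldb schema is available to Solr; without one the core will be created with default configs.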

lewismc commented 3 years ago

Also in the above commentary you state

cd sparkler-core
docker-compose --file docker-compose.dev.yml up --detach

However, when I do this I get: bash: docker-compose: command not found. We need to update the documentation to install docker-compose via the correct package manager inside the Docker machine.
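One common way to make docker-compose available, assuming the base image has pip (which may not hold for this image; the official standalone binary download is the alternative):

```shell
# Install the v1 docker-compose CLI from PyPI, then confirm it resolves.
pip install docker-compose
docker-compose --version
```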

nhandyal commented 3 years ago

Updated summary to reflect new changes. Run instructions have changed.

However when I do this I get bash: docker-compose: command not found

Updated the wiki to include a section on installing docker-compose. Also added a message in https://github.com/USCDataScience/sparkler/pull/214/files#diff-7678c21ccfb7190f82e6d92325762b9cbe38d4a894ed92a74a56d70a2a7792cb with a link to the wiki.

The problem is that the crawldb core does not exist.

These changes are meant to get Sparkler working with elasticsearch + kibana, so Solr is not required; accordingly, we removed Solr from the elasticsearch Sparkler image. The inject / crawl commands do not work at the moment because Sparkler has not yet been configured to work with ES.

felixloesing commented 3 years ago

Added a new page to the project's wiki describing how to start the sparkler + elasticsearch + kibana docker-compose network: https://github.com/USCDataScience/sparkler/wiki/Elasticsearch-Backend

lewismc commented 3 years ago

@buggtb @thammegowda any comments?

buggtb commented 3 years ago

Quick scan LGTM +1