commonsearch / cosr-back

Backend of Common Search. Analyses webpages and sends them to the index.
https://about.commonsearch.org
Apache License 2.0

cosr-back

This repository contains the main components of the Common Search backend.

Your help is welcome! We have a complete guide on how to contribute.

Understand the project

This repository has 4 components:

Here is how they fit in our general architecture:

[Figure: General technical architecture of Common Search]

Local install

A complete guide is available in INSTALL.md.

Launching the tests

See tests/README.md.

Using plugins

Common Search supports user-provided plugins in its processing pipeline. Some plugins are included by default. For instance, the following runs the bundled plugins.grep.Words plugin against a single URL:

make docker_shell
spark-submit spark/jobs/pipeline.py --source url:https://about.commonsearch.org/ --plugin plugins.grep.Words:words="common search",output=/tmp/grep_result

See the plugins/ directory for more examples and Analyzing the web with Spark for a complete tutorial.
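
The --plugin value in the command above appears to follow a module.Class:key=value,key=value shape. Below is a minimal sketch, assuming that shape holds in general, of a small Python helper that assembles the same command. The helper name plugin_spec is hypothetical and not part of cosr-back. Escaping rules for values that contain commas are not documented here, so the helper does not attempt them. Note that the double quotes around "common search" in the shell command are consumed by the shell, so the program receives the value without them.

```python
import shlex


def plugin_spec(path, **params):
    """Build a --plugin value such as 'plugins.grep.Words:words=...,output=...'.

    Hypothetical helper: the format is inferred from the README example only.
    """
    if not params:
        return path
    # Keyword arguments keep their insertion order, so the output is stable.
    pairs = ["%s=%s" % (key, value) for key, value in params.items()]
    return path + ":" + ",".join(pairs)


spec = plugin_spec(
    "plugins.grep.Words",
    words="common search",
    output="/tmp/grep_result",
)

# Arguments exactly as spark-submit would receive them (no shell quoting).
argv = [
    "spark-submit", "spark/jobs/pipeline.py",
    "--source", "url:https://about.commonsearch.org/",
    "--plugin", spec,
]

# shlex.join adds whatever quoting a POSIX shell needs for copy-pasting.
command = shlex.join(argv)
print(command)
```

Keeping the arguments in a list and quoting only when printing avoids mistakes with values that contain spaces, such as the search phrase here.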

Launching the explainer

The explainer allows you to debug results easily. Just run:

make docker_explainer

Then open http://192.168.99.100:9703 in your browser (assuming 192.168.99.100 is the IP of your Docker host).
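
The Docker host IP differs between setups. As a rough sketch, assuming your shell exports DOCKER_HOST in the tcp://<ip>:<port> form that Docker Machine typically sets, the hypothetical helper below derives the explainer URL. Port 9703 is taken from the URL above. The fallback to localhost is an assumption for setups where Docker publishes ports on the local machine. None of this is part of cosr-back.

```python
import os
from urllib.parse import urlparse

EXPLAINER_PORT = 9703  # port taken from the URL in this README


def explainer_url(docker_host=None):
    """Return the explainer URL for a DOCKER_HOST value like tcp://192.168.99.100:2376."""
    if docker_host is None:
        docker_host = os.environ.get("DOCKER_HOST", "")
    host = urlparse(docker_host).hostname
    # Assumption: with no TCP host (unset variable or unix socket), use localhost.
    if not host:
        host = "localhost"
    return "http://%s:%d" % (host, EXPLAINER_PORT)


print(explainer_url("tcp://192.168.99.100:2376"))
```

Calling explainer_url() with no argument reads DOCKER_HOST from the current environment instead.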

Launching an index job

make docker_shell
spark-submit spark/jobs/pipeline.py --source commoncrawl:limit=1 --plugin plugins.filter.Homepages:index_body=1 --profile

After this, if you have a cosr-front instance connected to the same Elasticsearch service, you will see the results!

A tutorial is currently being written on this topic.