buzzbangorg / bsbang-crawler

Alpha project for crawling bioschemas JSON-LD
Apache License 2.0


This is the crawler component of Buzzbang, a project that enables applications to find and use Bioschemas markup and provides Google-like search over that markup for humans. See https://github.com/buzzbangorg/buzzbang-doc/wiki for more information.

Usage

These instructions are for Linux. Windows is not supported.

1. Create the intermediate crawl database

./setup/bsbang-setup-sqlite.py <path-to-crawl-db>

Example:

./setup/bsbang-setup-sqlite.py data/crawl.db
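As a rough illustration of what an intermediate crawl database could look like, the sketch below creates an SQLite database with Python's standard sqlite3 module. The table and column names (crawl_queue, extracted_jsonld, url, source, jsonld) are assumptions made for this example and are not necessarily the schema that bsbang-setup-sqlite.py creates.

```python
import sqlite3


def create_crawl_db(path):
    """Create a minimal crawl database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(path)
    with conn:  # commits the transaction if no exception is raised
        # URLs waiting to be fetched, plus where each one came from.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS crawl_queue ("
            "url TEXT PRIMARY KEY, "
            "source TEXT)"
        )
        # Raw JSON-LD text extracted from each fetched page.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS extracted_jsonld ("
            "url TEXT NOT NULL, "
            "jsonld TEXT NOT NULL)"
        )
    return conn


# ":memory:" keeps the example self-contained; pass a file path such as
# "data/crawl.db" to create a database on disk instead.
conn = create_crawl_db(":memory:")
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
print(tables)
```

Keeping the queue and the extracted data in separate tables lets the crawl and extraction steps run independently of each other.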

2. Queue URLs for Bioschemas JSON-LD extraction by adding them directly and crawling sitemaps

./bsbang-crawl.py <path-to-crawl-db> <location>

The location can be:

Example:

./bsbang-crawl.py data/crawl.db conf/default-targets.txt
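To make the sitemap part of this step more concrete, here is a minimal sketch of pulling page URLs out of a sitemap document using only the Python standard library. It follows the sitemaps.org XML format; the function name urls_from_sitemap and the sample URLs are made up for the example, and bsbang-crawl.py may handle sitemaps differently.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org sitemap and sitemap index files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, inlined so the example needs no network access.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/a</loc></url>
  <url><loc>https://example.org/b</loc></url>
</urlset>"""

for url in urls_from_sitemap(sample):
    print(url)
```

In a sitemap index the loc entries point at further sitemaps rather than pages, so a real crawler would need to fetch and parse those as well.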

3. Extract Bioschemas JSON-LD from webpages and insert it into the crawl database.

./bsbang-extract.py <path-to-crawl-db>
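For readers unfamiliar with how JSON-LD is embedded in webpages, the sketch below shows one way to collect the contents of script elements of type application/ld+json using Python's built-in html.parser. The class and function names are hypothetical, the sample page is invented, and this is not necessarily how bsbang-extract.py does it.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip blocks that are not valid JSON


def extract_jsonld(html):
    """Return a list of JSON-LD objects found in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page, inlined so the example needs no network access.
page = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "DataCatalog", "name": "Example catalog"}
</script>
</head><body></body></html>"""

for item in extract_jsonld(page):
    print(item["@type"], item["name"])
```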

Optional: to dump the crawled JSON-LD from the database to a file:

./bsbang-dump.py <path-to-crawl-db> <path-to-save-jsonld>

4. Install Solr.

5. Create a Solr core named 'bsbang'

cd $SOLR/bin
./solr create -c bsbang

6. Run Solr setup

cd $BSBANG
./setup/bsbang-setup-solr.py <path-to-bsbang-config-file> --solr-core-url <URL-of-solr-endpoint>

Example:

./setup/bsbang-setup-solr.py conf/bsbang-solr-setup.xml --solr-core-url http://localhost:8983/solr/testcore/
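For context, Solr exposes a Schema API at the core's /schema endpoint that accepts JSON commands such as add-field. The sketch below builds, but does not send, one such request with the standard library. The field name, the text_general field type (which assumes Solr's default configset), and the helper name build_add_field_request are assumptions for illustration; the fields that bsbang-setup-solr.py actually configures come from its own config file.

```python
import json
import urllib.request


def build_add_field_request(solr_core_url, name, field_type="text_general"):
    """Build (without sending) a Schema API request that adds one field to a core."""
    body = {
        "add-field": {
            "name": name,  # hypothetical field name
            "type": field_type,
            "stored": True,
            "indexed": True,
        }
    }
    return urllib.request.Request(
        solr_core_url.rstrip("/") + "/schema",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_add_field_request("http://localhost:8983/solr/testcore/", "name")
print(req.get_method(), req.full_url)
# To actually apply the change against a running Solr instance:
# urllib.request.urlopen(req)
```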

7. Index the extracted Bioschemas JSON-LD in Solr

./bsbang-index.py <path-to-crawl-db> --solr-core-url <URL-of-solr-endpoint>

Example:

./bsbang-index.py data/crawl.db --solr-core-url http://localhost:8983/solr/testcore/
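To illustrate what indexing involves, the sketch below flattens a JSON-LD object into a simple Solr document and builds a request for Solr's JSON update handler (the core's /update endpoint, with commit=true so the documents become searchable immediately). The field mapping (id, type, name) and the helper names are assumptions for this example, not the mapping used by bsbang-index.py.

```python
import json
import urllib.request


def jsonld_to_solr_doc(url, jsonld):
    """Flatten one JSON-LD object into a flat Solr document (hypothetical fields)."""
    return {
        "id": url,  # the page URL doubles as the unique key
        "type": jsonld.get("@type"),
        "name": jsonld.get("name"),
    }


def build_index_request(solr_core_url, docs):
    """Build (without sending) a request that posts documents to Solr's update handler."""
    return urllib.request.Request(
        solr_core_url.rstrip("/") + "/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


doc = jsonld_to_solr_doc(
    "https://example.org/a",
    {"@context": "http://schema.org", "@type": "DataCatalog", "name": "Example catalog"},
)
req = build_index_request("http://localhost:8983/solr/testcore/", [doc])
print(req.full_url)
# To actually index against a running Solr instance:
# urllib.request.urlopen(req)
```

Sending documents in batches rather than one request per document is usually preferable when indexing a large crawl database.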

Frontend

See https://github.com/justinccdev/bsbang-frontend for a frontend project to the index.

Tests

python3 -m unittest discover

TODO

Future possibilities include:

Any other suggestions are welcome, either as GitHub issues for discussion or as pull requests.

Hacking

Contributions welcome! Please

Thanks!