README

This is the crawler component for Buzzbang, a project that enables applications to find and use Bioschemas markup and provides Google-like search over that markup for humans. Please see https://github.com/buzzbangorg/buzzbang-doc/wiki for more information.

Usage

These instructions are for Linux. Windows is not supported.

1. Create the intermediate crawl database

./setup/bsbang-setup-sqlite.py <path-to-crawl-db>

Example:

./setup/bsbang-setup-sqlite.py data/crawl.db

2. Queue URLs for Bioschemas JSON-LD extraction by adding them directly and crawling sitemaps

./bsbang-crawl.py <path-to-crawl-db> <location>

The location can be:

a sitemap (e.g. http://beta.synbiomine.org/synbiomine/sitemap.xml)
a webpage (e.g. http://identifiers.org or file://test/examples/FAIRsharing.html)
a path to a file listing locations (e.g. conf/default-targets.txt), in which case every location in that file is crawled

Example:

./bsbang-crawl.py data/crawl.db conf/default-targets.txt
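
As a rough illustration of step 2, the sketch below shows one way the three kinds of location could be told apart and how page URLs could be read out of a sitemap. It is not taken from bsbang-crawl.py: the function names classify_location and sitemap_urls are hypothetical, and treating any URL whose path ends in .xml as a sitemap is an assumption made only for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def classify_location(location):
    """Guess the kind of location (hypothetical heuristic, not the crawler's own)."""
    parsed = urlparse(location)
    if parsed.scheme in ("http", "https", "file"):
        # Assumption: a URL whose path ends in .xml is a sitemap.
        return "sitemap" if parsed.path.endswith(".xml") else "webpage"
    # No URL scheme: treat it as a path to a file that lists locations.
    return "path"


def sitemap_urls(sitemap_xml):
    """Return the <loc> URLs found in a sitemap document given as a string."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

With the example locations above, classify_location would label the synbiomine URL a sitemap, the identifiers.org and file:// URLs webpages, and conf/default-targets.txt a path.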

3. Extract Bioschemas JSON-LD from the queued webpages and insert it into the crawl database

./bsbang-extract.py <path-to-crawl-db>

Optionally, to dump the extracted JSON-LD from the crawl database to a file:

./bsbang-dump.py <path-to-crawl-db> <path-to-save-jsonld>
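
For readers unfamiliar with how JSON-LD is embedded in a webpage, the following is a minimal sketch of the extraction idea behind step 3, using only the Python standard library. It is an assumption about the general technique (collecting script elements of type application/ld+json), not the code in bsbang-extract.py; the names JsonLdExtractor and extract_jsonld are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so buffer it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # Skip malformed JSON-LD instead of failing the whole page.


def extract_jsonld(html):
    """Return a list of JSON-LD objects embedded in an HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

A real extractor would also need to decide which schema.org and Bioschemas types to keep; that filtering is left out here.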

4. Install Solr.

5. Create a Solr core named 'bsbang'

cd $SOLR/bin
./solr create -c bsbang

6. Run Solr setup

cd $BSBANG
./setup/bsbang-setup-solr.py <path-to-bsbang-config-file> --solr-core-url <URL-of-solr-endpoint>

Example (this uses a core named 'testcore'; substitute the name of your own core, e.g. 'bsbang'):

./setup/bsbang-setup-solr.py conf/bsbang-solr-setup.xml --solr-core-url http://localhost:8983/solr/testcore/

7. Index the extracted Bioschemas JSON-LD in Solr

./bsbang-index.py <path-to-crawl-db> --solr-core-url <URL-of-solr-endpoint>

Example (again using a core named 'testcore'; substitute the name of your own core):

./bsbang-index.py data/crawl.db --solr-core-url http://localhost:8983/solr/testcore/
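
To make the indexing step more concrete, here is a small sketch of how documents can be sent to a Solr core through Solr's JSON update handler. It only builds the HTTP request; nothing is sent until the request is opened. This is not the implementation in bsbang-index.py: the function name build_solr_update_request is hypothetical, and the document fields in the usage note are invented for illustration.

```python
import json
import urllib.request


def build_solr_update_request(solr_core_url, docs):
    """Build a POST request that adds docs to a Solr core and commits them."""
    # Solr accepts a JSON array of documents on the core's /update endpoint.
    url = solr_core_url.rstrip("/") + "/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

For example, build_solr_update_request("http://localhost:8983/solr/testcore/", [{"id": "1", "name": "Example"}]) yields a request that could then be passed to urllib.request.urlopen against a running Solr instance.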

Frontend

See https://github.com/justinccdev/bsbang-frontend for a frontend to the index.

Tests

python3 -m unittest discover

TODO

Future possibilities include:

Switch to a third-party crawler or third-party components rather than this custom-built one. Please see https://github.com/justinccdev/bsbang-crawler/issues/5
Make the crawler re-crawl periodically.
Understand much more structure (e.g. Dataset elements within a DataCatalog).
Parse other Bioschemas and schema.org types used by life sciences websites (e.g. Organization, Service, Product).
Replace SQLite as the intermediate crawl store with something more scalable (perhaps MongoDB, Cassandra, etc.). This overlaps with the first item: a third-party crawling project would already have solved some, if not all, of the scalability issues.
Crawl and understand PhysicalEntity/BioChemEntity/ResearchEntity once these types mature further.

Any other suggestions are welcome as GitHub issues for discussion or as pull requests.

Hacking

Contributions welcome! Please:

Make pull requests to the dev branch.
Conform to the PEP 8 style guide.

Thanks!
