This is the crawler component for Buzzbang, a project to enable applications to find and use Bioschemas markup, and to provide Google-like search over it for humans. Please see https://github.com/buzzbangorg/buzzbang-doc/wiki for more information.
These instructions are for Linux. Windows is not supported.
1. Create the intermediate crawl database
./setup/bsbang-setup-sqlite.py <path-to-crawl-db>
Example:
./setup/bsbang-setup-sqlite.py data/crawl.db
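As a rough sketch of what this step does (the table and column names below are assumptions for illustration; the real setup script defines its own schema):

```python
import sqlite3

def setup_crawl_db(path):
    # Hypothetical schema: a table of URLs queued for crawling.
    # The real bsbang-setup-sqlite.py defines its own tables.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls "
        "(url TEXT PRIMARY KEY, crawled INTEGER DEFAULT 0)")
    conn.commit()
    return conn

# ":memory:" is used here for demonstration; the real script takes a
# file path such as data/crawl.db.
conn = setup_crawl_db(":memory:")
conn.execute("INSERT INTO urls (url) VALUES (?)", ("http://identifiers.org",))
print(conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0])
```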
2. Queue URLs for Bioschemas JSON-LD extraction by adding them directly and crawling sitemaps
./bsbang-crawl.py <path-to-crawl-db> <location>
The location can be:
- a sitemap (e.g. http://beta.synbiomine.org/synbiomine/sitemap.xml)
- a single page (e.g. http://identifiers.org or file://test/examples/FAIRsharing.html)
- a targets file (e.g. conf/default-targets.txt), which will then crawl all the locations in that file

Example:
./bsbang-crawl.py data/crawl.db conf/default-targets.txt
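The sitemap half of this step can be sketched as follows. The element names and namespace come from the sitemaps.org protocol; the real crawler's internals may differ:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(xml_text):
    # Collect every <loc> entry from a sitemap document; these would
    # then be queued in the crawl database.
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/page1</loc></url>
  <url><loc>http://example.org/page2</loc></url>
</urlset>"""

print(urls_from_sitemap(sitemap))
```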
3. Extract Bioschemas JSON-LD from webpages and insert it into the crawl database.
./bsbang-extract.py <path-to-crawl-db>
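Bioschemas markup is typically embedded in pages as JSON-LD script blocks. A minimal sketch of that extraction, using only the standard library (this is illustrative, not the project's actual extractor):

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    # Collects the contents of <script type="application/ld+json"> blocks.
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.blocks.append(json.loads(data))

# A toy page with one Bioschemas-style JSON-LD block.
page = """<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "DataCatalog", "name": "FAIRsharing"}
</script>
</head><body></body></html>"""

parser = JsonLdExtractor()
parser.feed(page)
print(parser.blocks[0]["@type"])
```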
To dump the crawled JSON-LD data from the database to a file:
./bsbang-dump.py <path-to-crawl-db> <path-to-save-jsonld>
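A sketch of the dump, assuming (hypothetically; the real schema may differ) that the extractor stored one JSON-LD document per row:

```python
import io
import json
import sqlite3

def dump_jsonld(conn, out):
    # Read every stored JSON-LD document and write them out as one
    # JSON array. Table/column name "jsonld" is an assumption.
    docs = [json.loads(row[0])
            for row in conn.execute("SELECT jsonld FROM jsonld")]
    json.dump(docs, out, indent=2)
    return len(docs)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jsonld (jsonld TEXT)")
conn.execute("INSERT INTO jsonld VALUES (?)",
             (json.dumps({"@type": "Dataset", "name": "example"}),))

# StringIO stands in for the <path-to-save-jsonld> output file.
buf = io.StringIO()
print(dump_jsonld(conn, buf))
```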
4. Install Solr.
5. Create a Solr core named 'bsbang'
cd $SOLR/bin
./solr create -c bsbang
6. Run Solr setup
cd $BSBANG
./setup/bsbang-setup-solr.py <path-to-bsbang-config-file> --solr-core-url <URL-of-solr-endpoint>
Example:
./setup/bsbang-setup-solr.py conf/bsbang-solr-setup.xml --solr-core-url http://localhost:8983/solr/bsbang/
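The actual fields come from conf/bsbang-solr-setup.xml; as a sketch of the kind of request a Solr setup step sends, here is how an "add-field" command for Solr's Schema API can be built (the field name and type below are illustrative assumptions):

```python
import json

def add_field_command(name, field_type="text_general", multivalued=True):
    # Builds the JSON body for Solr's Schema API "add-field" command,
    # which would be POSTed to <solr-core-url>/schema.
    return json.dumps({"add-field": {
        "name": name,
        "type": field_type,
        "multiValued": multivalued,
        "indexed": True,
        "stored": True,
    }})

payload = json.loads(add_field_command("description"))
print(payload["add-field"]["type"])
```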
7. Index the extracted Bioschemas JSON-LD in Solr
./bsbang-index.py <path-to-crawl-db> --solr-core-url <URL-of-solr-endpoint>
Example:
./bsbang-index.py data/crawl.db --solr-core-url http://localhost:8983/solr/bsbang/
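Indexing requires flattening nested JSON-LD into flat Solr documents. A hypothetical sketch of one such mapping (the real indexer's field mapping may differ):

```python
def to_solr_doc(jsonld, url):
    # Illustrative flattening: use the source URL as the Solr id, keep
    # the @type, and copy simple string properties while dropping
    # nested structures.
    doc = {"id": url, "type": jsonld.get("@type")}
    for key, value in jsonld.items():
        if not key.startswith("@") and isinstance(value, str):
            doc[key] = value
    return doc

doc = to_solr_doc(
    {"@context": "http://schema.org", "@type": "Dataset",
     "name": "example dataset", "description": "An example."},
    "http://example.org/ds1")
print(doc["id"], doc["type"])
```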
See https://github.com/justinccdev/bsbang-frontend for a frontend to search the index.
To run the tests:
$ python3 -m unittest discover
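Tests discovered by that command follow the standard unittest conventions. A minimal hypothetical example (the real tests live under test/ in this repo):

```python
import unittest

class TestUrlQueue(unittest.TestCase):
    # Hypothetical test case, for illustration only.
    def test_dedup(self):
        # Queuing the same URL twice should leave a single entry.
        urls = []
        for u in ["http://identifiers.org", "http://identifiers.org"]:
            if u not in urls:
                urls.append(u)
        self.assertEqual(len(urls), 1)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestUrlQueue)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```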
There are many future possibilities for this project; suggestions are welcome as Github issues for discussion or as pull requests.
Contributions welcome! Thanks!