A scraper designed to harvest Bioschemas markup, in either JSON-LD or RDFa format, from a set of known web pages. Implementation decisions are discussed here.
There are 3 sub-modules: core, service, and web.
Requirements for core: a Java JDK, Maven, and a Chrome/Chromium browser with a matching chromedriver (see chromiumDriverLocation below).
Additional requirements for service: a relational database in which the URLs to scrape are stored (see persistence.xml below).
Web has no additional requirements, but you will need some way of running WAR files (e.g., a servlet container such as Tomcat).
First, clone the repo to your machine. Core is relied on by both service and web; however, core can also be used in a standalone manner.
The core module provides the core functionality as an abstract class. Additionally, two example classes exist that can be used to scrape either a single given URL or a series of URLs read from a given file. For most purposes the file scraper is likely to be sufficient, and there is no need to explore further. If you follow the instructions below, you will run the file scraper.
To use this:
1/ Default configuration is read from core > src > main > resources > configuration.properties. To override some properties, create the file localconfig.properties in the directory where you will run the application, and give the new values of the properties as needed:
- outputFolder: the folder to which output is written; currently all RDF is saved as NQuads to this folder.
- locationOfSitesFile: location of the file containing the list of URLs you wish to scrape. There is an example in core > src > main > resources > urls2scrape.txt. Note that you can set dynamic or static parsing on a per-URL basis by appending a comma followed by static or dynamic to a URL in the urls2scrape.txt file.
- chromiumDriverLocation: full path to the Chrome driver file. On Windows this will be called chromedriver.exe.
- maxLimitScrape: maximum number of URLs to scrape (defaults to 5).
- schemaContext: path to the Schema.org context file.
- dynamic: boolean setting (true or false) that sets the scraper to dynamic or static markup parsing.

A typical localconfig.properties file for Linux will look like this:
chromiumDriverLocation = /home/username/chrome/chromedriver
locationOfSitesFile = /home/username/bmuse/urls2scrape.txt
outputFolder = /home/username/bmuse/
maxLimitScrape = 100
A typical localconfig.properties file for Windows will look like this:
chromiumDriverLocation = C\:/Users/username/chrome/chromedriver.exe
locationOfSitesFile = C\:/Users/username/bmuse/urls2scrape.txt
outputFolder = C\:/Users/username/bmuse/
maxLimitScrape = 100
2/ Create/edit your file containing the list of URLs to scrape; a sketch of the expected format is shown below.
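For illustration only, a minimal urls2scrape.txt might look like the following: one URL per line, with an optional comma-separated static or dynamic suffix to override the parsing mode for that URL. The URLs shown are hypothetical.

```
https://www.example.org/dataset/1
https://www.example.org/dataset/2,static
https://www.example.org/dataset/3,dynamic
```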
3/ Package with maven: mvn clean package
In the core > target directory you will find two jars. The fat jar is called core-x.x.x-SNAPSHOT.jar and the skinny jar is original-core-x.x.x-SNAPSHOT.jar.
4/ Run the fat jar via maven or the command line: java -jar core-x.x.x-SNAPSHOT.jar
This will run the hwu.elixir.scrape.scraper.examples.FileScrapper main class.
UTF-8 character encoding. On Windows systems, you need to force the UTF-8 charset with -Dfile.encoding=UTF-8:
java -Dfile.encoding=UTF-8 -jar core-x.x.x-SNAPSHOT.jar
Log configuration. You may also override the default log configuration by copying src > main > resources > logback.xml to your own file and running the application as follows:
java -Dlogback.configurationFile=./logback.xml -Dfile.encoding=UTF-8 -jar core-x.x.x-SNAPSHOT.jar
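For reference, a minimal custom logback.xml might look like the sketch below; the appender and pattern shown are illustrative defaults, so it is safer to start from a copy of the bundled logback.xml than from scratch.

```xml
<configuration>
  <!-- Log to the console; the pattern here is illustrative. -->
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- Raise or lower the level (e.g., DEBUG) as needed. -->
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```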
Note: the file localconfig.properties will be saved back with an additional property, contextCounter: an auto-incrementing count of the number of sites scraped. You can reset this count to 0 or simply delete the property from your localconfig.properties file.
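For example, after a run your localconfig.properties might gain a line like the following (the value shown is hypothetical):

```
contextCounter = 12
```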
The service module assumes a database of URLs that need to be scraped. It will collect a list of URLs from the database, scrape them, and write the output to a specified folder. The output will be in NQuads format.
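NQuads is a line-based RDF format in which each line carries a subject, predicate, object, and graph label, followed by a full stop. A single hypothetical output line might look like this:

```
<https://www.example.org/dataset/1> <http://schema.org/name> "Example dataset" <https://www.example.org/dataset/1> .
```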
To use this:
1/ Update the database connection settings in service > src > main > resources > META-INF > persistence.xml (a sketch of the kind of settings involved is shown after this list).
2/ A script to set up the database can be found in service > src > main > resources > setUpDatabaseScript.sql. If you run this before running the program, it will start scraping immediately.
3/ Update service > src > main > resources > applications.properties with the properties required for your setup.
4/ Package with maven: mvn clean package from the top level, i.e., the Scraper folder, not the service folder.
5/ In the service > target directory you will find service.jar. Run it however you wish via maven or the command line, e.g., java -jar service.jar.
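The authoritative connection settings live in the repo's persistence.xml. Purely for orientation, database settings in a JPA persistence.xml generally take a form like the sketch below, where the driver class, JDBC URL, and credentials are hypothetical placeholders for your own values.

```xml
<properties>
  <!-- Hypothetical example values; replace with your own database details. -->
  <property name="javax.persistence.jdbc.driver" value="com.mysql.cj.jdbc.Driver"/>
  <property name="javax.persistence.jdbc.url" value="jdbc:mysql://localhost:3306/bmuse"/>
  <property name="javax.persistence.jdbc.user" value="username"/>
  <property name="javax.persistence.jdbc.password" value="password"/>
</properties>
```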
The web module is still in development, so its use is not recommended. The goal is to provide a small web app that receives a URL as a request and returns the (bio)schema markup from that URL in JSON format.
A project by SWeL funded through Elixir-Excelerate.