$ pip install virtualenv
$ pip install virtualenvwrapper
For further information on installing virtualenv and virtualenvwrapper, see: http://docs.python-guide.org/en/latest/dev/virtualenvs/
$ sudo apt-get install python-pip python-dev build-essential libxml2-dev libxslt1-dev
$ pip install virtualenv
$ sudo pip install virtualenv virtualenvwrapper
$ sudo pip install --upgrade pip
Create a backup of your .bashrc file
$ cp ~/.bashrc ~/.bashrc-org
$ printf '\n%s\n%s\n%s' '# virtualenv' 'export WORKON_HOME=~/virtualenvs' 'source /usr/local/bin/virtualenvwrapper.sh' >> ~/.bashrc
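The printf command above appends the following lines to the end of ~/.bashrc:

```shell
# virtualenv
export WORKON_HOME=~/virtualenvs
source /usr/local/bin/virtualenvwrapper.sh
```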
Enable the virtual environment
$ source ~/.bashrc
$ mkdir -p $WORKON_HOME
$ mkvirtualenv scrapi
To exit the virtual environment
$ deactivate
To enter the virtual environment
$ workon scrapi
Create a GitHub account and fork the scrapi repository to your account.
Install Git
$ sudo apt-get update
$ sudo apt-get install git
$ git clone https://github.com/your-username/scrapi
Postgres is required only if "postgres" is specified in your settings, or if RECORD_HTTP_TRANSACTIONS is set to True.
By far the simplest option is to install the Postgres Mac OS X app:
To instead install via command line, run:
$ brew install postgresql
$ ln -sfv /usr/local/opt/postgresql/*.plist ~/Library/LaunchAgents
$ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist
Or, on Ubuntu:
$ sudo apt-get update
$ sudo apt-get install postgresql
$ sudo service postgresql start
$ sudo -u postgres createuser your-username
$ sudo -u postgres createdb -O your-username scrapi
Inside your scrapi checkout:
$ createdb scrapi
$ invoke apidb
Cassandra is required only if "cassandra" is specified in your settings, or if RECORD_HTTP_TRANSACTIONS is set to True.
Note: Cassandra requires JDK 7.
$ brew install cassandra
Check which version of Java is installed by running the following command:
$ java -version
Use the latest version of Oracle Java 7 on all nodes.
Add the DataStax Community repository to /etc/apt/sources.list.d/cassandra.sources.list:
$ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
Add the DataStax repository key to your APT trusted keys.
$ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -
Install the package.
$ sudo apt-get update
$ sudo apt-get install cassandra
Start Cassandra:
$ cassandra
Or, if you'd like your cassandra session to be bound to your current session, run:
$ cassandra -f
and you should be good to go.
$ sudo apt-get install libpq-dev python-dev
$ pip install -r requirements.txt
Or, if you'd like some nicer testing and debugging utilities in addition to the core requirements, run
$ pip install -r dev-requirements.txt
This will also install the core requirements like normal.
Note: Elasticsearch requires JDK 7.
$ brew install homebrew/versions/elasticsearch17
Install Java
$ sudo apt-get install openjdk-7-jdk
Download and install the Public Signing Key.
$ wget -qO - https://packages.elasticsearch.org/GPG-KEY-elasticsearch | sudo apt-key add -
Add the ElasticSearch repository to your /etc/apt/sources.list.
$ sudo add-apt-repository "deb http://packages.elasticsearch.org/elasticsearch/1.4/debian stable main"
Install the package
$ sudo apt-get update
$ sudo apt-get install elasticsearch
$ sudo service elasticsearch start
$ elasticsearch
Note: if you're developing locally, you do not have to run RabbitMQ!
$ brew install rabbitmq
$ sudo apt-get install rabbitmq-server
Create databases for Postgres and Elasticsearch - only for local development!
$ invoke reset_all
You will need to have a local copy of the settings. Copy local-dist.py into your own version of local.py:
$ cp scrapi/settings/local-dist.py scrapi/settings/local.py
Copy over the api settings:
$ cp api/api/settings/local-dist.py api/api/settings/local.py
If you installed Cassandra, Postgres, and Elasticsearch earlier, you will want to add something like the following configuration to your local.py, based on the databases you have:
RECORD_HTTP_TRANSACTIONS = True # Only if cassandra or postgres are installed
RAW_PROCESSING = ['cassandra', 'postgres']
NORMALIZED_PROCESSING = ['cassandra', 'postgres', 'elasticsearch']
CANONICAL_PROCESSOR = 'postgres'
RESPONSE_PROCESSOR = 'postgres'
For raw and normalized processing, add the databases you have installed. Only add elasticsearch to NORMALIZED_PROCESSING, as it does not have a raw processing module. RAW_PROCESSING and NORMALIZED_PROCESSING are both lists, so you can add as many processors as you wish. CANONICAL_PROCESSOR and RESPONSE_PROCESSOR each take a single processor only.
Note: Cassandra processing will soon be phased out, so we recommend using Postgres for your processing needs. Either one will work for now!
If you'd like to use local storage, you will want to make sure your local.py has the following configuration:
RECORD_HTTP_TRANSACTIONS = False
NORMALIZED_PROCESSING = ['storage']
RAW_PROCESSING = ['storage']
This will save all harvested/normalized files to the directory archive/<source>/<document identifier>
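As an illustrative sketch of that layout (the helper function and example identifier here are hypothetical, not part of scrapi):

```python
import os

def archive_path(source, doc_id, base="archive"):
    # Hypothetical helper: harvested/normalized files land under
    # archive/<source>/<document identifier>
    return os.path.join(base, source, doc_id)

print(archive_path("mit", "oai-example-123"))  # archive/mit/oai-example-123 on POSIX
```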
Note: be careful with this; if you harvest too many documents with the storage module enabled, you could start experiencing inode errors.
If you'd like to be able to run all harvesters, you'll need to register for a PLOS API key, a Harvard Dataverse API Key, and a Springer API Key.
Add the following lines, with your API keys, to your local.py file:
PLOS_API_KEY = 'your-api-key-here'
HARVARD_DATAVERSE_API_KEY = 'your-api-key-here'
SPRINGER_API_KEY = 'your-api-key-here'
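If you'd rather keep keys out of version control, one optional pattern (not required by scrapi) is to read them from the environment in local.py:

```python
import os

# Optional pattern: read each key from the environment, falling back to a
# placeholder if the variable is unset.
PLOS_API_KEY = os.environ.get('PLOS_API_KEY', 'your-api-key-here')
HARVARD_DATAVERSE_API_KEY = os.environ.get('HARVARD_DATAVERSE_API_KEY', 'your-api-key-here')
SPRINGER_API_KEY = os.environ.get('SPRINGER_API_KEY', 'your-api-key-here')
```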
To start the scheduler:
$ invoke beat
To start the worker:
$ invoke worker
Run all harvesters with
$ invoke harvesters
or, just one with
$ invoke harvester harvester-name
For testing local development, running the mit harvester is recommended.
Note: harvester-name is the same as the defined harvester "short name".
Invoke a harvester for a certain start date with the --start or -s argument. Invoke a harvester for a certain end date with the --end or -e argument.
For example, to run a harvester between the dates of March 14th and March 16th 2015, run:
$ invoke harvester harvester-name --start 2015-03-14 --end 2015-03-16
Either --start or --end can also be used on its own. Not supplying arguments will default to starting the number of days specified in settings.DAYS_BACK before today and ending on the current date. If --end is given with no --start, start will default to the number of days specified in settings.DAYS_BACK before the given end date.
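A minimal sketch of that defaulting logic (the DAYS_BACK value here is an assumption for illustration; the real value comes from settings.DAYS_BACK):

```python
from datetime import date, timedelta

DAYS_BACK = 1  # assumed for illustration; scrapi reads this from its settings

def harvest_window(start=None, end=None, days_back=DAYS_BACK):
    # No --end: end defaults to today. No --start: start defaults to
    # days_back days before the (possibly defaulted) end date.
    end = end or date.today()
    start = start or end - timedelta(days=days_back)
    return start, end

print(harvest_window(end=date(2015, 3, 16)))  # (datetime.date(2015, 3, 15), datetime.date(2015, 3, 16))
```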
Writing a harvester for inclusion with scrAPI? If the provider makes their metadata available using the OAI-PMH standard, then autooai is a utility that will do most of the work for you.
To configure scrapi to work in a local OSF dev environment:
Ensure 'elasticsearch' is in the NORMALIZED_PROCESSING list in scrapi/settings/local.py.
Multiple SHARE indices may be used by the OSF. By default, OSF uses the share_v2 index. Activate the share_v2 alias by running:
$ inv alias share share_v2
Note that aliases must be activated before the provider map is generated.
$ inv alias share share_v2
$ inv provider_map
To remove both the share and share_v2 indices from elasticsearch:
$ curl -XDELETE 'localhost:9200/share*'
To run all of the tests in the tests/ directory:
$ invoke test
To run a test on a single harvester, just type
$ invoke one_test shortname
If you're using Anaconda on your system, using pip to install all requirements from scratch from requirements.txt and dev-requirements.txt may result in an ImportError when invoking tests or harvesters.
Example:
ImportError: dlopen(/Users/username/.virtualenvs/scrapi2/lib/python2.7/site-packages/lxml/etree.so, 2): Library not loaded: libxml2.2.dylib
Referenced from: /Users/username/.virtualenvs/scrapi2/lib/python2.7/site-packages/lxml/etree.so
Reason: Incompatible library version: etree.so requires version 12.0.0 or later, but libxml2.2.dylib provides version 10.0.0
To fix, uninstall lxml and then reinstall it so it rebuilds against your system libxml2 (for example, by re-running pip install -r requirements.txt):
$ pip uninstall lxml
Answer found in this stack overflow question and answer
Scrapi supports the addition of institutions in a separate index (institutions). Unlike data stored in the share indices, institution metadata is updated much less frequently, meaning that simple parsers can be used to manually load data from providers instead of using scheduled harvesters.
Currently, data from GRID and IPEDS is supported:

GRID: grid_2015_11_05.json, which can be found here or, for the full dataset, here. To use this dataset, move the file to '/institutions/', or override the file path and/or name in tasks.py. It can be individually loaded using the function grid() in tasks.py.

IPEDS: hd2014.csv, which can be found here by clicking on Survey Data -> Complete data files -> 2014 -> Institutional Characteristics -> Directory information, or can be downloaded directly here. This will give you a file named HD2014.zip, which can be unzipped into hd2014.csv by running unzip HD2014.zip. To use this dataset, move the file to '/institutions/', or override the file path and/or name in tasks.py. It can be individually loaded using the function ipeds() in tasks.py.

Running invoke institutions will properly load institution data into Elasticsearch, provided the datasets are present.
Want to help save science? Want to get paid to develop free, open source software? Check out our openings!