This is the component of Qcumber that scrapes the data off SOLUS, parses it, and generates structured data that the site can then display.
Note: depending on your system, the install commands below may need to be run as root. Prefix them with sudo as required, e.g.

sudo apt-get install ...
This project has been designed to work with Python versions 2.7.x and 3.3.x. You can try other versions, but no promises. Python 3.3.x is recommended.

On Debian/Ubuntu, install it with:

apt-get install python3 python3-dev

or get the source from http://www.python.org/download/ if your distribution doesn't have the correct version of Python available.

Note that the development headers (python3-dev, or python2-dev for 2.7.x) are needed to build the lxml module; if you compile Python from source, they are already included.

The lxml module also needs the libxml2 and libxslt development libraries:
- Debian/Ubuntu: apt-get install libxml2-dev libxslt1-dev
- Fedora/CentOS: yum install libxml2-devel libxslt-devel
- Arch: pacman -S libxml2 libxslt
Use

apt-get install git

to install Git.

Pip is used to install extra Python modules that aren't included by default. A virtual environment is an isolated Python environment; it allows for per-program environment configuration.
Install pip:

apt-get install python3-pip

(or python-pip for 2.7.x users), then install virtualenv:

pip install virtualenv
Fork the repository on GitHub and copy the clone URL from the page; it will look like git@github.com:[yourusername]/qcumber-scraper.git. Then run

git clone [repository]

where [repository] is the URL you copied. This will create a qcumber-scraper folder. Navigate into the qcumber-scraper folder.
- Create a new virtual environment: virtualenv venv
  - If you have multiple versions of Python on your system, make sure to specify the correct one with the -p switch (e.g. virtualenv -p /usr/bin/python3 venv)
- Activate the new environment: source venv/bin/activate
  - NOTE: you will need to activate the virtual environment every time you want to run the local project.
- To deactivate the virtual environment: deactivate
Install the project dependencies (make sure you have activated your virtual environment, see above, before running this command!):

pip install -r requirements.txt
The standard maintenance periods are Tuesdays and Thursdays from 5:00 am to 7:30 am, and Sundays from 5:00 am to 10:00 am. This doesn't seem to be documented anywhere, but if you access the site during a maintenance period it will tell you. You will need to schedule your scrapes around these maintenance times.
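If you automate scrapes, a small shell guard can skip runs that would land in a maintenance window. This is only a sketch: the windows are hard-coded from the schedule above, and the assumption is that they haven't changed.

```shell
#!/bin/sh
# Return success (0) if the given time falls inside a SOLUS maintenance
# window: Tue/Thu 5:00-7:30 am, Sun 5:00-10:00 am.
in_maintenance() {
    day="$1"    # day of week, 1=Mon .. 7=Sun (as printed by `date +%u`)
    mins="$2"   # minutes since midnight
    case "$day" in
        2|4) [ "$mins" -ge 300 ] && [ "$mins" -lt 450 ] ;;  # Tue/Thu 5:00-7:30
        7)   [ "$mins" -ge 300 ] && [ "$mins" -lt 600 ] ;;  # Sun 5:00-10:00
        *)   return 1 ;;
    esac
}

day=$(date +%u)
h=$(date +%H); m=$(date +%M)
mins=$(( ${h#0} * 60 + ${m#0} ))   # strip leading zero to avoid octal parsing

if in_maintenance "$day" "$mins"; then
    echo "SOLUS maintenance window - skipping scrape"
else
    echo "clear to scrape"
fi
```

Wrap the scrape invocation in the `else` branch (or call the function from cron) to avoid hitting the site while it is down.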
To run a full scrape:

python main.py

To scrape textbook data:

python textbooks.py
For better logging and debugging later, it is recommended to redirect the output to log files. Create the logs directory first (mkdir -p logs), then run something like:

python main.py >logs/debug.log 2>logs/error.log

To watch the logs as they happen, first open 2 other terminals, and run tailf logs/debug.log in one and tailf logs/error.log in the other (tail -f also works). Then start the main scrape command as above.
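The redirection above splits stdout (normal output) and stderr (errors) into separate files. A tiny stand-in demo of the pattern, where the echo lines are placeholders for a real scrape run:

```shell
# Create the log directory up front; the redirection won't create it for you.
mkdir -p logs

# Stand-in for `python main.py`: one line to stdout, one to stderr.
# (The messages are made up for illustration.)
{ echo "fetching term list"; echo "login failed" >&2; } \
    >logs/debug.log 2>logs/error.log

cat logs/debug.log   # -> fetching term list
cat logs/error.log   # -> login failed
```

Because the two streams go to different files, errors stay easy to spot in logs/error.log even when the debug log is huge.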