CDPedia

CDPedia is a project to make Wikipedia accessible offline.

See also the official page for normal humans.

How to create an image

Everything is automated nowadays, but you need to make sure that there is a configuration for the image type you want to produce.

For example, let's suppose you want to create a DVD version of the Spanish Wikipedia. In that case, you need to make sure that there is a configuration for es in the languages.yaml file, and for dvd (in the es section) in the imagtypes.yaml file.
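As a rough illustration of that check, here is a minimal sketch. It assumes that languages.yaml is keyed by language code and that imagtypes.yaml is keyed by language code and then by image type; the real layout of those files may differ. The has_config helper and the sample dictionaries are hypothetical, and stand in for the already-parsed YAML contents.

```python
def has_config(languages, imagtypes, lang, image_type):
    """Return True if both (assumed) config mappings cover the requested build."""
    if lang not in languages:
        return False
    # An empty or missing language section means no image types are defined.
    return image_type in (imagtypes.get(lang) or {})


# Hypothetical, already-parsed contents of the two YAML files.
languages = {"es": {}}
imagtypes = {"es": {"dvd": {}, "tarbig": {}}}

print(has_config(languages, imagtypes, "es", "dvd"))
print(has_config(languages, imagtypes, "pt", "dvd"))
```

If the check fails, add the missing section to the corresponding file before running the CDPetron.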

The next step is to run the CDPetron, for which you first need to create and activate a virtualenv:

virtualenv --python=python3 venv
source venv/bin/activate
pip install -r requirements-dev.txt  

Then run the CDPetron itself:

./cdpetron.py /opt/somedir es

Alternatively, just use fades to deal with the virtualenv automatically:

fades -r requirements-dev.txt cdpetron.py /opt/somedir es

The first parameter is the directory where everything dumped from the web will go (pages, images, etc.; be sure you have a lot of free space!), and the second one is the language.

In those examples the CDPetron will produce the tarbig image type, which includes all the articles and most of the pictures. To build a different image type, use the --image-type option (remember that it needs to be defined in the imagtypes.yaml file):

./cdpetron.py /opt/somedir es --image-type dvd5

If the process is interrupted (a full CDPedia generation may take days), don't despair! The generation process checks what is already done to avoid doing it again, so even if it doesn't resume exactly from where it was interrupted, a lot of time is saved in subsequent runs. That said, you may want to use the --no-lists and --no-scrap options to avoid getting fresh info to work on.
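To make the rerun concrete, the following sketch only assembles a CDPetron command line from the options mentioned so far; it does not execute anything. The cdpetron_command helper and its rerun parameter are hypothetical names used for illustration, while the flags themselves are the ones described in this document.

```python
import shlex


def cdpetron_command(dump_dir, lang, image_type=None, rerun=False):
    """Build the argument list for a cdpetron.py invocation."""
    cmd = ["./cdpetron.py", dump_dir, lang]
    if image_type is not None:
        cmd += ["--image-type", image_type]
    if rerun:
        # Reuse what was already fetched instead of getting fresh info.
        cmd += ["--no-lists", "--no-scrap"]
    return cmd


# Command for retrying an interrupted dvd5 build.
print(shlex.join(cdpetron_command("/opt/somedir", "es", image_type="dvd5", rerun=True)))
# prints: ./cdpetron.py /opt/somedir es --image-type dvd5 --no-lists --no-scrap
```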

Dependencies

The following programs are used by the build process (they need to be installed beforehand):

pngquant
pip
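A quick way to confirm that those programs are available before starting a long build is to look them up on the PATH. This is a minimal sketch using only the Python standard library; missing_programs is a hypothetical helper, not part of the project.

```python
import shutil

# Programs listed above as required by the build process.
REQUIRED_PROGRAMS = ["pngquant", "pip"]


def missing_programs(programs):
    """Return the names of the programs that cannot be found on PATH."""
    return [name for name in programs if shutil.which(name) is None]


missing = missing_programs(REQUIRED_PROGRAMS)
if missing:
    print("Missing programs:", ", ".join(missing))
else:
    print("All required programs are installed.")
```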

Some helpers (like run or test) use fades to deal with virtualenvs automatically, so it's a good idea to have it installed too.

No further dependencies are needed by the final running CDPedia; the creation process already manages its Python dependencies.

Quick image creation

If you're just developing and want to do a quick test, you can run the CDPetron with the --test-mode option, and it will not dump everything from the web, just some pages.

./cdpetron.py /opt/somedir es --test-mode

Also, there are several parameters, like --no-lists, --no-scrap, and --no-clean, which help you avoid redoing everything on every test cycle. Run the CDPetron with --help for info about those.

Creating an image with specific pages

The cdpetron script has two options to help test specific pages when developing.

These are not to be confused with --test-mode, which builds a small functional CDPedia, but with the first 1000 pages, not the ones you want to check. That said, the best way to use these options is together with --test-mode, which makes everything faster.

The first one is --extra-pages, which allows you to specify a file with a list of pages to be downloaded.

The second one is --page-limit, which limits the number of pages to download/scrape.

For example:

./cdpetron.py /opt/somedir es --test-mode --extra-pages=/tmp/extra.txt --page-limit=50
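The file passed to --extra-pages could be prepared with a few lines of Python. Note that the file format is not described here, so this sketch assumes plain text with one page title per line; check the project before relying on it. The write_extra_pages helper and the sample titles are hypothetical.

```python
import tempfile
from pathlib import Path


def write_extra_pages(path, titles):
    """Write page titles to a text file, one per line (assumed format)."""
    Path(path).write_text("\n".join(titles) + "\n", encoding="utf-8")


# Hypothetical titles; replace them with the pages you want to check.
titles = ["Argentina", "Python"]
extra_file = Path(tempfile.gettempdir()) / "extra.txt"
write_extra_pages(extra_file, titles)
print(extra_file.read_text(encoding="utf-8"))
```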

How to add a new language

CDPedia is multilanguage, so you can generate it in Spanish, Portuguese, German, or whatever, with the only condition that there is an online Wikipedia for that language.

Currently in the project everything is set up for the following languages:

You can add the proper structures for another language of your preference, and generate the CDPedia for that language. We encourage you to submit a PR with those structures for the new language so they are available for everybody else, thanks!

So, to add a new language you need to take care of several things: the language configuration (imagtypes.yaml and languages.yaml), the service texts for the running CDPedia, and the project web page for the public.

Let's see these in detail (you can see current files for real life examples):