Topic Contribs

Module for analyzing contributions to a topic on Wikipedia.

installation

git clone https://github.com/WikiEducationFoundation/TopicContribs.git
cd TopicContribs
python3 setup.py install

usage

> python3 -m topics.cmdline
cmdline
Usage:
    cmdline --dumps=<path_to_dumps> --out=<path_to_output_dir>
            [--apm=<article_project_path>] [--pl=<project_list_path>]
            [--threads=<num_threads>]
            [--verbose] [<cohort_file> ... ]
    cmdline (-h | --help)
Options:
    --dumps=<path_to_dumps>      Directory containing the metadata dumps
    --out=<path_to_output_dir>   Directory in which to put output files
    --apm=<article_project_path> Path to a csv of page_id project_name pairs.
    --pl=<project_list_path>     Path to a csv with all project_name's that you
                                    would like to be included in the count.
    --threads=<num_threads>      Number of threads to be used. All available
                                    will be used if not specified.
    <cohort_file>                File containing usernames of interest.
    -v, --verbose                Generate verbose output.

Input files

`path_to_dumps`

These must be full history dumps.

For minimal size and maximal parallelization use <wiki>-<date>-stub-meta-history<number>.xml.gz
If you want to use a single file <wiki>-<date>-stub-meta-history.xml.gz
If you already have the full text history dumps downloaded and you feel like using them <wiki>-<date>-pages-meta-history<number>.xml-<page_range>.bz2 will work.

You can use mwdumps to download the latest set of dumps: https://github.com/kjschiroo/python-mwdumps

python3 -m mwdumps.cmdline --wiki=enwiki -v /path/to/save/dumps

`article_project_path`

This file provides a map between articles and the projects they are included in. We expect it to be a .csv following the format

<page_id>,<project_name>

Generating this file

This file can be produced by running sql/page_project_map.sql on wmflabs and replacing <user_database> with your user database.

`project_list_path`

This is a file listing all of the project names we are interested in. The names must match those in the project_name column of the article_project_path file in order for the corresponding pages to be counted.

`cohort_file`

A file or set of files listing the usernames of the users we are interested in tracking. If multiple are used then each will be summed separately and output to a separate output file.

Output files

We will output one timeseries file for each cohort_file and one extra general file for all activity.

`topicutils`

You can use topicutils.tsvToCsv -i <input.tsv> -o <output.csv> to convert a .tsv generated by the wmflabs databases to a .csv.

WikiEducationFoundation / TopicContribs

readme