This extension provides several helpful functionalities for OpenRefine users who want to edit (structured data of) media files (images, videos, PDFs...) on Wikimedia Commons. For more info, documentation and how-tos about OpenRefine for Wikimedia Commons, see https://commons.wikimedia.org/wiki/Commons:OpenRefine.
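Features included in this extension: starting an OpenRefine project from the files in one or more Wikimedia Commons categories, optionally with columns for each file's M-id and Commons categories, and the GREL commands extractFromTemplate and value.extractCategories for extracting information from Wikitext.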
This extension works with OpenRefine 3.6.x and later. It is not compatible with OpenRefine 3.5.x or earlier, because OpenRefine only supports editing Wikimedia Commons from version 3.6 onward.
This extension was first released in October 2022. It has been funded by a Wikimedia project grant.
Download the .zip file of the latest release of this extension. Unzip this file and place the unzipped folder in your OpenRefine extensions folder. Read more about installing extensions in OpenRefine's user manual.
When this extension is installed correctly, you will see the additional option 'Wikimedia Commons' when starting a new project in OpenRefine.
After installing this extension, click the 'Wikimedia Commons' option to start a new project in OpenRefine. You will be prompted to add one or more Wikimedia Commons categories.
There's no need to type the Category: prefix.
You can specify category depth by typing or selecting a number in the input field after each category. Depth 0 means only files from the current category level; depth 1 will retrieve files from one sub-category level down, etc.
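To make the depth setting concrete, here is a minimal Python sketch of a depth-limited category walk. It is only an illustration, not the extension's actual code: the category tree, the file names and the function files_up_to_depth are all hypothetical, and it assumes that a depth of N returns the files of the category itself plus those of its sub-categories up to N levels down.

```python
from collections import deque

# Hypothetical in-memory data; the real extension queries Wikimedia Commons.
SUBCATEGORIES = {
    "Birds": ["Owls", "Parrots"],
    "Owls": ["Barn owls"],
}
FILES = {
    "Birds": ["File:Bird.jpg"],
    "Owls": ["File:Owl.jpg"],
    "Parrots": ["File:Parrot.jpg"],
    "Barn owls": ["File:Barn owl.jpg"],
}


def files_up_to_depth(category, depth):
    """Collect files from `category` and its sub-categories down to `depth` levels."""
    seen = {category}
    queue = deque([(category, 0)])
    found = []
    while queue:
        current, level = queue.popleft()
        found.extend(FILES.get(current, []))
        if level < depth:
            for child in SUBCATEGORIES.get(current, []):
                if child not in seen:  # a category may be reachable via several parents
                    seen.add(child)
                    queue.append((child, level + 1))
    return found
```

With this sample data, depth 0 yields only the file in Birds, while depth 1 adds the files in Owls and Parrots.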
Next, in the project preview screen ('Configure parsing options'), you can choose to also include a column with each file's M-id (unique MediaInfo identifier) and/or Commons categories.
File names will already be reconciled when your project starts.
When you load larger categories (thousands of files) in a new project, OpenRefine will start slowly and will give you a memory warning. This is a known issue. Wait for a bit; the project will eventually start. The Commons Extension has been tested with a project of more than 450,000 files.
The Wikimedia Commons Extension also enables two dedicated GREL commands, which help to extract specific information from the Wikitext of Wikimedia Commons files. (GREL, General Refine Expression Language, is a dedicated scripting language used in OpenRefine for many flexible data operations. For a general reference on using GREL in OpenRefine, see https://docs.openrefine.org/manual/grelfunctions.)
First, retrieve the Wikitext of the Commons files in your project. In the column menu of the column with reconciled file names, select 'Edit column' > 'Add column from reconciled values...' and select 'Wikitext' in the resulting dialog window.
From this new Wikitext column, you can now extract values and categories as described below. Start by selecting 'Edit column' > 'Add column based on this column...' in the column menu. In the next dialog window, you can use the following GREL commands:
extractFromTemplate
Use the following syntax:
extractFromTemplate(value, "BHL", "source")[0]
where you replace BHL with the name of the template (without curly brackets) and source with the parameter from which you want to extract the value. This GREL syntax will return the first (and usually the only) value of said parameter, e.g. https://www.flickr.com/photos/biodivlibrary/10329116385.
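For readers who want to see what kind of result to expect, the following Python sketch imitates the idea on a small piece of Wikitext. It is a rough approximation under stated assumptions, not how the extension implements the GREL command: the function extract_from_template is hypothetical, the title parameter in the sample is invented, and the regular expression ignores nested templates and pipes inside links.

```python
import re


def extract_from_template(wikitext, template, parameter):
    """Return the values of `parameter` in every `{{template|...}}` call in `wikitext`.

    Simplification: templates nested inside the template body are not handled.
    """
    values = []
    # Match {{Template|...}} where the body contains no further braces.
    pattern = re.compile(r"\{\{\s*" + re.escape(template) + r"\s*\|([^{}]*)\}\}")
    for body in pattern.findall(wikitext):
        for part in body.split("|"):
            name, sep, value = part.partition("=")
            if sep and name.strip() == parameter:
                values.append(value.strip())
    return values


# Sample Wikitext; the `title` parameter is made up for illustration.
wikitext = (
    "{{BHL\n"
    "| source = https://www.flickr.com/photos/biodivlibrary/10329116385\n"
    "| title = Example\n"
    "}}"
)
first_source = extract_from_template(wikitext, "BHL", "source")[0]
```

As in the GREL expression, the function returns a list and the [0] index picks the first value.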
value.extractCategories
Use the following syntax:
value.extractCategories().join('#')
This GREL syntax will return all categories mentioned in the Wikitext, separated by the # character, which you can then use to split the resulting cell further as needed.
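The join-then-split pattern can be sketched in Python as follows. This is an illustration only: extract_categories is a hypothetical stand-in for the GREL command, and it assumes that category names are returned without the Category: prefix and without sort keys, which may differ from what the extension returns.

```python
import re

# Matches [[Category:Name]] and [[Category:Name|sort key]].
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def extract_categories(wikitext):
    """Return the category names linked in `wikitext`, in order of appearance."""
    return CATEGORY_LINK.findall(wikitext)


wikitext = "Some text [[Category:Owls]]\n[[Category:Birds of Europe|Owl]]"
joined = "#".join(extract_categories(wikitext))  # one cell value
split_again = joined.split("#")                  # back to separate values
```

The # separator works here because it cannot appear in a page title, so splitting on it never cuts a category name in two.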
To build the extension from source, run mvn package. This creates a zip file in the target folder, which can then be installed in OpenRefine.
To avoid having to unzip the extension into the extensions directory every time you want to test it, you can use a different setup: create a symbolic link from your OpenRefine extensions folder to the local copy of this repository. With this setup, you do not need to run mvn package when making changes to the extension, but you will still need to compile it with mvn compile if you change Java files, and to restart OpenRefine after changing any file.
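The symbolic link can be created with your operating system's usual tools; as one possible approach, here is a small Python sketch. The function link_extension, the link name commons and both directory arguments are assumptions made for illustration; substitute the actual locations of your repository copy and of your OpenRefine extensions folder.

```python
import os
from pathlib import Path


def link_extension(repo_dir, extensions_dir, link_name="commons"):
    """Create extensions_dir/link_name as a symbolic link pointing at repo_dir."""
    repo = Path(repo_dir).resolve()
    link = Path(extensions_dir) / link_name
    if link.is_symlink():
        link.unlink()  # replace a stale link left over from an earlier checkout
    os.symlink(repo, link, target_is_directory=True)
    return link
```

On Windows, creating symbolic links may require administrator rights or developer mode.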
To release a new version of this extension:
1. Make sure you are on the master branch and it is up to date (git pull).
2. Open pom.xml and set the version to the desired version number, such as <version>0.1.0</version>.
3. Add a tag for the release: git tag -a v0.1.0 -m "Version 0.1.0" (when working from GitHub Desktop, add the v0.1.0 tag manually with the description Version 0.1.0).
4. Push the tag: git push --tags (in GitHub Desktop, just push again).
5. Build the release package: mvn package.
6. Attach the zip file produced by the build to the release (it is in the target subfolder of your local copy of the repository).
7. Open pom.xml and set the version to the expected next version number, followed by -SNAPSHOT. For instance, if you just released 0.1.0, you could set <version>0.1.1-SNAPSHOT</version>.
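The version bookkeeping in these steps can be sketched in Python as below. The helper names next_snapshot and release_commands are hypothetical, and the sketch assumes plain three-part version numbers such as 0.1.0 and a patch-level bump for the next development version; it only builds the command lists and does not run them.

```python
def next_snapshot(released):
    """Return the following patch version with a -SNAPSHOT suffix."""
    major, minor, patch = (int(part) for part in released.split("."))
    return f"{major}.{minor}.{patch + 1}-SNAPSHOT"


def release_commands(version):
    """Return the tag, push and build commands for `version` as argument lists."""
    return [
        ["git", "tag", "-a", f"v{version}", "-m", f"Version {version}"],
        ["git", "push", "--tags"],
        ["mvn", "package"],
    ]
```

Each argument list could be passed to subprocess.run once the version in pom.xml has been updated.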