Name source preprocessing for the ALA taxonomic index
Taxonomic Data Retrieval and Processing

Generally, you will be working off a base directory with a number of sub-directories that hold input, output, configuration and working data.

Config directories

By default <base>/config. These directories hold configuration files for use during processing, for example taxonomic status translation maps. Source-specific or style-specific configuration can be held in sub-directories, with each directory searched from most- to least-specific. For example, APC loads will search config/APC, config/NSL, and config in order when looking for configuration files.

See data/config for the ALA configuration.
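The most- to least-specific search described above can be sketched as follows. This is only an illustration, not the project's actual lookup code: the function name `find_config` and the file name `taxonomic_status_map.csv` are hypothetical, and `/data/naming` is simply the base directory used in the example later in this document.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_config(filename: str, search_dirs: Iterable[Path]) -> Optional[Path]:
    """Return the first existing file called `filename`, or None.

    `search_dirs` must be ordered from most- to least-specific, so a
    source-specific file shadows a more general one with the same name.
    """
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


# Hypothetical search path for an APC load: config/APC, config/NSL, config.
base = Path("/data/naming")
apc_search_path = [
    base / "config" / "APC",
    base / "config" / "NSL",
    base / "config",
]

# The file name is an assumption made for illustration only.
status_map = find_config("taxonomic_status_map.csv", apc_search_path)
```

With this arrangement, a file placed in config/APC overrides one of the same name in config/NSL or config, while files that exist only in config remain visible to every source.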

Input directories

By default, <base>/input. Each data source has a subdirectory containing its input data. For example, the input directory for CAAB data is input/CAAB.

Output directories

By default, <base>/output. Each data source has a subdirectory containing the resulting DwCA. For example, the output directory for CAAB data is output/CAAB.

Working directories

Directories that hold intermediate results, error output and execution graphs. By default, <base>/work. Each data source has a subdirectory for its work files. For example, the working directory for AFD data is work/AFD.
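As a rough illustration of the default layout, the per-source directories can be derived from the base directory and a source name. The `SourceDirs` class and `source_dirs` function below are hypothetical names used for this sketch; they are not part of the tool.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SourceDirs:
    """Default per-source directories under a base directory."""
    input: Path
    output: Path
    work: Path


def source_dirs(base: Path, source: str) -> SourceDirs:
    """Build the default input, output and work paths for one source."""
    base = Path(base)
    return SourceDirs(
        input=base / "input" / source,
        output=base / "output" / source,
        work=base / "work" / source,
    )


# Example: the default directories for CAAB data under /data/naming.
caab = source_dirs(Path("/data/naming"), "CAAB")
print(caab.input, caab.output, caab.work)
```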

List of sources

The configuration directory should contain a list of taxonomic sources, called sources.csv. The sources file is a CSV table with information about where to get data from and how to process it. The columns in the file are:

Organisation information about the publisher is held in a file called ala-metadata.csv, placed in the config directory. This file follows the CollectorySchema. The publisher information is added to the EML metadata file.

Running does the required heavy lifting, running

For example:

./venv/bin/python ./ -d /data/naming -x --only afd,apc

Use ./venv/bin/python ./ -h for a list of options.


Locations are derived from the Getty Institute Thesaurus of Geographic Names (TGN) under the Open Data Commons Attribution Licence (ODC-By) 1.0.

Data from the TGN can be downloaded in XML format. The files can then be converted into a table using an XSLT script and interpreted by the location program. The downloaded XML files come in a number of parts; you may need to edit them to ensure consistent namespaces, and process them incrementally. In shell script:

# Write the CSV header, then append the rows extracted from each TGN part
echo "locationID,parentLocationID,name,preferredName,otherNames,iso2,iso3,currency,type,decimalLatitude,decimalLongitude" > locations.csv
for f in TGN*.xml
do
  echo "$f"
  xsltproc /path/to/name-preprocessing/data/tgn.xslt "$f" >> locations.csv
done

The script assumes that you're interested in English names; edit as required. You may also need to do a little judicious editing so that embedded quotes are handled properly.
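For reference when editing by hand, the sketch below shows what a correctly escaped row looks like in standard CSV: a field containing a double quote or a comma is wrapped in quotes, and each embedded quote is doubled. It uses Python's standard `csv` module; the helper name `to_csv_line` and the sample row are invented for illustration and are not real TGN data.

```python
import csv
import io

# Column order taken from the header written to locations.csv.
HEADER = [
    "locationID", "parentLocationID", "name", "preferredName", "otherNames",
    "iso2", "iso3", "currency", "type", "decimalLatitude", "decimalLongitude",
]


def to_csv_line(values):
    """Format one row as a CSV line, escaping quotes and commas as needed."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(values)
    return buffer.getvalue()


# Invented sample row: the name contains embedded double quotes and
# otherNames contains a comma, so both fields need quoting.
row = ["1", "", 'The "Example" Place', "Example Place", "Ex, Place",
       "", "", "", "region", "-12.5", "131.0"]

line = to_csv_line(row)
print(line, end="")

# Reading the line back recovers the original values unchanged.
assert next(csv.reader([line])) == row
```

Comparing a problem line in locations.csv against output of this form should show where a quote needs doubling or a field needs wrapping.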