AtlasOfLivingAustralia / name-preprocessing

Name source preprocessing for the ALA taxonomic index

Taxonomic Data Retrieval and Processing

This library contains

Configuration

Generally, you will be working off a base directory with a number of sub-directories that hold input, output, configuration and working data.

Config directories

By default, <base>/config. These directories hold configuration files used during processing, for example taxonomic status translation maps. Source-specific or style-specific configuration can be held in sub-directories, which are searched from most to least specific. For example, APC loads search config/APC, config/NSL, and config, in that order, when looking for configuration files.

See data/config for the ALA configuration.
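The most-to-least-specific search described above can be sketched as follows. This is a minimal illustration, not the library's own lookup code: the function name find_config_file and the file name status_map.csv are hypothetical, and the search order is passed in explicitly rather than derived from the source.

```python
from pathlib import Path
from typing import Optional, Sequence


def find_config_file(base: Path, search: Sequence[str], filename: str) -> Optional[Path]:
    """Return the first existing match for filename, trying each directory in order.

    `search` lists directories relative to `base`, most specific first.
    (Hypothetical helper; the library's real lookup may differ.)
    """
    for sub in search:
        candidate = base / sub / filename
        if candidate.is_file():
            return candidate
    # Nothing matched in any of the directories
    return None


# Search order for an APC load, as given in the text above
APC_SEARCH = ["config/APC", "config/NSL", "config"]
```

Calling find_config_file(Path("/data/naming"), APC_SEARCH, "status_map.csv") would return the copy in config/APC if one exists, then fall back to config/NSL, then to config.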

Input directories

By default, <base>/input. Each data source has a subdirectory containing input data. For example, the input directory for CAAB data is input/CAAB.

Output directories

By default, <base>/output. Each data source has a subdirectory containing the resulting DwCA. For example, the output directory for CAAB data is output/CAAB.

Working directories

Directories that hold intermediate results, error output and execution graphs. By default, <base>/work. Each data source has a subdirectory for work files. For example, the working directory for AFD data is work/AFD.
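As a quick reference, the default layout described in the sections above can be written out in a few lines of Python. This is only a sketch of the defaults as stated here: the function name source_directories is hypothetical, and the library may resolve or override these paths differently.

```python
from pathlib import Path
from typing import Dict


def source_directories(base: Path, source: str) -> Dict[str, Path]:
    """Map each role to its default directory for one data source.

    Hypothetical helper that mirrors the defaults described in the text:
    a shared config directory, plus per-source input, output and work
    sub-directories under the base directory.
    """
    return {
        "config": base / "config",
        "input": base / "input" / source,
        "output": base / "output" / source,
        "work": base / "work" / source,
    }
```

For example, source_directories(Path("/data/naming"), "CAAB")["input"] gives /data/naming/input/CAAB.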

List of sources

The configuration directory should contain a list of taxonomic sources, called sources.csv. The sources file is a CSV table with information about where to get data from and how to process it. The columns in the file are:

Organisation information about the publisher is held in a file called ala-metadata.csv, placed in the config directory. This follows the CollectorySchema. The publisher information is added to the EML metadata file.

Running

all.py does the required heavy lifting, running

For example:

./venv/bin/python ./all.py -d /data/naming -x --only afd,apc

Use ./venv/bin/python ./all.py -h for a list of options.

Locations

Locations are derived from the Getty Institute Thesaurus of Geographic Names (TGN), http://vocab.getty.edu/, under the Open Data Commons Attribution Licence (ODC-By) 1.0, https://opendatacommons.org/licenses/by/1-0/

Data from the TGN can be downloaded in XML format from http://tgndownloads.getty.edu/ and converted into a table by an XSLT script. The location program then interprets the resulting table. The downloaded XML files come in a number of parts; you may need to edit them to ensure consistent namespaces, and to process them incrementally. In shell script:

echo "locationID,parentLocationID,name,preferredName,otherNames,iso2,iso3,currency,type,decimalLatitude,decimalLongitude" > locations.csv
for f in TGN*.xml
do
    echo "$f"
    xsltproc /path/to/name-preprocessing/data/tgn.xslt "$f" >> locations.csv
done

The script assumes that you're interested in English names; edit as required. You may also need to edit the output by hand so that embedded quotes are handled properly.
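One way to find rows that may need this hand editing is to parse the generated file with Python's csv module and flag any row whose field count differs from the 11 columns in the header. This is a rough sketch rather than part of the library: the function name find_malformed_rows is hypothetical, and the check only catches quoting problems that change the number of fields.

```python
import csv
from typing import Iterable, List, Tuple

# Number of column names in the header written by the shell loop above
EXPECTED_COLUMNS = 11


def find_malformed_rows(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (row number, field count) for each row with an unexpected field count.

    Row numbers count parsed CSV rows from 1 (the header is row 1); a quoted
    field spanning several physical lines still counts as a single row.
    """
    problems = []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != EXPECTED_COLUMNS:
            problems.append((row_number, len(row)))
    return problems
```

To check a real file, pass an open handle, for instance open("locations.csv", newline="", encoding="utf-8"), and inspect the rows it reports.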