This library contains tools for converting taxonomic source data into Darwin Core Archives (DwCAs).
Generally, you will be working off a base directory with a number of sub-directories that hold input, output, configuration and working data.
By default, the configuration directory is `<base>/config`. These directories hold configuration files for use during processing, for example taxonomic status translation maps. Source-specific or style-specific configuration can be held in sub-directories, with each directory searched from most- to least-specific. For example, APC loads will search `config/APC`, `config/NSL` and `config`, in that order, when looking for configuration files. See `data/config` for the ALA configuration.
By default, the input directory is `<base>/input`. Each data source has a subdirectory containing input data. For example, the input directory for CAAB data is `input/CAAB`.
By default, the output directory is `<base>/output`. Each data source has a subdirectory containing the resulting DwCA. For example, the output directory for CAAB data is `output/CAAB`.
By default, the working directory is `<base>/work`. These directories hold intermediate results, error output and execution graphs. Each data source has a subdirectory for depositing work files; for example, the working directory for AFD data is `work/AFD`.
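The three directories above all follow the same per-source convention, which can be sketched as a small helper. The function name here is illustrative, not part of the library:

```python
from pathlib import Path

def source_dirs(base, source_id):
    """Return the input, output and work directories for a data source,
    following the <base>/input/<source>, <base>/output/<source> and
    <base>/work/<source> convention described above."""
    return {
        "input": Path(base) / "input" / source_id,
        "output": Path(base) / "output" / source_id,
        "work": Path(base) / "work" / source_id,
    }
```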
The configuration directory should contain a list of taxonomic sources, called sources.csv
The sources file is a CSV table with information about where to get data from and how to process it.
The columns in the file include an identifier for each source, which can be passed to the `--only` option, and a type that determines how the source is processed. The recognised types are:

- `afd`: Australian Faunal Directory dump
- `nsl`: National Species Lists dump
- `additional_nsl`: Additional names in a National Species Lists dump that have not yet been placed in a taxonomy
- `ausfungi`: Old-style AusFungi DwCA
- `caab`: Codes for Australian Aquatic Biota spreadsheet
- `nzor`: New Zealand Organisms Register DwCA
- `col`: Catalogue of Life annual checklist DwCA (2019 version)
- `ala`: Atlas of Living Australia species list
- `ala_vernacular_list`: Atlas of Living Australia vernacular names list
- `github`: A species list pulled from github

Other columns supply default taxonomic status values for a source, such as `common` (vernacular names), `inferredAccepted` (accepted taxa) and `inferredSynonym` (synonyms).
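As an illustrative sketch only, a sources file might pair each source identifier with one of the types above. The column names shown here (`id`, `type`) are assumptions for illustration, not the library's actual schema:

```csv
id,type
afd,afd
apc,nsl
caab,caab
```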
Organisation information about the publisher is held in a file called `ala-metadata.csv`, placed in the config directory. This follows the CollectorySchema. The publisher information is added to the EML metadata file.
The `all.py` script does the required heavy lifting, running the processing for each selected source. For example, `./venv/bin/python ./all.py -d /data/naming -x --only afd,apc` processes only the `afd` and `apc` sources. Use `./venv/bin/python ./all.py -h` for a list of options.
Locations are derived from the Getty Research Institute's Thesaurus of Geographic Names (TGN), http://vocab.getty.edu/, under the Open Data Commons Attribution Licence (ODC-By) 1.0, https://opendatacommons.org/licenses/by/1-0/
Data from the TGN can be downloaded in XML format from http://tgndownloads.getty.edu/. These files can then be converted into a table using an XSLT script and interpreted by the location program. The downloaded XML comes in a number of parts; you may need to edit the parts to ensure consistent namespaces and process them incrementally. In shell script:
    echo "locationID,parentLocationID,name,preferredName,otherNames,iso2,iso3,currency,type,decimalLatitude,decimalLongitude" > locations.csv
    for f in TGN*.xml
    do
      echo $f
      xsltproc /path/to/name-preprocessing/data/tgn.xslt $f >> locations.csv
    done
The script assumes that you're interested in English names; edit as required. You may also have to do a little judicious editing to handle embedded quotes properly.