AtlasOfLivingAustralia / ala-name-matching

Atlas name matching API and index generation

ala-name-matching

Atlas Name matching API

This is the API used by the Atlas of Living Australia to match scientific names to taxon concepts. It borrows heavily from the excellent name parsing work done by GBIF in their scientific name parser library. This code contains additions for handling some Australia-specific issues.

Modules

Versions

Version 4.x of the library uses Lucene 8.

Generating a name match index

The name match index can be built from multiple Darwin Core Archives (DwCAs) that contain all the scientific names that you wish to add (including synonyms). A DwCA can also contain an optional vernacular name extension. The taxon DwCA must have a core row type of http://rs.tdwg.org/dwc/terms/Taxon. The vernacular name extension must have a row type of http://rs.gbif.org/terms/1.0/VernacularName.

Additional vernacular names (matched by scientific name, along with author and any other taxonomic hints that you have) can be supplied by a DwCA with a core row type of http://rs.gbif.org/terms/1.0/VernacularName. These names are matched against the taxon information supplied above.
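Before running a long index build it can be worth confirming that an archive declares the row types described above. The following is a minimal sketch, assuming the archive's meta.xml follows the standard Darwin Core text layout with a core element and zero or more extension elements. The class DwcaRowTypeCheck and the file names in the sample are hypothetical and are not part of ala-name-matching.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical pre-flight check; not part of ala-name-matching. */
final class DwcaRowTypeCheck {
    static final String TAXON = "http://rs.tdwg.org/dwc/terms/Taxon";
    static final String VERNACULAR = "http://rs.gbif.org/terms/1.0/VernacularName";

    /** Parses the text of a meta.xml file into a DOM document. */
    private static Document parse(String metaXml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(metaXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Unreadable meta.xml", e);
        }
    }

    /** True if any element with the given tag declares the given rowType. */
    private static boolean declares(String metaXml, String tag, String rowType) {
        NodeList nodes = parse(metaXml).getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            if (rowType.equals(((Element) nodes.item(i)).getAttribute("rowType"))) {
                return true;
            }
        }
        return false;
    }

    /** A taxon archive needs a core with the Taxon row type. */
    static boolean hasTaxonCore(String metaXml) {
        return declares(metaXml, "core", TAXON);
    }

    /** The vernacular extension is optional, so this is informational only. */
    static boolean hasVernacularExtension(String metaXml) {
        return declares(metaXml, "extension", VERNACULAR);
    }

    public static void main(String[] args) {
        // Sample descriptor; taxon.txt and vernacular.txt are made-up file names.
        String metaXml =
            "<archive xmlns='http://rs.tdwg.org/dwc/text/'>"
            + "<core rowType='" + TAXON + "'>"
            + "<files><location>taxon.txt</location></files></core>"
            + "<extension rowType='" + VERNACULAR + "'>"
            + "<files><location>vernacular.txt</location></files></extension>"
            + "</archive>";
        System.out.println("Taxon core: " + hasTaxonCore(metaXml));
        System.out.println("Vernacular extension: " + hasVernacularExtension(metaXml));
    }
}
```

The check only inspects the descriptor. It does not validate the data files or the terms mapped to each column.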

There is an example Catalogue of Life DwCA that can be downloaded here:

dwca-col.zip

Users can modify the col_dwc.txt file to include any additional species names.

The name matching index can also support common names. Here are the Catalogue of Life common names that can be loaded in conjunction with the Darwin Core Archive:

col_vernacular.txt.zip

The name matching supports homonym detection through the use of IRMNG. You can download the IRMNG DwCA for homonyms from the following URL:

IRMNG_DWC_HOMONYMS.zip

An assembly zip file for this can be downloaded from our Maven repository:

ala-name-matching-4.3-distribution.zip

To generate the name index from the data described above, run the index.sh command shown below. Alternatively, use the ALA Ansible scripts here with the playbook nameindexer.yml, which does it all for you.

./index.sh --all --dwca /data/names/dwca-col --target /data/lucene/testdwc-namematching --irmng /data/names/irmng/IRMNG_DWC_HOMONYMS --common /data/names/col_vernacular.txt

Be aware that name indexing can take over an hour to complete.

Generating a combined DwCA

The build process above works most effectively when given a consistent taxonomy. The taxonomy builder takes multiple taxonomies, along with a configuration that assigns priorities to the different entries in the taxonomies and merges the sources into a single, combined taxonomy.

An example command for the taxonomy builder is:

./merge.sh -c /data/names/ala-taxon-config.json -w /tmp -o /data/names/combined /data/names/APNI/DwC /data/names/AFD/DwC /data/names/CAAB/DwC

More information about the merge configuration can be found here.
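As a rough illustration of priority-based merging, the sketch below keeps, for each scientific name, the entry from the highest-priority source. This is a toy model only: the Entry class, the priority values and the tie-breaking rule are assumptions made for the example, and the real taxonomy builder is driven by the merge configuration and is likely to apply more involved rules.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of priority-based merging; not the taxonomy builder's algorithm. */
final class PriorityMergeSketch {

    /** A name as supplied by one source taxonomy. All fields are hypothetical. */
    static final class Entry {
        final String scientificName;
        final String source;
        final int priority;

        Entry(String scientificName, String source, int priority) {
            this.scientificName = scientificName;
            this.source = source;
            this.priority = priority;
        }
    }

    /** Keeps one entry per scientific name, preferring the higher priority. */
    static Map<String, Entry> merge(List<Entry> entries) {
        Map<String, Entry> combined = new LinkedHashMap<>();
        for (Entry e : entries) {
            // On a tie the entry seen first is kept.
            combined.merge(e.scientificName, e,
                    (kept, next) -> next.priority > kept.priority ? next : kept);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Priorities are invented for the example.
        List<Entry> entries = Arrays.asList(
                new Entry("Thunnus albacares", "CAAB", 50),
                new Entry("Thunnus albacares", "AFD", 100));
        Entry winner = merge(entries).get("Thunnus albacares");
        System.out.println(winner.scientificName + " taken from " + winner.source);
    }
}
```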

Build notes

This library is built with Maven. By default, mvn install runs a test suite that fails without a local installation of a name index. To skip the tests, build with mvn install -DskipTests=true.

The build creates one artefact in the ala-name-matching-distribution/target directory:

Each module contains two artefacts in the ala-name-matching/ala-name-matching-<module>/target directory:

The name index for the Australian name lists used in unit tests can be downloaded from here and needs to be extracted to the directory /data/lucene/namematching-20210811-5.

ALA Names List

The ALA sources most of its names from the National Species List (NSL), which is made up of the Australian Faunal Directory (AFD), the Australian Plant Census (APC) and the Australian Plant Name Index (APNI). These data sources are not complete.
Where the gaps are most apparent, we pad out known families with missing genera and species. The gaps are most noticeable in the Birds and Fish groups. One major risk of this approach is adding duplicate species, because AFD is missing synonym relationships.

One source we use to include missing species is the Codes for Australian Aquatic Biota (CAAB) species list.
We take all the species in CAAB that have distributions in Australian waters and add them if they do not exist in AFD.
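The CAAB rule above amounts to a filter followed by a set difference. A minimal sketch, assuming each CAAB record carries a scientific name and a flag for distribution in Australian waters; the CaabSpecies class and the placeholder names are hypothetical and do not reflect the actual CAAB data format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only; the record layout is an assumption. */
final class CaabPaddingSketch {

    /** Hypothetical minimal view of a CAAB record. */
    static final class CaabSpecies {
        final String scientificName;
        final boolean inAustralianWaters;

        CaabSpecies(String scientificName, boolean inAustralianWaters) {
            this.scientificName = scientificName;
            this.inAustralianWaters = inAustralianWaters;
        }
    }

    /** Returns CAAB names found in Australian waters that AFD does not already list. */
    static List<String> speciesToAdd(List<CaabSpecies> caab, Set<String> afdNames) {
        List<String> additions = new ArrayList<>();
        for (CaabSpecies species : caab) {
            if (species.inAustralianWaters && !afdNames.contains(species.scientificName)) {
                additions.add(species.scientificName);
            }
        }
        return additions;
    }

    public static void main(String[] args) {
        // Placeholder names, not real taxa.
        List<CaabSpecies> caab = Arrays.asList(
                new CaabSpecies("Aus bus", true),    // already in AFD
                new CaabSpecies("Aus cus", true),    // missing from AFD
                new CaabSpecies("Aus dus", false));  // not in Australian waters
        Set<String> afd = new HashSet<>(Arrays.asList("Aus bus"));
        System.out.println(speciesToAdd(caab, afd));
    }
}
```

Because the comparison is an exact string match, a CAAB name that AFD holds only under a synonym would still be added, which is the duplicate risk noted earlier in this section.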

We use AusFungi to supply all the fungi and AusMoss to supply all the mosses. These lists will eventually become part of the NSL, but until then we merge them using DwCAs supplied directly by AusMoss and AusFungi.

We use the New Zealand Organisms Register (NZOR) for New Zealand species.

We pad out the Birds and Jellyfish branches of AFD with species from the Catalogue of Life 2012 (CoL).

We also use CoL to supply the complete classification of kingdoms that are missing from the NSL. At the moment this encompasses Viruses, Chromista, Protozoa and Bacteria.

This names list is used as a backbone for the ALA species pages and to create a name matching index.

Using ALA Name Matching

ALA Name Matching is available as a library that can be used in other projects. It is published in the ALA Maven Repository (http://nexus.ala.org.au/).

To use ala-name-matching, include it as a dependency in your pom file:

<dependency>
  <groupId>au.org.ala</groupId>
  <artifactId>ala-name-matching-search</artifactId>
  <version>4.3</version>
</dependency>

If you just want the handy enums and such-like, use

<dependency>
  <groupId>au.org.ala</groupId>
  <artifactId>ala-name-matching-model</artifactId>
  <version>4.3</version>
</dependency>

If you are using Grails 3, you may encounter problems with the newer GBIF libraries having validation code that conflicts with Spring validation. You can correct this by using

compile("au.org.ala:ala-name-matching-search:4.3") {
    exclude group: 'org.slf4j', module: 'slf4j-log4j12'
    exclude group: 'org.apache.bval', module: 'org.apache.bval.bundle'
}

Download the most recently generated name matching index:

http://biocache.ala.org.au/archives/namematching/YYYMMDD/namematching-YYYMMDD.tgz

Extract this into a /data/lucene directory and create a symbolic link from namematching to the datestamped directory. In your program, create a single ALANameSearcher and use it to perform all your searches:

ALANameSearcher searcher = new ALANameSearcher("/data/lucene/namematching");
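The symbolic link step described above can also be scripted, which helps when a new index is downloaded regularly. A minimal sketch using only the Java standard library; the class IndexLinker and the dated directory name in the example are hypothetical, and the code assumes a file system that supports symbolic links.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper; not part of ala-name-matching. */
final class IndexLinker {

    /**
     * Points baseDir/linkName at baseDir/datedDirName, replacing any existing link.
     * The link target is relative, so the base directory can be moved as a whole.
     */
    static Path relink(Path baseDir, String datedDirName, String linkName) throws IOException {
        Path target = baseDir.resolve(datedDirName);
        if (!Files.isDirectory(target)) {
            throw new IOException("Index directory not found: " + target);
        }
        Path link = baseDir.resolve(linkName);
        if (Files.isSymbolicLink(link)) {
            Files.delete(link); // removes the link itself, not the old index
        }
        return Files.createSymbolicLink(link, target.getFileName());
    }

    public static void main(String[] args) throws IOException {
        // Example: java IndexLinker.java /data/lucene namematching-20210811
        Path baseDir = Paths.get(args.length > 0 ? args[0] : "/data/lucene");
        String dated = args.length > 1 ? args[1] : "namematching-20210811";
        System.out.println("Linked " + relink(baseDir, dated, "namematching"));
    }
}
```

An existing regular file or directory called namematching is deliberately left alone, in which case createSymbolicLink fails rather than overwriting it.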

The easiest way to perform a search is to have the searcher handle all the exceptional situations using the default handling:

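// Describe the name to match; further ranks can be set as taxonomic hints
LinnaeanRankClassification cl = new LinnaeanRankClassification();
cl.setScientificName("Macropus rufus");
// Retrieve just the accepted LSID, or the full accepted record
String lsid = searcher.searchForAcceptedLsidDefaultHandling(cl, true);
NameSearchResult result = searcher.searchForAcceptedRecordDefaultHandling(cl, true);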

Understanding the Name Matching Algorithm

When the name matching index is created, the scientific name is stored in several formats.

These formats allow a variety of match types to be performed.

There are 2 distinct phases in the match process.

Error Types

This section outlines the errors that can be returned as part of a MetricsResultDTO (obtained using the ALANameSearcher.searchForRecordMetrics methods).

Glossary

Example of a phrase name:
Stylidium sp. Boulder Rock (A.H. Burbidge 2536)
Genus = Stylidium
Phrase = Boulder Rock
Voucher = A.H. Burbidge 2536
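The breakdown above can be reproduced with a regular expression. This is a minimal sketch that only handles the "Genus sp. Phrase (Voucher)" shape shown in the example; the class PhraseNameSketch is hypothetical and is not the parser used by the library, which may accept other rank markers and layouts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative phrase name splitter; not the library's parser. */
final class PhraseNameSketch {

    // Capitalised genus, the rank marker "sp.", a phrase, then a voucher in parentheses.
    private static final Pattern PHRASE_NAME = Pattern.compile(
            "^(\\p{Lu}\\p{Ll}+)\\s+sp\\.\\s+(.+?)\\s+\\(([^)]+)\\)$");

    /** Returns {genus, phrase, voucher}, or null if the name is not in this form. */
    static String[] parse(String name) {
        Matcher m = PHRASE_NAME.matcher(name.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("Stylidium sp. Boulder Rock (A.H. Burbidge 2536)");
        // prints: Stylidium | Boulder Rock | A.H. Burbidge 2536
        System.out.println(String.join(" | ", parts));
    }
}
```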

Here is a link to all the biocache records that have been matched to a phrase name:

http://biocache.ala.org.au/occurrences/search?q=*:*&fq=name_match_metric:phraseMatch

Release notes

Release notes v2.4.6

Release notes v2.1

Release notes v2.0

Release notes v1.3