AtlasOfLivingAustralia / bie-index

Taxonomic search services
https://bie-ws.ala.org.au/ws

bie-index

bie-index is a Grails web application that indexes taxonomic content supplied as Darwin Core archives (DwC-A) and provides search web services for this content. This includes:

This project provides JSON web services and an interface for admin users. It does not include an HTML interface for end users. There is a set of front end components available providing the species pages listed here:

For an introduction to the approach to names within the ALA, nameology is a good place to start.

Darwin Core archive format of taxonomic information

This application currently supports the ingestion of Darwin Core archives (DwC-A) with the following mandatory Darwin Core fields in the core file:

Additional fields can be added to allow more sophisticated handling of names:

Any other field added to the core file, e.g. establishmentMeans, will also be indexed and made available for faceted searching.

An extension file of vernacular names is also supported. The format supported here aligns with the format supported by the ala-names-matching API.

Additional fields, which will allow more sophisticated handling of vernacular names, are:

An extension file of additional identifiers is also supported. The format aligns with the GBIF identifier format.

Additional fields, which will allow more sophisticated handling of identifiers are:

eml.xml

A Darwin Core Archive may contain an eml.xml metadata file, in the Ecological Metadata Language format. If available, default information is gathered from the metadata file:

Basic example meta.xml

Below is an example meta.xml that would be provided in a Darwin Core archive.

<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files>
      <location>taxon.csv</location>
    </files>
    <id index="0" />
    <field index="0" term="http://rs.tdwg.org/dwc/terms/taxonID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/parentNameUsageID"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsageID"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="4" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.gbif.org/terms/1.0/VernacularName">
    <files>
      <location>vernacular.csv</location>
    </files>
    <coreid index="0" />
    <field index="0" term="http://rs.tdwg.org/dwc/terms/taxonID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
  </extension>
</archive>
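
As a rough illustration of how such a descriptor maps CSV columns to Darwin Core terms, the following Python sketch parses a trimmed descriptor modelled on the example, using only the standard library. The function name column_terms is made up for this example and is not part of bie-index. Fields without an index attribute are skipped, since they give no column position.

```python
import xml.etree.ElementTree as ET

# Trimmed descriptor modelled on the example meta.xml (core file only).
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="," ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files>
      <location>taxon.csv</location>
    </files>
    <id index="0" />
    <field index="1" term="http://rs.tdwg.org/dwc/terms/parentNameUsageID"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="5" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
  </core>
</archive>"""

# The descriptor declares a default namespace, so lookups must be qualified.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def column_terms(meta_xml):
    """Return (data file name, {column index: short term name}) for the core file."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwc:core", NS)
    location = core.find("dwc:files/dwc:location", NS).text
    columns = {}
    for field in core.findall("dwc:field", NS):
        index = field.get("index")
        if index is None:
            continue  # no column position given, so nothing to map
        # Keep only the last path segment of the term URI, e.g. "taxonRank".
        columns[int(index)] = field.get("term").rsplit("/", 1)[-1]
    return location, columns


location, columns = column_terms(META_XML)
print(location, columns)
```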

Example archives

Integration points

In addition to indexing the content of the Darwin Core archive, the ingestion and index creation process can optionally index data from the following ALA components. It does this by harvesting JSON feeds from the listed components.

Architecture

This application makes use of the following technologies

Architecture image

Handling URLs as taxon IDs

Some taxonIDs are now URLs rather than LSIDs. When provided to the server un-encoded, everything is fine. However, if the ID is encoded, with slashes replaced by %2F, Tomcat treats this as a security risk and returns a 400 error. To allow encoded slashes in Tomcat, start the server with -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
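
To make the problem case concrete, this short Python sketch shows what a fully percent-encoded URL-style taxon ID looks like. The identifier is a made-up placeholder, not a real ALA taxon ID.

```python
from urllib.parse import quote, unquote

# Hypothetical URL-style taxon identifier, used only for illustration.
taxon_id = "https://example.org/taxon/12345"

# safe="" forces "/" and ":" to be percent-encoded as well.
encoded = quote(taxon_id, safe="")
print(encoded)

# Decoding restores the original identifier.
decoded = unquote(encoded)
print(decoded)
```

A client that embeds the encoded form in a request path is sending the %2F sequences that Tomcat rejects unless the flag above is set.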

Image scans

An image scan will search the biocache for suitable images to act as an example image for the taxon. The image scan configuration can be located in /data/bie-index/config/image-lists.json and is referenced by imageListsUrl in the configuration properties. An example image list configuration is:

{
  "boosts": [
    "record_type:Image^10",
    "record_type:HumanObservaton^20",
    "record_type:Observation^20",
    "-record_type:PreservedSpecimen^20"
  ],
  "ranks": [
    {
      "rank": "family",
      "idField": null,
      "nameField": "family"
    },
    {
      "rank": "genus",
      "idField": "genus_guid",
      "nameField": "genus"
    },
    {
      "rank": "species",
      "idField": "species_guid",
      "nameField": "taxon_name"
    }
  ],
  "lists": [
    {
      "uid": "dr4778",
      "imageId": "imageId"
    }
  ]
}

For each candidate taxon, a query is constructed that boosts certain characteristics of the occurrence in the hopes of finding something that doesn't look terribly dead. The boosts element contains the list of boosts to apply (in the example, images and observations get a boost and preserved specimens are downgraded).

The required and preferred elements (not shown in the example above) contain lists of filter queries that are applied to the search. For example, geospatial_kosher:true restricts searches to occurrences that appear to be geospatially usable.

Only certain ranks get images attached to them. The ranks element contains the taxon ranks that should have images associated with them, along with the fields in the occurrence records that allow an occurrence to be found. The idField provides a specific biocache field that will be searched for the taxon identifier. It can be left null for ranks that have no biocache index field. The nameField provides a specific biocache field that will be searched for the taxon name. Again, it can be left null for ranks with no name field. The lsid field is always searched, to see if there is an occurrence record that is specifically named as the taxon.

The lists element contains species lists that manually specify preferred heroic images for species. Each species list should have a column that supplies the image identifier (named by imageId), so that the image can be found on the image server. Multiple lists can be used, with highest priority going to the first entry. The uid holds the list identifier to load.
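
The way a ranks entry selects search fields can be sketched in Python as follows. This is only an illustration of the rules described above: the function name, the quoting of values, the OR join and the example identifiers are all assumptions, not the query syntax bie-index actually sends to the biocache.

```python
# Two entries copied from the example "ranks" element.
RANKS = [
    {"rank": "family", "idField": None, "nameField": "family"},
    {"rank": "genus", "idField": "genus_guid", "nameField": "genus"},
]


def occurrence_clauses(rank_config, taxon_id, taxon_name):
    """Build field:value clauses for one taxon from a ranks entry."""
    # The lsid field is always searched.
    clauses = ['lsid:"{}"'.format(taxon_id)]
    id_field = rank_config.get("idField")
    if id_field is not None:
        clauses.append('{}:"{}"'.format(id_field, taxon_id))
    name_field = rank_config.get("nameField")
    if name_field is not None:
        clauses.append('{}:"{}"'.format(name_field, taxon_name))
    return clauses


# Hypothetical identifiers, for illustration only.
family = occurrence_clauses(RANKS[0], "urn:example:family:1", "Sciaenidae")
genus = occurrence_clauses(RANKS[1], "urn:example:genus:2", "Atrobucca")
print(" OR ".join(family))
print(" OR ".join(genus))
```

Because the family entry has a null idField, it yields only the lsid and name clauses, while the genus entry yields all three.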

Vernacular Name Lists

Vernacular names can be drawn from species lists on the list server. The vernacular name configuration can be located in /data/bie-index/config/vernacular-lists.json and referenced by vernacularListsUrl in the configuration properties.

An example vernacular name list configuration is

{
  "defaultVernacularNameField": "name",
  "defaultNameIdField": "nameID",
  "defaultKingdomField": "kingdom",
  "defaultPhylumField": "phylum",
  "defaultClassField": "class",
  "defaultOrderField": "order",
  "defaultFamilyField": "family",
  "defaultRankField": "rank",
  "defaultStatusField": "status",
  "defaultLanguageField": "language",
  "defaultsourceField": "source",
  "defaultTemporalField": "temporal",
  "defaultLocationIdField": "locationID",
  "defaultLocalityField": "locality",
  "defaultCountryCodeField": "countryCode",
  "defaultSexField": "sex",
  "defaultLifeStageField": "lifeStage",
  "defaultIsPluralField": "isPlural",
  "defaultIsPreferredField": "isPreferred",
  "defaultOrganismPartField": "organismPart",
  "defaultLabelsField": "labels",
  "defaultTaxonRemarksField": "taxonRemarks",
  "defaultProvenanceField": "provenance",
  "defaultStatus": "common",
  "defaultLanguage": "en",  
  "lists": [
    {
      "uid": "drt1464664375273",
      "taxonRemarksField": "Notes",
      "defaultLanguage": "xul"
      "defaultStatus": "traditionalKnowledge"

    },
    {
      "uid": "drt1464664375274",
      "defaultLanguage": "fr",
      "statusField": "priority"
    }
  ]
}

The default entries provide useful defaults for things like the list fields that hold various pieces of information. These can be overridden at the list level. The various fields refer to the fields that can be part of the GBIF vernacular names extension. The defaultLanguage and defaultStatus entries provide per-list defaults for language and status entries. Languages should be ISO-639 two- or three-letter codes or AIATSIS codes; the bie-plugin can expand these out. The uid holds the list identifier to load.
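
The override behaviour described above can be sketched in Python. The helper names field_for and default_for are invented for this example, and the rule that a list-level xField entry overrides the top-level defaultXField entry is inferred from the example configuration rather than taken from the bie-index source.

```python
# Abbreviated copy of the example configuration.
CONFIG = {
    "defaultVernacularNameField": "name",
    "defaultStatusField": "status",
    "defaultTaxonRemarksField": "taxonRemarks",
    "defaultStatus": "common",
    "defaultLanguage": "en",
    "lists": [
        {"uid": "drt1464664375273", "taxonRemarksField": "Notes",
         "defaultLanguage": "xul", "defaultStatus": "traditionalKnowledge"},
        {"uid": "drt1464664375274", "defaultLanguage": "fr",
         "statusField": "priority"},
    ],
}


def field_for(config, list_config, key):
    """Column name for key (e.g. "statusField"), preferring the list-level entry."""
    default_key = "default" + key[0].upper() + key[1:]
    return list_config.get(key, config.get(default_key))


def default_for(config, list_config, key):
    """Default value for key (e.g. "defaultLanguage"), preferring the list-level entry."""
    return list_config.get(key, config.get(key))


first, second = CONFIG["lists"]
print(field_for(CONFIG, first, "taxonRemarksField"))
print(field_for(CONFIG, second, "statusField"))
print(default_for(CONFIG, first, "defaultLanguage"))
print(default_for(CONFIG, second, "defaultStatus"))
```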

Avoid using vernacularName or commonName as the vernacular name field. The list server treats these in a special way, causing problems when attempting to retrieve the names.

Deprecated Names

Names with a status of deprecated appear last in lists of names and will not be used as the "headline" vernacular name. They are generally names that are now offensive or doubtful.
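
A minimal Python sketch of this ordering rule follows. The names are placeholders and the function names are invented for the example. The sketch ignores any other ordering criteria the index may apply, such as language or priority.

```python
# Placeholder names, for illustration only.
NAMES = [
    {"name": "Name A", "status": "deprecated"},
    {"name": "Name B", "status": "common"},
    {"name": "Name C", "status": "deprecated"},
    {"name": "Name D", "status": "common"},
]


def order_names(names):
    """Stable sort that moves deprecated names to the end of the list."""
    return sorted(names, key=lambda entry: entry.get("status") == "deprecated")


def headline_name(names):
    """First non-deprecated name, or None when every name is deprecated."""
    for entry in order_names(names):
        if entry.get("status") != "deprecated":
            return entry["name"]
    return None


ordered = [entry["name"] for entry in order_names(NAMES)]
headline = headline_name(NAMES)
print(ordered, headline)
```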

Conservation Status Lists

Conservation status information can also be drawn from species lists. The conservation configuration can be located in /data/bie-index/config/conservation-lists.json and referenced by conservationListsUrl in the configuration properties.

An example conservation status list configuration is

{
  "defaultSourceField": "status",
  "defaultKingdomField": "kingdom",
  "lists": [
    {
      "uid": "dr656",
      "field": "conservationStatusAUS_s",
      "term": "conservationStatusAUS",
      "label": "AUS"
    },
    {
      "uid": "dr655",
      "field": "conservationStatusVIC_s",
      "term": "conservationStatusVIC",
      "label": "VIC",
      "sourceField": "statusName",
      "kingdomField": "kgm"
    }
  ]
}

The uid supplies the list identifier. The field supplies the Solr field which will be used to store the conservation status. The term supplies the name of the status field. label gives the label to apply to the conservation status. sourceField gives the name of the field that contains the conservation status. kingdomField gives the name of the field that contains the kingdom, which is handy for name lookups, if available.

To use all species lists that are recorded as both authoritative and threatened, set lists to an empty array. These lists must have a status column indicating the conservation status.
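
As a sketch of how one row from a conservation list might be read under this configuration, the Python below resolves sourceField and kingdomField against the top-level defaults and reports the Solr field that would hold the status. The function name, the result layout and the example rows are assumptions made for illustration.

```python
# Copy of the example configuration.
CONFIG = {
    "defaultSourceField": "status",
    "defaultKingdomField": "kingdom",
    "lists": [
        {"uid": "dr656", "field": "conservationStatusAUS_s",
         "term": "conservationStatusAUS", "label": "AUS"},
        {"uid": "dr655", "field": "conservationStatusVIC_s",
         "term": "conservationStatusVIC", "label": "VIC",
         "sourceField": "statusName", "kingdomField": "kgm"},
    ],
}


def read_status(config, list_config, row):
    """Extract the conservation status and kingdom from one species-list row."""
    source = list_config.get("sourceField", config["defaultSourceField"])
    kingdom = list_config.get("kingdomField", config["defaultKingdomField"])
    return {
        "solrField": list_config["field"],
        "status": row.get(source),
        "kingdom": row.get(kingdom),  # None when the list has no kingdom column
    }


# Hypothetical rows, for illustration only.
aus = read_status(CONFIG, CONFIG["lists"][0], {"status": "Vulnerable"})
vic = read_status(CONFIG, CONFIG["lists"][1], {"statusName": "Endangered", "kgm": "Animalia"})
print(aus)
print(vic)
```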

Weighting Rules

Calculating weights for search and autosuggest operations gets rather complicated, so score-calculated weights for search operations are built into each document during the import process.

Anything with an idxtype field is annotated with weights.

The weighting rules come from a configuration file, which defaults to default-weights.json and which can be set by import.weightConfigUrl in the configuration. An example set of weighting rules is

{
  "script": "nashorn",
  "global": {
    "rules": [
      {
        "term": "taxonomicStatus",
        "exists": true,
        "rules": [
          {
            "value": "accepted",
            "weight": 2.0
          },
          {
            "value": "misapplied",
            "weight": 0.5
          }
        ]
      }
    ]
  },
  "weights": [
    {
      "field": "searchWeight",
      "rules": []
    },
    {
      "field": "suggestWeight",
      "rules": [
        {
          "term": "scientificName",
          "exists": true,
          "condition": "_value.length() > 4",
          "weightExpression": "_weight * 1.0 / (1.0 + Math.log(_value.length() * 0.01 + 1.0))",
          "comment": "The longer the name, the less it should be suggested. Mean name length is 16"
        }
      ]
    }
  ]
}

The top level contains the following fields:

Rules consist of the following entries:

As an example, the above rules, applied to the input [idxtype:'TAXON', taxonomicStatus:'misapplied', scientificName:'Atrobucca brevis'] and with a start value of 1.0 would give.
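
The arithmetic for this input can be followed in the Python sketch below, which hard-codes the two rules from the example rather than interpreting the JSON. It assumes that a matching rule multiplies the running weight, that global rules run before the per-field rules, and that _weight and _value stand for the running weight and the value of the term. These semantics are assumptions, not confirmed against the bie-index code.

```python
import math


def apply_status_rule(weight, doc):
    """Global rule: multiply by the weight attached to the taxonomicStatus value."""
    multipliers = {"accepted": 2.0, "misapplied": 0.5}
    status = doc.get("taxonomicStatus")
    if status is None:  # "exists": true, so the rule only fires when the term is present
        return weight
    return weight * multipliers.get(status, 1.0)


def apply_suggest_rule(weight, doc):
    """suggestWeight rule: damp the weight as the scientific name gets longer."""
    name = doc.get("scientificName")
    if name is None or not len(name) > 4:  # mirrors the "condition" entry
        return weight
    # Math.log in the weightExpression is the natural logarithm, as is math.log here.
    return weight * 1.0 / (1.0 + math.log(len(name) * 0.01 + 1.0))


doc = {
    "idxtype": "TAXON",
    "taxonomicStatus": "misapplied",
    "scientificName": "Atrobucca brevis",
}
start = 1.0
search_weight = apply_status_rule(start, doc)
suggest_weight = apply_suggest_rule(search_weight, doc)
print(search_weight, round(suggest_weight, 4))
```

Under these assumptions the misapplied status halves the start value, giving a searchWeight of 0.5, and the 16-character name reduces the suggestWeight further to roughly 0.435.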

Favourites

The favourites function allows lists from the lists tool to be used to mark taxa or common names as having a "favourite" status. The favourite status is a term, such as preferred or iconic that can be used to mark entries for faceting and weight calculation.

The favourites configuration comes from a configuration file, which defaults to default-favourites.json and which can be set by import.favouritesConfigUrl in the configuration. An example favourites configuration is:

{
  "defaultTerm": "favourite",
  "lists": [
    {
      "uid": "dr4778",
      "termField": "favourite",
      "defaultTerm": "interest"
    },
    {
      "uid": "dr781",
      "defaultTerm": "iconic"
    }
  ]
}

The top level contains the following entries:

Each list can contain

Favourites only mark selected taxa and their associated common names with favourite terms. Once marked, it is up to the bie-plugin or the weighting rules to make use of these terms.
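
One plausible reading of the example configuration is sketched in Python below: a list's termField names a column that may carry the favourite term for a row, with the list's defaultTerm and then the top-level defaultTerm as fallbacks. This precedence, the function name and the example rows are assumptions made for illustration.

```python
# Copy of the example configuration.
CONFIG = {
    "defaultTerm": "favourite",
    "lists": [
        {"uid": "dr4778", "termField": "favourite", "defaultTerm": "interest"},
        {"uid": "dr781", "defaultTerm": "iconic"},
    ],
}


def favourite_term(config, list_config, row):
    """Pick the favourite term for one species-list row."""
    term_field = list_config.get("termField")
    if term_field and row.get(term_field):
        return row[term_field]  # the row supplies its own term
    # Otherwise fall back to the list default, then the global default.
    return list_config.get("defaultTerm", config.get("defaultTerm"))


first, second = CONFIG["lists"]
# Hypothetical rows, for illustration only.
print(favourite_term(CONFIG, first, {"favourite": "preferred"}))
print(favourite_term(CONFIG, first, {}))
print(favourite_term(CONFIG, second, {}))
```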