bie-index is a Grails web application that indexes taxonomic content in DwC-A and provides search web services for this content. This includes:
This project provides JSON web services and an interface for admin users. It does not include an HTML interface for end users. There is a set of front end components available providing the species pages listed here:
For an introduction to the approach to names within the ALA, nameology is a good place to start.
This application currently supports the ingestion of Darwin Core archives (DwC-A) with the following mandatory Darwin Core fields in the core file:
Additional fields can be added, which will allow more sophisticated handling of names:
<span class="...">
elements. If not explicitly present, it is constructed from the information available. See Name Formatting for details.
Additional fields added to the core file, e.g. establishmentMeans or any other field, will also be indexed and available for faceted searching.
An extension file of vernacular names is also supported. The supported format aligns with the format used by the ala-names-matching API.
Additional fields, which will allow more sophisticated handling of vernacular names, are:
An extension file of additional identifiers is also supported. The format aligns with the GBIF identifier format.
Additional fields, which will allow more sophisticated handling of identifiers, are:
A Darwin Core Archive may contain an eml.xml
metadata file, in the Ecological Metadata Language format.
If available, default information is gathered from the metadata file:
eml/dataset/title
Below is an example meta.xml that would be provided in a Darwin Core archive.
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
<core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
<files>
<location>taxon.csv</location>
</files>
<id index="0" />
<field term="http://rs.tdwg.org/dwc/terms/taxonID"/>
<field index="1" term="http://rs.tdwg.org/dwc/terms/parentNameUsageID"/>
<field index="2" term="http://rs.tdwg.org/dwc/terms/acceptedNameUsageID"/>
<field index="3" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
<field index="4" term="http://rs.tdwg.org/dwc/terms/scientificNameAuthorship"/>
<field index="5" term="http://rs.tdwg.org/dwc/terms/taxonRank"/>
</core>
<extension encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\n" fieldsEnclosedBy="&quot;" ignoreHeaderLines="1" rowType="http://rs.gbif.org/terms/1.0/VernacularName">
<files>
<location>vernacular.csv</location>
</files>
<coreid index="0" />
<field term="http://rs.tdwg.org/dwc/terms/taxonID"/>
<field index="1" term="http://rs.tdwg.org/dwc/terms/vernacularName"/>
</extension>
</archive>
In addition to indexing the content of the Darwin Core archive, ingestion and index creation can optionally index data from the following ALA components. It does this by harvesting JSON feeds from the listed components.
This application makes use of the following technologies
Some taxonIDs are now URLs, rather than LSIDs.
When provided to the server un-encoded, everything is fine.
However, if a taxonID is URL-encoded, with slashes replaced by %2F,
Tomcat treats the request as a security error and
returns a 400 error.
To allow encoded slashes in Tomcat, start the server with -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true
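To see what an encoded taxonID looks like on the wire, the following minimal sketch percent-encodes a URL-style identifier with the Python standard library. The identifier is a made-up example, not a real ALA taxonID.

```python
from urllib.parse import quote, unquote

# Hypothetical URL-style taxonID, used only for illustration.
taxon_id = "https://id.example.org/taxon/12345"

# safe="" forces every reserved character, including "/", to be encoded.
encoded = quote(taxon_id, safe="")
print(encoded)

# Each "/" has become %2F, which is the form Tomcat rejects by default.
print("%2F" in encoded)

# Decoding restores the original identifier.
print(unquote(encoded) == taxon_id)
```

A client that places an identifier like this in a URL path segment will trigger the 400 error unless the server has been started with the flag above.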
An image scan will search the biocache for suitable images to act as an example image for the taxon.
The image scan configuration can be located in /data/bie-index/config/image-lists.json
and
referenced by imageListsUrl
in the configuration properties.
An example image list configuration is
{
"boosts": [
"record_type:Image^10",
"record_type:HumanObservation^20",
"record_type:Observation^20",
"-record_type:PreservedSpecimen^20"
],
"ranks": [
{
"rank": "family",
"idField": null,
"nameField": "family"
},
{
"rank": "genus",
"idField": "genus_guid",
"nameField": "genus"
},
{
"rank": "species",
"idField": "species_guid",
"nameField": "taxon_name"
}
],
"lists": [
{
"uid": "dr4778",
"imageId": "imageId"
}
]
}
For each candidate taxon, a query is constructed that boosts certain characteristics of the occurrence in the
hopes of finding something that doesn't look terribly dead. The boosts
element contains the list of boosts to apply
(in the example, observations and images get a boost, preserved specimens are downgraded and images from dr130 are
preferred).
The required
and preferred
elements contain lists of filter queries that are applied to the search.
For example, geospatial_kosher:true
restricts searches to occurrences that
appear to be geospatially usable.
Only certain ranks get images attached to them.
The ranks
element contains taxon ranks that should have images associated with them,
along with the fields in the occurrence records that allow an occurrence to be found.
The idField
provides a specific biocache field that will be searched for the taxon identifier.
It can be left null for ranks that have no biocache index fields.
The nameField
provides a specific biocache field that will be searched for the taxon name.
Again, it can be left null for ranks with no name.
The lsid
field is always searched, to see if there is an occurrence record that is specifically
named as the taxon.
The lists
element contains species lists that manually specify preferred heroic images for species.
The species lists should have a column that supplies the image identifier, named by imageId, so that the image
can be found on the image server. Multiple lists can be used, with highest priority going to the first entry.
The uid
holds the list identifier to load.
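As a rough illustration of how the ranks and boosts entries could combine into a search, here is a minimal sketch. The helper build_image_query and the way clauses are joined with OR are assumptions made for this example; they are not taken from the bie-index code.

```python
import json

# Abbreviated copy of the example image list configuration.
CONFIG = json.loads("""
{
  "boosts": ["record_type:Image^10", "-record_type:PreservedSpecimen^20"],
  "ranks": [
    {"rank": "family", "idField": null, "nameField": "family"},
    {"rank": "species", "idField": "species_guid", "nameField": "taxon_name"}
  ]
}
""")


def build_image_query(config, rank, taxon_id, name):
    """Return (query, boosts) for a taxon, or None if the rank gets no images.

    Hypothetical helper: the clause layout is an assumption.
    """
    rank_config = next((r for r in config["ranks"] if r["rank"] == rank), None)
    if rank_config is None:
        return None
    # The lsid field is always searched.
    clauses = [f'lsid:"{taxon_id}"']
    # idField and nameField are optional and may be null.
    if rank_config.get("idField"):
        clauses.append(f'{rank_config["idField"]}:"{taxon_id}"')
    if rank_config.get("nameField"):
        clauses.append(f'{rank_config["nameField"]}:"{name}"')
    return " OR ".join(clauses), list(config["boosts"])


print(build_image_query(CONFIG, "family", "urn:x", "Sciaenidae"))
print(build_image_query(CONFIG, "kingdom", "urn:y", "Animalia"))
```

A rank that is missing from ranks, such as kingdom here, produces no query at all, which matches the rule that only certain ranks get images attached.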
Vernacular names can be drawn from species lists on the list server.
The vernacular name configuration can be located in /data/bie-index/config/vernacular-lists.json
and
referenced by vernacularListsUrl
in the configuration properties.
An example vernacular name list configuration is
{
"defaultVernacularNameField": "name",
"defaultNameIdField": "nameID",
"defaultKingdomField": "kingdom",
"defaultPhylumField": "phylum",
"defaultClassField": "class",
"defaultOrderField": "order",
"defaultFamilyField": "family",
"defaultRankField": "rank",
"defaultStatusField": "status",
"defaultLanguageField": "language",
"defaultSourceField": "source",
"defaultTemporalField": "temporal",
"defaultLocationIdField": "locationID",
"defaultLocalityField": "locality",
"defaultCountryCodeField": "countryCode",
"defaultSexField": "sex",
"defaultLifeStageField": "lifeStage",
"defaultIsPluralField": "isPlural",
"defaultIsPreferredField": "isPreferred",
"defaultOrganismPartField": "organismPart",
"defaultLabelsField": "labels",
"defaultTaxonRemarksField": "taxonRemarks",
"defaultProvenanceField": "provenance",
"defaultStatus": "common",
"defaultLanguage": "en",
"lists": [
{
"uid": "drt1464664375273",
"taxonRemarksField": "Notes",
"defaultLanguage": "xul",
"defaultStatus": "traditionalKnowledge"
},
{
"uid": "drt1464664375274",
"defaultLanguage": "fr",
"statusField": "priority"
}
]
}
The default entries provide useful defaults for things like the list fields that hold various pieces of information.
These can be overridden at the list level.
The various fields refer to the fields that can be part of the GBIF vernacular names extension.
The defaultLanguage
and defaultStatus
entries provide per-list defaults for language and status entries.
Languages should be ISO-639 two- or three-letter codes or AIATSIS codes; the bie-plugin can expand these out.
The uid
holds the list identifier to load.
Avoid using vernacularName or commonName as the vernacular name field. The list server treats these in a special way, causing problems when attempting to retrieve the names.
Names with a status of deprecated
appear last in lists of names and will not be used as the "headline"
vernacular name.
They are generally names that are now offensive or doubtful.
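The relationship between the top-level defaults and the per-list overrides can be sketched as a simple merge. This is a minimal illustration, assuming that a list-level key such as statusField replaces the top-level defaultStatusField; resolve_list_settings is a hypothetical helper, not part of bie-index.

```python
def resolve_list_settings(config, list_entry):
    """Merge top-level defaults with the overrides from one list entry."""
    settings = {}
    for key, value in config.items():
        if key == "lists":
            continue
        if key.startswith("default") and key.endswith("Field"):
            # defaultStatusField -> statusField
            name = key[len("default"):]
            settings[name[0].lower() + name[1:]] = value
        else:
            # defaultLanguage and defaultStatus keep their names.
            settings[key] = value
    # Anything given on the list itself wins over the defaults.
    settings.update(list_entry)
    return settings


# Abbreviated version of the example configuration.
config = {
    "defaultVernacularNameField": "name",
    "defaultStatusField": "status",
    "defaultStatus": "common",
    "defaultLanguage": "en",
    "lists": [
        {"uid": "drt1464664375274", "defaultLanguage": "fr", "statusField": "priority"}
    ],
}

resolved = resolve_list_settings(config, config["lists"][0])
print(resolved["statusField"], resolved["defaultLanguage"], resolved["vernacularNameField"])
```

For the second list in the example, the status is read from the priority column and the language defaults to fr, while the vernacular name still comes from the name column.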
Conservation status information can also be drawn from species lists.
The conservation configuration can be located in /data/bie-index/config/conservation-lists.json
and
referenced by conservationListsUrl
in the configuration properties.
An example conservation status list configuration is
{
"defaultSourceField": "status",
"defaultKingdomField": "kingdom",
"lists": [
{
"uid": "dr656",
"field": "conservationStatusAUS_s",
"term": "conservationStatusAUS",
"label": "AUS"
},
{
"uid": "dr655",
"field": "conservationStatusVIC_s",
"term": "conservationStatusVIC",
"label": "VIC",
"sourceField": "statusName",
"kingdomField": "kgm"
}
]
}
The uid
supplies the list identifier.
The field
supplies the solr field which will be used to store the conservation status.
The term
supplies the name of the status field.
label
gives the label to apply to the conservation status.
sourceField
gives the name of the field that contains the conservation status.
kingdomField
gives the name of the field that contains the kingdom -- handy for name lookups, if available.
To use all species lists that are recorded as both authoritative and threatened, set lists
to an empty array. These
lists must have a status
column indicating the conservation status.
Calculating weights for search and autosuggest operations gets rather complicated, so weights for search operations are calculated and built into each document during the import process.
Anything with an idxtype field is annotated with weights.
The weighting rules come from a configuration file, which defaults to
default-weights.json and which can be
set by import.weightConfigUrl
in the configuration.
An example set of weighting rules is
{
"script": "nashorn",
"global": {
"rules": [
{
"term": "taxonomicStatus",
"exists": true,
"rules": [
{
"value": "accepted",
"weight": 2.0
},
{
"value": "misapplied",
"weight": 0.5
}
]
}
]
},
"weights": [
{
"field": "searchWeight",
"rules": []
},
{
"field": "suggestWeight",
"rules": [
{
"term": "scientificName",
"exists": true,
"condition": "_value.length() > 4",
"weightExpression": "_weight * 1.0 / (1.0 + Math.log(_value.length() * 0.01 + 1.0))",
"comment": "The longer the name, the less it should be suggested. Mean name length is 16"
}
]
}
]
}
The top level contains the following fields:
Rules consist of the following entries:
kingdom == 'Plantae'
).
If there is a term supplied then the value is supplied to the script as _value
.
If the value is a list, then any matching term will trigger the rule.
_weight
so that you can do clever things with the weight value.
As an example, the above rules, applied to the input
[idxtype:'TAXON', taxonomicStatus:'misapplied', scientificName:'Atrobucca brevis']
and with a start value of 1.0 would give.
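The example input can be traced by hand with a short sketch. It rests on two assumptions that are not stated in this document: that a matching weight entry multiplies the current weight, and that the global rules are applied before the rules of each weight field. The weightExpression is transcribed into Python rather than evaluated by a script engine, and the helper names are hypothetical.

```python
import math


def apply_value_rules(doc, rule, weight):
    """Apply a term rule with nested value rules; assumes weights multiply."""
    value = doc.get(rule["term"])
    if value is None:
        return weight
    # A list value triggers a rule if any element matches.
    values = value if isinstance(value, list) else [value]
    for sub in rule.get("rules", []):
        if sub["value"] in values:
            weight *= sub["weight"]
    return weight


def suggest_expression(weight, value):
    """Python transcription of the example weightExpression."""
    return weight * 1.0 / (1.0 + math.log(len(value) * 0.01 + 1.0))


GLOBAL_RULE = {
    "term": "taxonomicStatus",
    "exists": True,
    "rules": [
        {"value": "accepted", "weight": 2.0},
        {"value": "misapplied", "weight": 0.5},
    ],
}

doc = {
    "idxtype": "TAXON",
    "taxonomicStatus": "misapplied",
    "scientificName": "Atrobucca brevis",
}

# Global rules first, starting from 1.0.
search_weight = apply_value_rules(doc, GLOBAL_RULE, 1.0)

# Then the suggestWeight rule, guarded by its condition.
name = doc["scientificName"]
suggest_weight = search_weight
if len(name) > 4:
    suggest_weight = suggest_expression(search_weight, name)

print(search_weight, suggest_weight)
```

Under these assumptions the misapplied status halves the starting weight, and the 16-character name reduces the suggest weight a little further, to roughly 0.435.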
The favourites function allows lists from the lists tool to be used to mark taxa or
common names as having a "favourite" status.
The favourite status is a term, such as preferred
or iconic
that can be used to
mark entries for faceting and weight calculation.
The favourites configuration comes from a configuration file, which defaults to default-favourites.json and which
can be set by import.favouritesConfigUrl
in the configuration.
An example favourites configuration is:
{
"defaultTerm": "favourite",
"lists": [
{
"uid": "dr4778",
"termField": "favourite",
"defaultTerm": "interest"
},
{
"uid": "dr781",
"defaultTerm": "iconic"
}
]
}
The top level contains the following entries:
Each list can contain
Favourites only mark selected taxa and their associated common names with favourite terms. Once marked, it is up to the bie-plugin or weighting rules to make use of these terms.
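One plausible reading of the example configuration is that a term is taken from the list's termField column when a record has one, then from the list's defaultTerm, then from the top-level defaultTerm. The sketch below encodes that assumed precedence; favourite_term is a hypothetical helper and the sample records are invented for illustration.

```python
def favourite_term(config, list_entry, record):
    """Pick the favourite term for one list record (assumed precedence)."""
    term_field = list_entry.get("termField")
    # A value in the list's term column takes priority.
    if term_field and record.get(term_field):
        return record[term_field]
    # Otherwise fall back to the list default, then the global default.
    return list_entry.get("defaultTerm", config.get("defaultTerm"))


config = {
    "defaultTerm": "favourite",
    "lists": [
        {"uid": "dr4778", "termField": "favourite", "defaultTerm": "interest"},
        {"uid": "dr781", "defaultTerm": "iconic"},
    ],
}

first, second = config["lists"]
print(favourite_term(config, first, {"favourite": "preferred"}))
print(favourite_term(config, first, {}))
print(favourite_term(config, second, {}))
print(favourite_term(config, {"uid": "dr999"}, {}))
```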