**IMPORTANT**

This project is archived and will not be developed further. Use `GNames <https://github.com/gnames/gnames>`_ and `GNverifier <https://github.com/gnames/gnverifier>`_ for the same functionality.

.. image:: https://api.codacy.com/project/badge/Grade/2ad5b690419b4f869add628d7b022a17
   :target: https://www.codacy.com/app/alexander-myltsev/gnindex?utm_source=github.com&utm_medium=referral&utm_content=GlobalNamesArchitecture/gnindex&utm_campaign=Badge_Grade

.. image:: https://coveralls.io/repos/github/GlobalNamesArchitecture/gnindex/badge.svg
   :target: https://coveralls.io/github/GlobalNamesArchitecture/gnindex

gnindex
=======

Brief Info
----------

gnindex is a search engine over a collection of strings (combinations of characters) that have been used as names for organisms. gnindex contains many examples of names spelled in slightly different ways. It indexes occurrences of biological scientific names in the world, and normalizes and reconciles lexical and taxonomic variants of the name strings.

gnindex handles two types of queries over the indexed name strings: resolution and faceted search.

Faceted search accepts parts of a name string (canonical form, author word(s), genus word(s), species word(s), year), possibly with wildcards, or a logical combination of such parts. In response it returns all results that satisfy the given request.

Resolution accepts a set of whole name strings together with resolution parameters, such as the IDs of the data sources to match against (the ``dataSourceIds`` argument used in the examples below).

Installation
------------

gnindex is dockerized:

.. code:: bash

   docker-compose up

``docker-compose`` pulls all necessary Docker containers (the Postgres database engine and the gnindex ``matcher``, ``nameResolver``, ``facetedSearcher``, ``api`` and ``front`` services), initializes them, and listens for HTTP connections on port 80 (the API paths are described below).
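
As a quick sanity check (not part of the project), a small Ruby script can poll the GraphiQL path until the stack answers; using ``localhost`` assumes that docker-compose publishes port 80 on the local machine:

.. code:: ruby

   require "net/http"
   require "uri"

   # Assumption: docker-compose publishes port 80 locally and /api serves GraphiQL.
   uri = URI("http://localhost/api")

   10.times do
     begin
       response = Net::HTTP.get_response(uri)
       if response.is_a?(Net::HTTPSuccess)
         puts "gnindex is up (HTTP #{response.code})"
         break
       end
     rescue SystemCallError
       # Connection refused while the containers are still starting.
     end
     sleep 3
   end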

Using the API
-------------

gnindex provides a GraphQL interface to its API. It accepts GraphQL requests at the ``/api/graphql`` URL as a POST request with a JSON body of the following format:

.. code:: json

   {
     "query": "BODY OF QUERY",
     "variables": "SUPPLIED VARIABLES",
     "operation": "SELECTED OPERATION"
   }

The gnindex API exposes GraphQL query fields such as ``nameResolver``, which is used in the examples below.

Detailed documentation on input parameters and output types can also be found in the schema documentation fetched from the API and rendered by GraphiQL at ``/api``.

A simple GraphQL request might be:

.. code:: graphql

   query {
     nameResolver(names: [{value: "Homo sapiens"}], dataSourceIds: [1]) {
       results {
         name {
           id
           value
         }
       }
     }
   }

It returns the following result:

.. code:: json

   {
     "data": {
       "nameResolver": [
         {
           "results": [
             {
               "name": {
                 "id": "7db4f8a2-aafe-56b6-8838-89522c67d9f0",
                 "value": "Homo sapiens Linnaeus, 1758"
               }
             }
           ]
         }
       ]
     }
   }

Alternatively, the request can be composed with cURL:

.. code:: bash

   curl 'http://index.globalnames.org/api/graphql' \
     -H 'Content-Type: application/json' \
     -H 'Accept: application/json' \
     --data-binary '{"query":"query {\n nameResolver(names: [{value: \"Homo sapiens\"}], dataSourceIds: [1]) {\n results {\n name {\n id\n value\n }\n }\n }\n}\n","variables":null}'

Fortunately, it is not necessary to encode the GraphQL query string and nest it into the parent JSON by hand. Many languages provide `GraphQL clients <http://graphql.org/code/#graphql-clients>`_ that do all the dirty work. Consider a simple Ruby client that requests the gnindex API:

.. code:: ruby

   require "graphql/client"
   require "graphql/client/http"

   RESOLVER_URL = "http://index-api.globalnames.org/api/graphql"

   # Transport, schema and client objects for the graphql-client gem.
   HTTP = GraphQL::Client::HTTP.new(RESOLVER_URL)
   Schema = GraphQL::Client.load_schema(HTTP)
   Client = GraphQL::Client.new(schema: Schema, execute: HTTP)

   NameResolverQuery = Client.parse <<~GRAPHQL_QUERY
     query($names: [name!]!, $dataSourceIds: [Int!]) {
       nameResolver(names: $names, dataSourceIds: $dataSourceIds) {
         total
         suppliedId
         suppliedInput
         results {
           name { value }
           canonicalName { value }
           synonym
           matchType { kind score editDistance }
           taxonId
           classification { pathRanks }
           score { value parsingQuality }
         }
       }
     }
   GRAPHQL_QUERY

   # Build the GraphQL variables from a list of { id:, name: } hashes.
   def variables(names)
     {
       dataSourceIds: [1],
       names: names.map { |name| { value: name[:name], suppliedId: name[:id] } }
     }
   end

   names = [{ id: "1", name: "Homo sapiens" }, { id: "2", name: "Phallomedusa solida" }]
   res = Client.query(NameResolverQuery, variables: variables(names))
   p res.data

Project Structure
-----------------

The project consists of 6 subprojects:

common
~~~~~~

Contains Thrift data structures that are shared between microservices.

matcher
~~~~~~~

This Thrift microservice expects canonical names of type ``Seq[String]`` and data source IDs of type ``Seq[Int]``. It fuzzy-matches them against all known canonical names and their stems (obtained by Latin stemming) using the Levenshtein algorithm with an edit distance of 1.

It returns a list of lists of UUIDs of the fuzzily matched names: one list per provided canonical name. Note that it returns only UUIDs, as it has no connection to the database.
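
The following Ruby sketch illustrates the idea only; the real matcher is a Scala/Thrift service, and the stemming function and data below are placeholders, not the actual implementation:

.. code:: ruby

   # Conceptual illustration of the matcher's behaviour; not the actual service code.

   # Plain dynamic-programming Levenshtein distance.
   def levenshtein(a, b)
     prev = (0..b.size).to_a
     a.each_char.with_index(1) do |ca, i|
       cur = [i]
       b.each_char.with_index(1) do |cb, j|
         cur << [prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca == cb ? 0 : 1)].min
       end
       prev = cur
     end
     prev.last
   end

   # Placeholder for the Latin stemming used by the real matcher.
   def stem(canonical)
     canonical.downcase.split.map { |w| w.sub(/(?:us|a|um|is|e)\z/, "") }.join(" ")
   end

   # Toy index: known canonical names mapped to made-up UUIDs.
   KNOWN = {
     "Homo sapiens"        => "uuid-homo-sapiens",
     "Phallomedusa solida" => "uuid-phallomedusa-solida"
   }.freeze

   # One list of matched UUIDs per supplied canonical name.
   def match(canonicals)
     canonicals.map do |name|
       KNOWN.select { |candidate, _| levenshtein(stem(name), stem(candidate)) <= 1 }.values
     end
   end

   p match(["Homo sapien", "Felis catus"]) # => [["uuid-homo-sapiens"], []]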

nameResolver
~~~~~~~~~~~~

This Thrift microservice expects complex requests of type ``thrift.nameresolver.Request``. For every
provided name request it passes through the following stages:

1. Two UUIDv5 values are computed: the first for the provided name string, the second for the parsed
   canonical name (see the sketch after this list). The microservice then tries to find records
   matching those two UUIDs.
2. If nothing is found by exact match and the name is not parsable, an empty result is returned.
3. If the provided name is parsable, the canonical form of the name is sent to the ``matcher``
   microservice. Final results are formed from the database records matched by the returned UUIDs.
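
The UUIDs used here are version 5 (SHA1-based) UUIDs computed from the name strings. The sketch below shows the computation in plain Ruby; deriving the namespace from ``globalnames.org`` is an assumption based on other GlobalNames tooling, not something this repository specifies:

.. code:: ruby

   require "digest/sha1"

   # Standard RFC 4122 DNS namespace.
   DNS_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

   # Compute a version 5 (SHA1-based) UUID from a namespace UUID and a name.
   def uuid_v5(namespace_uuid, name)
     ns_bytes = [namespace_uuid.delete("-")].pack("H*")
     bytes = Digest::SHA1.digest(ns_bytes + name).byteslice(0, 16).bytes
     bytes[6] = (bytes[6] & 0x0f) | 0x50 # set version to 5
     bytes[8] = (bytes[8] & 0x3f) | 0x80 # set RFC 4122 variant
     hex = bytes.pack("C*").unpack1("H*")
     [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
   end

   # Assumption: the GlobalNames namespace is derived from "globalnames.org".
   gn_namespace = uuid_v5(DNS_NAMESPACE, "globalnames.org")

   # If that assumption holds, this should reproduce the `id` values returned
   # by nameResolver for the corresponding name strings.
   puts uuid_v5(gn_namespace, "Homo sapiens Linnaeus, 1758")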

facetedSearcher
~~~~~~~~~~~~~~~

Performs faceted search.

api
~~~

The ``api`` microservice is connected with the ``nameResolver`` and ``facetedSearcher`` microservices.
It provides a ``GraphQL`` interface to the user. ``GraphQL`` requests are then translated into
microservice requests.

Development
-----------

Database schema creation
~~~~~~~~~~~~~~~~~~~~~~~~

The database is created and migrated using the ActiveRecord gem from the Ruby on Rails package. The schema is kept in SQL format, rather than Ruby, because it includes configuration of external PostgreSQL packages.

To create the schema at ``./db_migration/db/structure.sql``, run ``bundle exec rake db:migration``.
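
The SQL schema format corresponds to ActiveRecord's ``:sql`` setting. The sketch below shows how a standalone ActiveRecord setup might select it; the connection settings are assumptions, and the actual ``db_migration`` project may wire this differently:

.. code:: ruby

   require "active_record"

   # Dump db/structure.sql (SQL format) instead of db/schema.rb (Ruby format),
   # so PostgreSQL-specific extension configuration survives the dump.
   ActiveRecord::Base.schema_format = :sql

   # Assumed connection settings; the real project reads them from its own config.
   ActiveRecord::Base.establish_connection(
     adapter:  "postgresql",
     host:     "localhost",
     database: "gnindex"
   )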

Testing
~~~~~~~

TODO